Compare commits
No commits in common. "fc688ddde422bdba2b53a806188b82235c502433" and "decec67b78020abd5d7c5bf22788ee02c04aad2d" have entirely different histories.
fc688ddde4
...
decec67b78
128
CODE_OF_CONDUCT.md
Normal file
@@ -0,0 +1,128 @@
# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our
community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
  and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
  overall community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or
  advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
  address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.

Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
.
All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the
reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series
of actions.

**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within
the community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.

Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see the FAQ at
https://www.contributor-covenant.org/faq. Translations are available at
https://www.contributor-covenant.org/translations.
199
README.md
Normal file
@@ -0,0 +1,199 @@
<div align="center">

</div>

<div align="center">

[](https://github.com/jingyaogong/minimind/stargazers)
[](LICENSE)
[](https://github.com/jingyaogong/minimind/commits/master)
[](https://github.com/jingyaogong/minimind/pulls)
[](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)

</div>

# 📌 Data Overview

## Ⅰ Tokenizer

The tokenizer maps words from natural language to numbers such as `0, 1, 36` through a "dictionary"; you can think of each number as the page on which that word appears in the dictionary.
You can build your own vocabulary and train such a "dictionary" yourself; see `./scripts/train_tokenizer.py` (for learning purposes only — unless you really need to, there is no point retraining it, since MiniMind already ships with a tokenizer).
Alternatively, you can use the tokenizer of a well-known open-source LLM.
Just as using the Xinhua or Oxford dictionary directly gives excellent token compression, the downside is that such dictionaries are huge, easily containing hundreds of thousands of words and phrases;
a self-trained tokenizer lets you control the vocabulary size and content freely, but its compression ratio is low (for example, "hello" might be split into the five separate tokens
"h e l l o") and rare words are hard to cover.
The choice of "dictionary" certainly matters: an LLM's output is essentially an N-way classification over the vocabulary via SoftMax, which is then decoded back into natural language through the "dictionary".
Because MiniMind's size must be kept strictly under control, and to avoid a top-heavy model (the token-embedding layer taking up too large a share of the LLM's parameters), the shorter the vocabulary, the better.

<details style="color:rgb(128,128,128)">
<summary>About the tokenizer</summary>

The vocabulary sizes of tokenizers from strong third-party open-source models such as Yi, qwen, chatglm, mistral, and Llama3 are as follows:

<table>
<tr><th>Tokenizer model</th><th>Vocabulary size</th><th>Source</th></tr>
<tr><td>yi tokenizer</td><td>64,000</td><td>01.AI (China)</td></tr>
<tr><td>qwen2 tokenizer</td><td>151,643</td><td>Alibaba Cloud (China)</td></tr>
<tr><td>glm tokenizer</td><td>151,329</td><td>Zhipu AI (China)</td></tr>
<tr><td>mistral tokenizer</td><td>32,000</td><td>Mistral AI (France)</td></tr>
<tr><td>llama3 tokenizer</td><td>128,000</td><td>Meta (USA)</td></tr>
<tr><td>minimind tokenizer</td><td>6,400</td><td>Custom</td></tr>
</table>

> 👉 Update 2024-09-17: to avoid ambiguity with past versions and to keep the model small, all MiniMind models now use minimind_tokenizer; all mistral_tokenizer versions are deprecated.

```
# A few notes to self
> Although minimind_tokenizer has a small vocabulary and its encoding/decoding efficiency is weaker than Chinese-friendly tokenizers such as qwen2 and glm,
> the MiniMind models use the self-trained minimind_tokenizer to keep the overall parameter count light and avoid an imbalance between the embedding layer and the compute layers, since MiniMind's vocabulary size is only 6,400.
> In practice, minimind has never failed to decode rare words in testing; it works well.
> Because the custom vocabulary is compressed to 6,400 entries, the total LLM parameter count can be as low as 25.8M.
> The training data `tokenizer_train.jsonl` comes entirely from the "Jiangshu large-model dataset" (匠数大模型数据集). This part of the data is relatively minor; feel free to pick your own if you want to train a tokenizer.
```

</details>
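
As a concrete illustration of the mapping described above, here is a minimal sketch that loads the bundled tokenizer with Hugging Face `transformers` and round-trips a sentence through token ids. The path `./model/minimind_tokenizer` follows the repository layout; the sample sentence is arbitrary.

```python
# Minimal sketch: encode/decode with the bundled MiniMind tokenizer.
# Assumes it is run from the repository root so ./model/minimind_tokenizer exists.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('./model/minimind_tokenizer')

text = "如何才能摆脱拖延症?"
ids = tokenizer.encode(text)      # natural language -> "page numbers" in the dictionary
print(ids)                        # a short list of ints drawn from the 6,400-entry vocabulary
print(tokenizer.decode(ids))      # ids -> back to natural language
```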

## Ⅱ Pretrain data

After the lesson of MiniMind-V1, where low-quality pretraining data caused the model to produce gibberish, we decided after `2025-02-05` to stop pretraining on large-scale unsupervised datasets.
Instead, the Chinese portion of the [Jiangshu large-model dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data) was extracted,
cleaned down to samples shorter than 512 characters, and concatenated into roughly 1.6GB of pretraining data, `pretrain_hq.jsonl`. "hq" stands for high
quality (it is not truly "high" yet, of course — improving data quality never ends).

The format of `pretrain_hq.jsonl` is:

```bash
{"text": "如何才能摆脱拖延症? 治愈拖延症并不容易,但以下建议可能有所帮助..."}
```
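
A minimal sketch of how such a JSONL corpus can be consumed, assuming one JSON object per line with a `text` field as shown above (the file name and length threshold are taken from this section):

```python
import json

def load_pretrain_texts(path='./dataset/pretrain_hq.jsonl', max_chars=512):
    """Yield raw text samples from a {"text": ...} JSONL corpus."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            text = json.loads(line)['text']
            if len(text) < max_chars:   # the corpus was already cleaned to <512 characters
                yield text

# Example: count samples without loading the whole file into memory.
print(sum(1 for _ in load_pretrain_texts()))
```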

## Ⅲ SFT data

The [Jiangshu large-model SFT dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data)
"is a complete, uniformly formatted, and safe resource for LLM training and research.
It collects and organizes a large number of open-source datasets from public web sources, unifies their format, and cleans the data;
it contains a Chinese dataset with 10M samples and an English dataset with 2M samples."
That is the official description. The downloaded files amount to roughly 4B tokens in total, which certainly makes it suitable as SFT data for a Chinese LLM.
However, the official data format is messy, and using all of it for SFT would be too expensive.
I re-cleaned the official dataset, removing entries polluted with stray symbols and noise, and again kept only content with a total length `<512`.
The goal at this stage is to supplement knowledge missing from pretraining through a large volume of dialogue.
The exported file is `sft_512.jsonl` (~7.5GB).

The [Magpie-SFT dataset](https://www.modelscope.cn/organization/Magpie-Align)
collects ~1M high-quality conversations generated by Qwen2/2.5. I cleaned this data further and exported the portion with total length `<2048` as `sft_2048.jsonl` (~9GB)
and the portion with length `<1024` as `sft_1024.jsonl` (~5.5GB). Doing SFT directly on conversations produced by a large model falls under "black-box distillation".

The SFT data from the two steps above was cleaned once more (keeping only content with a high proportion of Chinese characters) and filtered to conversations with length `<512`, yielding `sft_mini_512.jsonl` (~1.2GB).

All SFT files `sft_X.jsonl` share the following format:

```text
{
    "conversations": [
        {"role": "user", "content": "你好"},
        {"role": "assistant", "content": "你好!"},
        {"role": "user", "content": "再见"},
        {"role": "assistant", "content": "再见!"}
    ]
}
```
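
A small sketch of how one record in this format might be flattened into a single training string. The role tags below are purely illustrative — they are not the actual template MiniMind trains with — and the file name just picks one of the SFT files listed above.

```python
import json

def conversation_to_text(record):
    """Join a {"conversations": [...]} record into one string with illustrative role tags."""
    parts = []
    for turn in record["conversations"]:
        parts.append(f'<{turn["role"]}>{turn["content"]}</{turn["role"]}>')
    return "".join(parts)

# Inspect the first record of one SFT file.
with open('./dataset/sft_mini_512.jsonl', 'r', encoding='utf-8') as f:
    first = json.loads(f.readline())
print(conversation_to_text(first))
```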

## Ⅳ RLHF data

From the [Magpie-DPO dataset](https://www.modelscope.cn/datasets/Magpie-Align/MagpieLM-DPO-Data-v0.1):
roughly 200k preference pairs (all in English) generated by Llama3.1-70B/8B. They can be used to train a reward model and improve response quality so that the model better matches human preferences.
Here, the content with total length `<3000` was repackaged as `dpo.jsonl` (~0.9GB), containing the two fields `chosen` and `rejected`,
where `chosen` is the preferred reply and `rejected` is the rejected one.

The format of `dpo.jsonl` is:

```text
{
    "chosen": [
        {"content": "Q", "role": "user"},
        {"content": "good answer", "role": "assistant"}
    ],
    "rejected": [
        {"content": "Q", "role": "user"},
        {"content": "bad answer", "role": "assistant"}
    ]
}
```
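
A minimal sketch of iterating over these preference pairs, for example to feed a DPO-style trainer. Only the two fields documented above are assumed, with a single shared user turn per record.

```python
import json

def iter_preference_pairs(path='./dataset/dpo.jsonl'):
    """Yield (prompt, chosen_reply, rejected_reply) tuples from dpo.jsonl."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            prompt = record["chosen"][0]["content"]        # the shared user turn
            chosen = record["chosen"][-1]["content"]       # preferred assistant reply
            rejected = record["rejected"][-1]["content"]   # rejected assistant reply
            yield prompt, chosen, rejected

# Peek at the first pair.
prompt, good, bad = next(iter_preference_pairs())
print(prompt, good, bad, sep="\n")
```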

## Ⅴ Reasoning data

It has to be said that in February 2025 nothing was hotter than DeepSeek...
It also sparked my strong interest in RL-guided reasoning models; I have already reproduced R1-Zero with Qwen2.5.
If time allows and the results work out (though with 99% likelihood the base model is simply not capable enough), I will later update MiniMind with an RL-trained reasoning model rather than a distilled one.
Time being limited, the fastest low-cost option is still direct (black-box) distillation.
R1 is simply too popular: within a few days several R1 distillation datasets appeared, such as [R1-Llama-70B](https://www.modelscope.cn/datasets/Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B), [R1-Distill-SFT](https://www.modelscope.cn/datasets/AI-ModelScope/R1-Distill-SFT),
[Alpaca-Distill-R1](https://huggingface.co/datasets/shareAI/Alpaca-Distill-R1-ZH), and
[deepseek_r1_zh](https://huggingface.co/datasets/jinliuxi/deepseek_r1_zh); purely Chinese data is probably still scarce.
They were finally merged and exported as `r1_mix_1024.jsonl`, whose format is identical to `sft_X.jsonl`.

## Ⅵ More datasets

[HqWu-HITCS/Awesome-Chinese-LLM](https://github.com/HqWu-HITCS/Awesome-Chinese-LLM)
is already collecting and curating open-source models, applications, datasets, and tutorials related to Chinese LLMs, and keeps tracking the latest progress in this area. Comprehensive and professional — respect!

---

## Ⅷ Dataset downloads

> [!NOTE]
> Since 2025-02-05, all datasets used for MiniMind's final training runs are open-sourced, so there is no need to preprocess large-scale datasets yourself, which avoids repetitive data-processing work.

MiniMind training datasets ([ModelScope](https://www.modelscope.cn/datasets/gongjy/minimind-dataset/files) | [HuggingFace](https://huggingface.co/datasets/jingyaogong))

> There is no need to clone everything; download only the files you need.

Place the downloaded dataset files under the `./dataset/` directory (✨ marks the recommended, required ones):

```bash
./dataset/
├── dpo.jsonl (909MB)
├── lora_identity.jsonl (22.8KB)
├── lora_medical.jsonl (34MB)
├── pretrain_hq.jsonl (1.6GB, ✨)
├── r1_mix_1024.jsonl (340MB)
├── sft_1024.jsonl (5.6GB)
├── sft_2048.jsonl (9GB)
├── sft_512.jsonl (7.5GB)
├── sft_mini_512.jsonl (1.2GB, ✨)
└── tokenizer_train.jsonl (1GB)
```
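
A hedged sketch of fetching individual files programmatically with `huggingface_hub`. The repository id used below is an assumption inferred from the HuggingFace link above and may differ from the actual dataset id; check the dataset page before running.

```python
# Sketch only: repo_id is a hypothetical placeholder, verify it on the Hub first.
from huggingface_hub import hf_hub_download

for name in ["pretrain_hq.jsonl", "sft_mini_512.jsonl"]:   # the two ✨ recommended files
    hf_hub_download(
        repo_id="jingyaogong/minimind_dataset",  # assumption, not confirmed by this document
        filename=name,
        repo_type="dataset",
        local_dir="./dataset",
    )
```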

<details style="color:rgb(128,128,128)">
<summary>Note: dataset descriptions</summary>

* `dpo.jsonl` -- dataset for the RLHF stage
* `lora_identity.jsonl` -- self-identity dataset (e.g. "Who are you?" "I am minimind..."), recommended for LoRA training (can also be used for full-parameter SFT; don't be limited by the name)
* `lora_medical.jsonl` -- medical Q&A dataset, recommended for LoRA training (can also be used for full-parameter SFT; don't be limited by the name)
* `pretrain_hq.jsonl`✨ -- pretraining dataset, compiled from the Jiangshu Technology data
* `r1_mix_1024.jsonl` -- DeepSeek-R1-1.5B distillation data, each sample at most 1024 characters (so set max_seq_len=1024 for training)
* `sft_1024.jsonl` -- compiled from Qwen2.5 distillation data (a subset of sft_2048), each sample at most 1024 characters (so set max_seq_len=1024 for training)
* `sft_2048.jsonl` -- compiled from Qwen2.5 distillation data, each sample at most 2048 characters (so set max_seq_len=2048 for training)
* `sft_512.jsonl` -- compiled from the Jiangshu Technology SFT data, each sample at most 512 characters (so set max_seq_len=512 for training)
* `sft_mini_512.jsonl`✨ -- minimal mix of Jiangshu SFT data + Qwen2.5 distillation data (for quickly training a Zero model), each sample at most 512 characters (so set max_seq_len=512 for training)
* `tokenizer_train.jsonl` -- entirely from the "Jiangshu large-model dataset"; relatively minor (retraining the tokenizer is not recommended, for the reasons above), but feel free to pick your own data if you do want to train one

</details>
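
To sanity-check the length bounds quoted above before choosing `max_seq_len`, a small scan like the following can be used; the field names follow the formats documented earlier in this section (adjust for `dpo.jsonl`, which uses `chosen`/`rejected` instead).

```python
import json

def max_sample_chars(path):
    """Return the longest sample length (in characters) found in a MiniMind JSONL file."""
    longest = 0
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            if "text" in record:                        # pretrain_hq.jsonl style
                length = len(record["text"])
            else:                                       # sft_*.jsonl / r1_mix_1024.jsonl style
                length = sum(len(t["content"]) for t in record["conversations"])
            longest = max(longest, length)
    return longest

print(max_sample_chars('./dataset/sft_mini_512.jsonl'))  # expected to stay below 512
```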

<details style="color:rgb(128,128,128)">
<summary>Notes & recommended training plans</summary>

* The MiniMind2 series was trained on roughly 20GB of corpus in total, about 4B tokens, i.e. the full data combination above (cost: 💰💰💰💰💰💰💰💰, result: 😊😊😊😊😊😊)

* To get a Zero model from scratch as fast as possible, use the `pretrain_hq.jsonl` + `sft_mini_512.jsonl` combination; see the tables below for concrete cost and results (cost: 💰, result: 😊😊)

* Users with some compute resources, or who care more about the final quality, should consider the former to fully reproduce MiniMind2; those with a single GPU, or who want a quick reproduction in a short time, are strongly advised to take the latter;

* [Middle ground] You can also freely combine medium-scale data such as `sft_mini_512.jsonl` and `sft_1024.jsonl` (cost: 💰💰💰, result: 😊😊😊😊).

</details>

@@ -1,126 +0,0 @@
# Distributed training with Accelerate + DeepSpeed

This document describes how to train the MiniMind model in a distributed setup using Accelerate and DeepSpeed.

## Environment setup

First, make sure the required dependencies are installed:

```bash
pip install accelerate deepspeed
```

## Configuration files

### 1. DeepSpeed configuration (ds_config.json)

The DeepSpeed configuration file defines the optimizer, learning-rate scheduler, and ZeRO optimization parameters. The main settings are:

- **ZeRO optimization**: ZeRO-2 is used, which reduces GPU memory usage
- **Optimizer**: AdamW
- **Mixed-precision training**: both FP16 and BF16 are supported
- **Gradient accumulation**: set to "auto" so it stays consistent with the training-script argument

### 2. Accelerate configuration (accelerate_config.yaml)

The Accelerate configuration file defines the basic distributed-training settings, including:

- **Distributed type**: DeepSpeed
- **Mixed precision**: BF16
- **Number of processes**: 4 (adjust to the number of GPUs)
- **DeepSpeed config**: points to the ds_config.json file

## Training script

The new training script `train_pretrain_accelerate.py` is adapted from the original `train_pretrain.py`. The main changes (a minimal usage sketch follows this list) are:

1. The Accelerator replaces PyTorch's native distributed utilities
2. The torchrun-specific distributed initialization code was removed
3. The model, optimizer, and data loader are prepared through the Accelerator API
4. Backpropagation and gradient clipping go through the Accelerator API
5. The positional-encoding and unused-parameter issues are handled
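
For orientation, here is a minimal, generic sketch of the Accelerator calls the list above refers to. It is not the actual `train_pretrain_accelerate.py`; the model and dataset are toy stand-ins so the skeleton is self-contained.

```python
# Generic Accelerate training-loop skeleton (a sketch, not the project's actual script).
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the launch configuration (e.g. accelerate_config.yaml)

# Toy stand-ins for the real model and dataset, just to make the skeleton runnable.
model = nn.Linear(16, 4)
optimizer = optim.AdamW(model.parameters(), lr=2e-4)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))
train_loader = DataLoader(dataset, batch_size=8)

# Step 3 above: let Accelerate wrap everything for the chosen backend.
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
loss_fct = nn.CrossEntropyLoss()

for X, Y in train_loader:                 # batches arrive on the right device already
    loss = loss_fct(model(X), Y)
    accelerator.backward(loss)            # step 4: replaces scaler.scale(loss).backward()
    accelerator.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```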

## Launching training

There are two ways to launch training:

### Method 1: use a pre-written accelerate configuration file

```bash
accelerate launch --config_file accelerate_config.yaml train_pretrain_accelerate.py \
    --epochs 3 \
    --batch_size 24 \
    --learning_rate 2e-4 \
    --dtype bfloat16 \
    --accumulation_steps 32 \
    --grad_clip 1.0 \
    --log_interval 100 \
    --save_interval 10000 \
    --dim 1024 \
    --n_layers 32 \
    --max_seq_len 1024 \
    --use_flash_attn \
    --profile \
    --profile_interval 10
```

### Method 2: configure accelerate directly on the command line

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
    --multi_gpu \
    --num_processes=4 \
    --mixed_precision=bf16 \
    --main_process_port=29500 \
    --deepspeed_config_file ds_config.json \
    train_pretrain_accelerate.py \
    --epochs 3 \
    --batch_size 24 \
    --learning_rate 2e-4 \
    --dtype bfloat16 \
    --accumulation_steps 32 \
    --grad_clip 1.0 \
    --log_interval 100 \
    --save_interval 10000 \
    --dim 1024 \
    --n_layers 32 \
    --max_seq_len 1024 \
    --use_flash_attn \
    --profile \
    --profile_interval 10
```

You can also simply run the provided script:

```bash
bash run_accelerate.sh
```

## How the Accelerate and DeepSpeed configurations relate

1. **Accelerate** is a high-level API that simplifies setting up and launching distributed training; it can be used with several distributed backends (DeepSpeed, FSDP, and so on).

2. **DeepSpeed** is an optimization library focused on memory efficiency and performance for large-scale model training, providing features such as ZeRO optimization.

3. **How they fit together**:
   - The Accelerate configuration file (YAML) decides which distributed backend is used and the basic distributed settings
   - The DeepSpeed configuration file (JSON) defines DeepSpeed-specific optimization parameters
   - Accelerate references the DeepSpeed configuration file through the `deepspeed_config_file` parameter

## Notes

1. **Positional-encoding handling**:
   - In the model, `pos_cis` is a complex tensor that needs special handling in distributed training
   - The new training script handles this through the Accelerator API, so `_ddp_params_and_buffers_to_ignore` is no longer needed

2. **Unused parameters**:
   - The original code used `find_unused_parameters=True` to deal with unused parameters
   - The new training script relies on the Accelerator API, which handles this automatically

3. **Mixed-precision training**:
   - `fp16` and `bf16` in the DeepSpeed configuration file are set to `"auto"`
   - The precision actually used is determined by Accelerate's `--mixed_precision` argument

4. **Gradient accumulation**:
   - `gradient_accumulation_steps` in the DeepSpeed configuration file is set to `"auto"`
   - The actual number of accumulation steps is determined by the training script's `--accumulation_steps` argument
1509
README_en.md
Normal file
@@ -1,17 +0,0 @@
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: ds_config.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
@@ -1,49 +0,0 @@
{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "steps_per_print": 100,
    "wall_clock_breakdown": false
}
BIN
images/1-wiki.png
Normal file
After Width: | Height: | Size: 136 KiB |
BIN
images/2-wiki.png
Normal file
After Width: | Height: | Size: 73 KiB |
BIN
images/3-wiki.png
Normal file
After Width: | Height: | Size: 230 KiB |
BIN
images/4-wiki.png
Normal file
After Width: | Height: | Size: 104 KiB |
BIN
images/5-wiki.png
Normal file
After Width: | Height: | Size: 239 KiB |
BIN
images/LLM-structure-moe.png
Normal file
After Width: | Height: | Size: 121 KiB |
BIN
images/LLM-structure.png
Executable file
After Width: | Height: | Size: 372 KiB |
BIN
images/and_huggingface.png
Normal file
After Width: | Height: | Size: 178 KiB |
BIN
images/and_modelscope.png
Normal file
After Width: | Height: | Size: 150 KiB |
BIN
images/compare_radar.png
Normal file
After Width: | Height: | Size: 519 KiB |
BIN
images/dataset.jpg
Normal file
After Width: | Height: | Size: 146 KiB |
BIN
images/gpt3_config.png
Normal file
After Width: | Height: | Size: 66 KiB |
BIN
images/logo.png
Normal file
After Width: | Height: | Size: 495 KiB |
BIN
images/logo2.png
Normal file
After Width: | Height: | Size: 615 KiB |
BIN
images/minimind2.gif
Normal file
After Width: | Height: | Size: 3.8 MiB |
BIN
images/pre_512_loss.png
Normal file
After Width: | Height: | Size: 559 KiB |
BIN
images/pre_768_loss.png
Normal file
After Width: | Height: | Size: 531 KiB |
BIN
images/sft_512_loss.png
Normal file
After Width: | Height: | Size: 1006 KiB |
BIN
images/sft_768_loss.png
Normal file
After Width: | Height: | Size: 943 KiB |
@@ -36,9 +36,6 @@ class LMConfig(PretrainedConfig):
        aux_loss_alpha: float = 0.1,
        seq_aux: bool = True,
        norm_topk_prob: bool = True,
        ####################################################
        knowlwdge_num: int = 64*64,
        knowlwdge_length: int = 8,
        **kwargs,
    ):
        self.dim = dim
@@ -69,7 +66,4 @@ class LMConfig(PretrainedConfig):
        self.aux_loss_alpha = aux_loss_alpha  # 辅助损失的alpha参数
        self.seq_aux = seq_aux  # 是否在序列级别上计算辅助损失
        self.norm_topk_prob = norm_topk_prob  # 是否标准化top-k概率
        ####################################################
        self.knowlwdge_num = knowlwdge_num
        self.knowlwdge_length = knowlwdge_length
        super().__init__(**kwargs)
@@ -10,7 +10,7 @@ from sklearn.model_selection import train_test_split
import os
import ast

os.environ["TOKENIZERS_PARALLELISM"] = "true"
os.environ["TOKENIZERS_PARALLELISM"] = "false"


class PretrainDataset(Dataset):
162
model/model.py
@ -31,7 +31,7 @@ class RMSNorm(torch.nn.Module):
|
||||
def forward(self, x):
|
||||
return self.weight * self._norm(x.float()).type_as(x)
|
||||
|
||||
# precompute_pos_cis 函数用于预计算位置编码(复数版本)。
|
||||
# precompute_pos_cis 函数用于预计算位置编码。
|
||||
def precompute_pos_cis(dim: int, end: int = int(32 * 1024), theta: float = 1e6):
|
||||
freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
|
||||
t = torch.arange(end, device=freqs.device) # type: ignore
|
||||
@ -39,7 +39,7 @@ def precompute_pos_cis(dim: int, end: int = int(32 * 1024), theta: float = 1e6):
|
||||
pos_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64
|
||||
return pos_cis
|
||||
|
||||
# apply_rotary_emb 函数用于应用旋转位置编码(复数版本)。
|
||||
# apply_rotary_emb 函数用于应用旋转位置编码。
|
||||
def apply_rotary_emb(xq, xk, pos_cis):
|
||||
def unite_shape(pos_cis, x):
|
||||
ndim = x.ndim
|
||||
@ -55,92 +55,6 @@ def apply_rotary_emb(xq, xk, pos_cis):
|
||||
xk_out = torch.view_as_real(xk_ * pos_cis).flatten(3)
|
||||
return xq_out.type_as(xq), xk_out.type_as(xk)
|
||||
|
||||
# precompute_pos_cis_real 函数用于预计算位置编码(实数版本)。
|
||||
def precompute_pos_cis_real(dim: int, end: int = int(32 * 1024), theta: float = 1e6):
|
||||
"""使用实数张量实现位置编码,避免使用复数张量
|
||||
|
||||
这个函数与precompute_pos_cis完全等价,但使用实数张量而非复数张量。
|
||||
原始函数生成形状为[seq_len, dim//2]的复数张量,其中实部全为1,虚部为旋转角度。
|
||||
这个函数生成形状为[seq_len, dim]的实数张量,其中偶数索引是cos(角度),奇数索引是sin(角度)。
|
||||
"""
|
||||
# 确保dim是偶数
|
||||
if dim % 2 != 0:
|
||||
raise ValueError(f"维度必须是偶数,但得到了 {dim}")
|
||||
|
||||
# 复制原始函数的频率计算逻辑
|
||||
freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
|
||||
t = torch.arange(end, device=freqs.device)
|
||||
freqs = torch.outer(t, freqs).float()
|
||||
|
||||
# 计算cos和sin值
|
||||
# 在复数版本中,pos_cis = torch.polar(torch.ones_like(freqs), freqs)
|
||||
# 等价于 cos(freqs) + i*sin(freqs)
|
||||
cos = torch.cos(freqs)
|
||||
sin = torch.sin(freqs)
|
||||
|
||||
# 创建实数张量,交错排列cos和sin
|
||||
pos_emb = torch.zeros((end, dim), device=freqs.device)
|
||||
pos_emb[:, 0::2] = cos # 偶数索引放cos
|
||||
pos_emb[:, 1::2] = sin # 奇数索引放sin
|
||||
|
||||
return pos_emb
|
||||
|
||||
# apply_rotary_emb_real 函数用于应用旋转位置编码(实数版本)。
|
||||
def apply_rotary_emb_real(xq, xk, pos_emb):
|
||||
"""使用实数张量实现旋转位置编码,避免使用复数张量
|
||||
|
||||
这个函数与apply_rotary_emb完全等价,但使用实数张量而非复数张量。
|
||||
原始函数将输入张量转换为复数形式,与位置编码相乘,然后再转回实数形式。
|
||||
这个函数直接使用实数运算实现相同的旋转操作。
|
||||
"""
|
||||
# 获取形状信息
|
||||
bsz, seq_len, n_heads, head_dim = xq.shape
|
||||
|
||||
# 确保pos_emb形状正确
|
||||
assert pos_emb.shape[0] >= seq_len, f"位置编码长度 {pos_emb.shape[0]} 小于序列长度 {seq_len}"
|
||||
assert pos_emb.shape[1] == head_dim, f"位置编码维度 {pos_emb.shape[1]} 与头维度 {head_dim} 不匹配"
|
||||
|
||||
# 截取需要的位置编码长度
|
||||
pos_emb = pos_emb[:seq_len]
|
||||
|
||||
# 将pos_emb调整为广播形状 [1, seq_len, 1, head_dim]
|
||||
pos_emb = pos_emb.unsqueeze(0).unsqueeze(2)
|
||||
|
||||
# 将head_dim分成两半
|
||||
half_head_dim = head_dim // 2
|
||||
|
||||
# 提取cos和sin值(偶数索引是cos,奇数索引是sin)
|
||||
cos = pos_emb[..., 0::2]
|
||||
sin = pos_emb[..., 1::2]
|
||||
|
||||
# 将xq和xk重新排列,以便进行旋转操作
|
||||
# 原始复数版本中,xq和xk被重塑为复数张量,其中实部和虚部交错排列
|
||||
# 在实数版本中,我们需要将偶数索引和奇数索引分开处理
|
||||
|
||||
# 分离偶数和奇数索引
|
||||
xq_even = xq[..., 0::2] # 偶数索引,对应复数的实部
|
||||
xq_odd = xq[..., 1::2] # 奇数索引,对应复数的虚部
|
||||
xk_even = xk[..., 0::2]
|
||||
xk_odd = xk[..., 1::2]
|
||||
|
||||
# 应用旋转(等价于复数乘法)
|
||||
# (a + bi)(cos + sin*i) = (a*cos - b*sin) + (a*sin + b*cos)i
|
||||
# 其中a是偶数索引,b是奇数索引
|
||||
xq_out_even = xq_even * cos - xq_odd * sin # 新的偶数索引(实部)
|
||||
xq_out_odd = xq_even * sin + xq_odd * cos # 新的奇数索引(虚部)
|
||||
xk_out_even = xk_even * cos - xk_odd * sin
|
||||
xk_out_odd = xk_even * sin + xk_odd * cos
|
||||
|
||||
# 重新组合偶数和奇数索引
|
||||
xq_out = torch.zeros_like(xq)
|
||||
xk_out = torch.zeros_like(xk)
|
||||
xq_out[..., 0::2] = xq_out_even
|
||||
xq_out[..., 1::2] = xq_out_odd
|
||||
xk_out[..., 0::2] = xk_out_even
|
||||
xk_out[..., 1::2] = xk_out_odd
|
||||
|
||||
return xq_out.type_as(xq), xk_out.type_as(xk)
|
||||
|
||||
# repeat_kv 函数用于重复键值对。
|
||||
def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
|
||||
"""torch.repeat_interleave(x, dim=2, repeats=n_rep)"""
|
||||
@ -179,6 +93,8 @@ class Attention(nn.Module):
|
||||
def forward(self,
|
||||
x: torch.Tensor,
|
||||
pos_cis: torch.Tensor,
|
||||
past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
|
||||
use_cache=True,
|
||||
db_value=None):
|
||||
bsz, seq_len, _ = x.shape #bsz: 批量大小, seq_len: 序列长度, _: 隐藏维度
|
||||
xq, xk, xv = self.wq(x), self.wk(x), self.wv(x) #将输入张量x分别通过线性层wq, wk, wv进行变换,得到查询、键和值。
|
||||
@ -186,13 +102,13 @@ class Attention(nn.Module):
|
||||
xk = xk.view(bsz, seq_len, self.n_local_kv_heads, self.head_dim) #将变换后的张量xk重塑为形状为(bsz, seq_len, n_local_kv_heads, head_dim)的形状。
|
||||
xv = xv.view(bsz, seq_len, self.n_local_kv_heads, self.head_dim) #将变换后的张量xv重塑为形状为(bsz, seq_len, n_local_kv_heads, head_dim)的形状。
|
||||
|
||||
# 应用旋转位置编码(使用实数版本)
|
||||
xq, xk = apply_rotary_emb_real(xq, xk, pos_cis)
|
||||
# kv_cache实现 REMOVED
|
||||
# if past_key_value is not None:
|
||||
# xk = torch.cat([past_key_value[0], xk], dim=1)
|
||||
# xv = torch.cat([past_key_value[1], xv], dim=1)
|
||||
# past_kv = (xk, xv) if use_cache else None
|
||||
# 应用旋转位置编码
|
||||
xq, xk = apply_rotary_emb(xq, xk, pos_cis)
|
||||
# kv_cache实现
|
||||
if past_key_value is not None:
|
||||
xk = torch.cat([past_key_value[0], xk], dim=1)
|
||||
xv = torch.cat([past_key_value[1], xv], dim=1)
|
||||
past_kv = (xk, xv) if use_cache else None
|
||||
|
||||
# 重复键值对
|
||||
xq, xk, xv = (
|
||||
@ -245,7 +161,7 @@ class Attention(nn.Module):
|
||||
|
||||
output = output.transpose(1, 2).reshape(bsz, seq_len, -1)
|
||||
output = self.resid_dropout(self.wo(output))
|
||||
return output
|
||||
return output, past_kv
|
||||
|
||||
|
||||
|
||||
@ -457,7 +373,7 @@ class MiniMindBlock(nn.Module):
|
||||
# self.product_key_topk = min(16, self.num_keys) # 确保不超过num_keys
|
||||
# self.num_experts_per_head_topk = 1 # 最终每个头选取的专家数
|
||||
|
||||
def forward(self, x, db_value, pos_cis):
|
||||
def forward(self, x, db_value, pos_cis, past_key_value=None, use_cache=True):
|
||||
# import pdb;pdb.set_trace()
|
||||
# db_value = None
|
||||
|
||||
@ -502,9 +418,11 @@ class MiniMindBlock(nn.Module):
|
||||
|
||||
|
||||
# 注意力计算
|
||||
h_attn = self.attention(
|
||||
h_attn, past_kv = self.attention(
|
||||
self.attention_norm(x),
|
||||
pos_cis,
|
||||
past_key_value=past_key_value,
|
||||
use_cache=use_cache,
|
||||
db_value=db_value
|
||||
)
|
||||
|
||||
@ -515,7 +433,7 @@ class MiniMindBlock(nn.Module):
|
||||
|
||||
# 前馈神经网络
|
||||
out = h + self.feed_forward(self.ffn_norm(h))
|
||||
return out
|
||||
return out, past_kv
|
||||
|
||||
class ExtractDB(nn.Module):
|
||||
def __init__(self,params):
|
||||
@ -524,15 +442,15 @@ class ExtractDB(nn.Module):
|
||||
self.batch_size = None
|
||||
self.dim = params.dim
|
||||
self.dim_key = self.dim // 2
|
||||
self.knowlwdge_num = params.knowlwdge_num # 100专家,确保是完全平方数
|
||||
self.num_experts = 10 * 10 # 100专家,确保是完全平方数
|
||||
# 将knowledge_dim设置为与head_dim相同,以便在attention中直接使用
|
||||
self.head_dim = params.dim // params.n_heads
|
||||
self.knowledge_length = params.knowlwdge_length*params.dim
|
||||
self.knowledge_dim = 8*params.dim
|
||||
|
||||
# 使用register_buffer代替nn.Parameter,避免梯度问题
|
||||
self.register_buffer('weight_down_embed', torch.randn(self.knowlwdge_num, self.knowledge_length) * 0.02)
|
||||
self.register_buffer('weight_down_embed', torch.randn(self.num_experts, self.knowledge_dim) * 0.02)
|
||||
|
||||
self.num_keys = int(math.sqrt(self.knowlwdge_num)) if self.knowlwdge_num > 0 else 0
|
||||
self.num_keys = int(math.sqrt(self.num_experts)) if self.num_experts > 0 else 0
|
||||
self.product_key_topk = min(16, self.num_keys)
|
||||
self.keys = nn.Parameter(torch.randn(self.num_keys, 2, self.dim_key) * 0.02)
|
||||
self.num_experts_per_head_topk = 1
|
||||
@ -630,19 +548,22 @@ class MiniMindLM(PreTrainedModel):
|
||||
self.downsample_q_specific = nn.Sequential(
|
||||
nn.Conv1d(128*8, 512, kernel_size=1, padding='same')
|
||||
)
|
||||
# 使用实数版本的位置编码,避免复数张量可能导致的段错误
|
||||
self.register_buffer("pos_cis_real",
|
||||
precompute_pos_cis_real(dim=params.dim // params.n_heads, theta=params.rope_theta),
|
||||
self.register_buffer("pos_cis",
|
||||
precompute_pos_cis(dim=params.dim // params.n_heads, theta=params.rope_theta),
|
||||
persistent=False)
|
||||
self.params = params
|
||||
|
||||
def forward(self,
|
||||
input_ids: Optional[torch.Tensor] = None,
|
||||
past_key_values: Optional[List[Tuple[torch.Tensor, torch.Tensor]]] = None,
|
||||
use_cache: bool = False,
|
||||
logits_to_keep: Union[int, torch.Tensor] = 0,
|
||||
**args):
|
||||
past_key_values = past_key_values or [None] * len(self.layers)
|
||||
start_pos = args.get('start_pos', 0)
|
||||
h = self.dropout(self.tok_embeddings(input_ids))
|
||||
pos_cis_real = self.pos_cis_real[start_pos:start_pos + input_ids.size(1)]
|
||||
pos_cis = self.pos_cis[start_pos:start_pos + input_ids.size(1)]
|
||||
past_kvs = []
|
||||
h_list = []
|
||||
|
||||
for l, layer in enumerate(self.layers):
|
||||
@ -657,10 +578,13 @@ class MiniMindLM(PreTrainedModel):
|
||||
index = self.extract_db.q_to_k(h)
|
||||
db_value = self.extract_db.get_data(index)
|
||||
|
||||
h = layer(
|
||||
h, db_value, pos_cis_real
|
||||
h, past_kv = layer(
|
||||
h, db_value, pos_cis,
|
||||
past_key_value=past_key_values[l],
|
||||
use_cache=use_cache
|
||||
)
|
||||
|
||||
past_kvs.append(past_kv)
|
||||
h_list.append(h.unsqueeze(0))
|
||||
|
||||
h_tensor = torch.cat(h_list, dim=0).permute(1, 0, 2, 3)
|
||||
@ -687,6 +611,7 @@ class MiniMindLM(PreTrainedModel):
|
||||
# 进一步简化,只保留必要的参数
|
||||
output = CausalLMOutputWithPast(
|
||||
logits=logits,
|
||||
past_key_values=past_kvs,
|
||||
)
|
||||
output.hidden_states = h
|
||||
|
||||
@ -702,17 +627,17 @@ class MiniMindLM(PreTrainedModel):
|
||||
|
||||
@torch.inference_mode()
|
||||
def generate(self, input_ids, eos_token_id=2, max_new_tokens=1024, temperature=0.75, top_p=0.90,
|
||||
stream=False, rp=1., pad_token_id=0, num_return_sequences=1, **args):
|
||||
stream=False, rp=1., use_cache=True, pad_token_id=0, num_return_sequences=1, **args):
|
||||
# 流式生成
|
||||
if stream:
|
||||
return self._stream(input_ids, eos_token_id, max_new_tokens, temperature, top_p, rp, **args)
|
||||
return self._stream(input_ids, eos_token_id, max_new_tokens, temperature, top_p, rp, use_cache, **args)
|
||||
|
||||
# 直接生成
|
||||
generated = []
|
||||
for i in range(input_ids.size(0)):
|
||||
non_pad = input_ids[i][input_ids[i] != pad_token_id].unsqueeze(0)
|
||||
for _ in range(num_return_sequences):
|
||||
out = self._stream(non_pad, eos_token_id, max_new_tokens, temperature, top_p, rp, **args)
|
||||
out = self._stream(non_pad, eos_token_id, max_new_tokens, temperature, top_p, rp, use_cache, **args)
|
||||
tokens_list = [tokens[:, -1:] for tokens in out]
|
||||
gen = torch.cat(tokens_list, dim=-1) if tokens_list else non_pad
|
||||
full_sequence = torch.cat([non_pad, gen], dim=-1)
|
||||
@ -729,14 +654,15 @@ class MiniMindLM(PreTrainedModel):
|
||||
res = output.view(input_ids.size(0) * num_return_sequences, -1)
|
||||
return res
|
||||
|
||||
def _stream(self, input_ids, eos_token_id, max_new_tokens, temperature, top_p, rp, **args):
|
||||
start, first_seq = input_ids.shape[1], True
|
||||
def _stream(self, input_ids, eos_token_id, max_new_tokens, temperature, top_p, rp, use_cache, **args):
|
||||
start, first_seq, past_kvs = input_ids.shape[1], True, None
|
||||
while input_ids.shape[1] < max_new_tokens - 1:
|
||||
if first_seq:
|
||||
out, first_seq = self(input_ids, **args), False
|
||||
if first_seq or not use_cache:
|
||||
out, first_seq = self(input_ids, past_key_values=past_kvs, use_cache=use_cache, **args), False
|
||||
else:
|
||||
out = self(input_ids[:, -1:], start_pos=input_ids.shape[1] - 1, **args)
|
||||
logits = out.logits[:, -1, :]
|
||||
out = self(input_ids[:, -1:], past_key_values=past_kvs, use_cache=use_cache,
|
||||
start_pos=input_ids.shape[1] - 1, **args)
|
||||
logits, past_kvs = out.logits[:, -1, :], out.past_key_values
|
||||
logits[:, list(set(input_ids.tolist()[0]))] /= rp
|
||||
logits /= (temperature + 1e-9)
|
||||
if top_p is not None and top_p < 1.0:
|
||||
|
133
requirements.txt
@ -1,147 +1,30 @@
|
||||
accelerate==1.6.0
|
||||
aiohappyeyeballs==2.6.1
|
||||
aiohttp==3.11.17
|
||||
aiosignal==1.3.2
|
||||
altair==5.5.0
|
||||
annotated-types==0.7.0
|
||||
anyio==4.9.0
|
||||
async-timeout==5.0.1
|
||||
attrs==25.3.0
|
||||
blinker==1.9.0
|
||||
cachetools==5.5.2
|
||||
certifi==2025.1.31
|
||||
charset-normalizer==3.4.1
|
||||
click==8.1.8
|
||||
contourpy==1.3.2
|
||||
cycler==0.12.1
|
||||
datasets==2.21.0
|
||||
datasketch==1.6.4
|
||||
deepspeed==0.16.7
|
||||
dill==0.3.8
|
||||
distro==1.9.0
|
||||
docker-pycreds==0.4.0
|
||||
einops==0.8.1
|
||||
exceptiongroup==1.2.2
|
||||
filelock==3.18.0
|
||||
Flask==3.0.3
|
||||
Flask-Cors==4.0.0
|
||||
fonttools==4.57.0
|
||||
frozenlist==1.6.0
|
||||
fsspec==2024.6.1
|
||||
gitdb==4.0.12
|
||||
GitPython==3.1.44
|
||||
h11==0.14.0
|
||||
hjson==3.1.0
|
||||
httpcore==1.0.8
|
||||
httpx==0.28.1
|
||||
huggingface-hub==0.30.2
|
||||
idna==3.10
|
||||
importlib_metadata==7.2.1
|
||||
itsdangerous==2.2.0
|
||||
Flask_Cors==4.0.0
|
||||
jieba==0.42.1
|
||||
Jinja2==3.1.2
|
||||
jiter==0.9.0
|
||||
joblib==1.4.2
|
||||
jsonlines==4.0.0
|
||||
jsonschema==4.23.0
|
||||
jsonschema-specifications==2024.10.1
|
||||
kiwisolver==1.4.8
|
||||
markdown-it-py==3.0.0
|
||||
MarkupSafe==3.0.2
|
||||
marshmallow==3.22.0
|
||||
matplotlib==3.10.0
|
||||
mdurl==0.1.2
|
||||
modelscope==1.25.0
|
||||
mpmath==1.3.0
|
||||
msgpack==1.1.0
|
||||
multidict==6.4.3
|
||||
multiprocess==0.70.16
|
||||
narwhals==1.35.0
|
||||
networkx==3.4.2
|
||||
ngrok==1.4.0
|
||||
ninja==1.11.1.4
|
||||
nltk==3.8
|
||||
numpy==1.26.4
|
||||
nvidia-cublas-cu11==11.11.3.6
|
||||
nvidia-cublas-cu12==12.1.3.1
|
||||
nvidia-cuda-cupti-cu11==11.8.87
|
||||
nvidia-cuda-cupti-cu12==12.1.105
|
||||
nvidia-cuda-nvrtc-cu11==11.8.89
|
||||
nvidia-cuda-nvrtc-cu12==12.1.105
|
||||
nvidia-cuda-runtime-cu11==11.8.89
|
||||
nvidia-cuda-runtime-cu12==12.1.105
|
||||
nvidia-cudnn-cu11==9.1.0.70
|
||||
nvidia-cudnn-cu12==8.9.2.26
|
||||
nvidia-cufft-cu11==10.9.0.58
|
||||
nvidia-cufft-cu12==11.0.2.54
|
||||
nvidia-curand-cu11==10.3.0.86
|
||||
nvidia-curand-cu12==10.3.2.106
|
||||
nvidia-cusolver-cu11==11.4.1.48
|
||||
nvidia-cusolver-cu12==11.4.5.107
|
||||
nvidia-cusparse-cu11==11.7.5.86
|
||||
nvidia-cusparse-cu12==12.1.0.106
|
||||
nvidia-nccl-cu11==2.21.5
|
||||
nvidia-nccl-cu12==2.19.3
|
||||
nvidia-nvjitlink-cu12==12.8.93
|
||||
nvidia-nvtx-cu11==11.8.86
|
||||
nvidia-nvtx-cu12==12.1.105
|
||||
openai==1.59.6
|
||||
packaging==23.2
|
||||
pandas==1.5.3
|
||||
peft==0.7.1
|
||||
pillow==10.4.0
|
||||
platformdirs==4.3.7
|
||||
propcache==0.3.1
|
||||
protobuf==4.25.6
|
||||
psutil==5.9.8
|
||||
py-cpuinfo==9.0.0
|
||||
pyarrow==19.0.1
|
||||
pydantic==2.8.2
|
||||
pydantic_core==2.20.1
|
||||
pydeck==0.9.1
|
||||
Pygments==2.19.1
|
||||
pyparsing==3.2.3
|
||||
python-dateutil==2.9.0.post0
|
||||
pytz==2025.2
|
||||
PyYAML==6.0.2
|
||||
referencing==0.36.2
|
||||
regex==2024.11.6
|
||||
requests==2.32.3
|
||||
rich==13.7.1
|
||||
rpds-py==0.24.0
|
||||
safetensors==0.5.3
|
||||
scikit-learn==1.5.1
|
||||
scipy==1.15.2
|
||||
sentence-transformers==2.3.1
|
||||
sentencepiece==0.2.0
|
||||
sentry-sdk==2.26.1
|
||||
setproctitle==1.3.5
|
||||
scikit_learn==1.5.1
|
||||
sentence_transformers==2.3.1
|
||||
simhash==2.1.2
|
||||
six==1.17.0
|
||||
smmap==5.0.2
|
||||
sniffio==1.3.1
|
||||
streamlit==1.30.0
|
||||
sympy==1.13.3
|
||||
tenacity==8.5.0
|
||||
threadpoolctl==3.6.0
|
||||
tiktoken==0.5.1
|
||||
tokenizers==0.21.1
|
||||
toml==0.10.2
|
||||
torch==2.7.0+cu118
|
||||
torchvision==0.22.0+cu118
|
||||
tornado==6.4.2
|
||||
tqdm==4.67.1
|
||||
transformers==4.48.0
|
||||
triton==3.3.0
|
||||
jinja2==3.1.2
|
||||
jsonlines==4.0.0
|
||||
trl==0.13.0
|
||||
typing_extensions==4.13.2
|
||||
tzlocal==5.3.1
|
||||
ujson==5.1.0
|
||||
urllib3==2.4.0
|
||||
validators==0.34.0
|
||||
wandb==0.18.3
|
||||
watchdog==6.0.0
|
||||
Werkzeug==3.1.3
|
||||
xxhash==3.5.0
|
||||
yarl==1.20.0
|
||||
zipp==3.21.0
|
||||
streamlit==1.30.0
|
||||
torch==2.2.2
|
||||
torchvision==0.17.2
|
@ -1,48 +0,0 @@
|
||||
#!/bin/bash
|
||||
|
||||
# 激活conda环境
|
||||
source $(conda info --base)/etc/profile.d/conda.sh
|
||||
conda activate ycz_accelerate
|
||||
|
||||
# 设置环境变量以帮助调试
|
||||
export NCCL_DEBUG=INFO
|
||||
export PYTHONFAULTHANDLER=1
|
||||
|
||||
# 方法1: 使用预先配置的accelerate配置文件
|
||||
# accelerate launch --config_file accelerate_config.yaml train_pretrain_accelerate.py \
|
||||
# --epochs 3 \
|
||||
# --batch_size 24 \
|
||||
# --learning_rate 2e-4 \
|
||||
# --dtype bfloat16 \
|
||||
# --accumulation_steps 32 \
|
||||
# --grad_clip 1.0 \
|
||||
# --log_interval 100 \
|
||||
# --save_interval 10000 \
|
||||
# --dim 1024 \
|
||||
# --n_layers 32 \
|
||||
# --max_seq_len 1024 \
|
||||
# --use_flash_attn \
|
||||
# --profile \
|
||||
# --profile_interval 10
|
||||
|
||||
# 方法2: 使用命令行参数直接配置accelerate
|
||||
CUDA_VISIBLE_DEVICES=0 accelerate launch \
|
||||
--multi_gpu \
|
||||
--num_processes=4 \
|
||||
--mixed_precision=bf16 \
|
||||
--main_process_port=29500 \
|
||||
train_pretrain_accelerate.py \
|
||||
--epochs 3 \
|
||||
--batch_size 24 \
|
||||
--learning_rate 2e-4 \
|
||||
--dtype bfloat16 \
|
||||
--accumulation_steps 32 \
|
||||
--grad_clip 1.0 \
|
||||
--log_interval 100 \
|
||||
--save_interval 10000 \
|
||||
--dim 512 \
|
||||
--n_layers 12 \
|
||||
--max_seq_len 512 \
|
||||
--use_flash_attn \
|
||||
--profile \
|
||||
--profile_interval 10
|
@ -1,48 +0,0 @@
|
||||
#!/bin/bash
|
||||
|
||||
# 激活conda环境
|
||||
source $(conda info --base)/etc/profile.d/conda.sh
|
||||
conda activate ycz_accelerate
|
||||
|
||||
# 设置环境变量以帮助调试
|
||||
export NCCL_DEBUG=INFO
|
||||
export PYTHONFAULTHANDLER=1
|
||||
|
||||
# 方法1: 使用预先配置的accelerate配置文件
|
||||
# accelerate launch --config_file accelerate_config.yaml train_pretrain_accelerate.py \
|
||||
# --epochs 3 \
|
||||
# --batch_size 24 \
|
||||
# --learning_rate 2e-4 \
|
||||
# --dtype bfloat16 \
|
||||
# --accumulation_steps 32 \
|
||||
# --grad_clip 1.0 \
|
||||
# --log_interval 100 \
|
||||
# --save_interval 10000 \
|
||||
# --dim 1024 \
|
||||
# --n_layers 32 \
|
||||
# --max_seq_len 1024 \
|
||||
# --use_flash_attn \
|
||||
# --profile \
|
||||
# --profile_interval 10
|
||||
|
||||
# 方法2: 使用命令行参数直接配置accelerate
|
||||
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
|
||||
--multi_gpu \
|
||||
--num_processes=4 \
|
||||
--mixed_precision=bf16 \
|
||||
--main_process_port=29500 \
|
||||
train_pretrain_accelerate.py \
|
||||
--epochs 3 \
|
||||
--batch_size 24 \
|
||||
--learning_rate 2e-4 \
|
||||
--dtype bfloat16 \
|
||||
--accumulation_steps 32 \
|
||||
--grad_clip 1.0 \
|
||||
--log_interval 100 \
|
||||
--save_interval 10000 \
|
||||
--dim 1024 \
|
||||
--n_layers 32 \
|
||||
--max_seq_len 1024 \
|
||||
--use_flash_attn \
|
||||
--profile \
|
||||
--profile_interval 10
|
@ -1,97 +0,0 @@
|
||||
#!/usr/bin/env python
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
"""
|
||||
测试实数版本的位置编码
|
||||
"""
|
||||
|
||||
import torch
|
||||
from model.model import precompute_pos_cis, precompute_pos_cis_real, apply_rotary_emb, apply_rotary_emb_real
|
||||
from model.LMConfig import LMConfig
|
||||
from model.model import MiniMindLM
|
||||
|
||||
def test_pos_encoding_equivalence():
|
||||
"""测试复数版本和实数版本的位置编码是否等价"""
|
||||
print("测试位置编码等价性...")
|
||||
|
||||
# 参数设置
|
||||
dim = 64
|
||||
seq_len = 10
|
||||
|
||||
# 生成复数版本的位置编码
|
||||
pos_cis = precompute_pos_cis(dim=dim, end=seq_len)
|
||||
|
||||
# 生成实数版本的位置编码
|
||||
pos_cis_real = precompute_pos_cis_real(dim=dim, end=seq_len)
|
||||
|
||||
# 创建随机查询和键
|
||||
batch_size = 2
|
||||
n_heads = 4
|
||||
head_dim = dim
|
||||
|
||||
xq = torch.randn(batch_size, seq_len, n_heads, head_dim)
|
||||
xk = torch.randn(batch_size, seq_len, n_heads, head_dim)
|
||||
|
||||
# 应用复数版本的旋转位置编码
|
||||
xq_complex, xk_complex = apply_rotary_emb(xq, xk, pos_cis)
|
||||
|
||||
# 应用实数版本的旋转位置编码
|
||||
xq_real, xk_real = apply_rotary_emb_real(xq, xk, pos_cis_real)
|
||||
|
||||
# 计算差异
|
||||
q_diff = torch.abs(xq_complex - xq_real).mean().item()
|
||||
k_diff = torch.abs(xk_complex - xk_real).mean().item()
|
||||
|
||||
print(f"查询差异: {q_diff:.6f}")
|
||||
print(f"键差异: {k_diff:.6f}")
|
||||
|
||||
# 检查差异是否在可接受范围内
|
||||
tolerance = 1e-5
|
||||
if q_diff < tolerance and k_diff < tolerance:
|
||||
print("✅ 测试通过: 复数版本和实数版本的位置编码在数值上等价")
|
||||
else:
|
||||
print("❌ 测试失败: 复数版本和实数版本的位置编码存在显著差异")
|
||||
|
||||
def test_model_forward():
|
||||
"""测试模型前向传播"""
|
||||
print("\n测试模型前向传播...")
|
||||
|
||||
# 创建模型配置
|
||||
config = LMConfig(
|
||||
dim=128,
|
||||
n_layers=2,
|
||||
n_heads=4,
|
||||
n_kv_heads=4, # 确保n_kv_heads被设置,且n_heads能被n_kv_heads整除
|
||||
vocab_size=1000,
|
||||
max_seq_len=128,
|
||||
disable_db=True # 禁用数据库功能,避免额外的复杂性
|
||||
)
|
||||
|
||||
# 创建模型
|
||||
try:
|
||||
model = MiniMindLM(config)
|
||||
print(f"✅ 模型初始化成功")
|
||||
except Exception as e:
|
||||
print(f"❌ 模型初始化失败: {str(e)}")
|
||||
return
|
||||
|
||||
# 创建输入
|
||||
batch_size = 2
|
||||
seq_len = 10
|
||||
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
|
||||
|
||||
# 前向传播
|
||||
try:
|
||||
with torch.no_grad():
|
||||
outputs = model(input_ids)
|
||||
print(f"✅ 模型前向传播成功")
|
||||
print(f"输出形状: {outputs.logits.shape}")
|
||||
except Exception as e:
|
||||
print(f"❌ 模型前向传播失败: {str(e)}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
# 测试位置编码等价性
|
||||
test_pos_encoding_equivalence()
|
||||
|
||||
# 测试模型前向传播
|
||||
test_model_forward()
|
@ -13,7 +13,6 @@ from torch import optim, nn
|
||||
from torch.nn.parallel import DistributedDataParallel
|
||||
from torch.optim.lr_scheduler import CosineAnnealingLR
|
||||
from torch.utils.data import DataLoader, DistributedSampler
|
||||
# 移除通信分析工具导入
|
||||
from contextlib import nullcontext
|
||||
from typing import Optional
|
||||
|
||||
@ -43,67 +42,18 @@ def train_epoch(epoch, wandb):
|
||||
start_time = time.time()
|
||||
# 在函数开始处定义moe_path,避免在异常处理中引用未定义变量
|
||||
moe_path = '_moe' if lm_config.use_moe else ''
|
||||
|
||||
# 添加CUDA事件来分析性能
|
||||
if args.profile and (not ddp or dist.get_rank() == 0):
|
||||
data_start = torch.cuda.Event(enable_timing=True)
|
||||
data_end = torch.cuda.Event(enable_timing=True)
|
||||
forward_start = torch.cuda.Event(enable_timing=True)
|
||||
forward_end = torch.cuda.Event(enable_timing=True)
|
||||
backward_start = torch.cuda.Event(enable_timing=True)
|
||||
backward_end = torch.cuda.Event(enable_timing=True)
|
||||
optimizer_start = torch.cuda.Event(enable_timing=True)
|
||||
optimizer_end = torch.cuda.Event(enable_timing=True)
|
||||
|
||||
# 移除CUDA图优化代码
|
||||
|
||||
# 预取数据
|
||||
prefetch_factor = 2 # 预取的批次数
|
||||
data_iter = iter(train_loader)
|
||||
prefetch_batches = []
|
||||
|
||||
# 预取初始批次
|
||||
for _ in range(min(prefetch_factor, len(train_loader))):
|
||||
for step, (X, Y, loss_mask) in enumerate(train_loader):
|
||||
try:
|
||||
batch = next(data_iter)
|
||||
prefetch_batches.append([t.to(args.device, non_blocking=True) for t in batch])
|
||||
except StopIteration:
|
||||
break
|
||||
|
||||
for step in range(len(train_loader)):
|
||||
try:
|
||||
# 计时数据加载
|
||||
if args.profile and (not ddp or dist.get_rank() == 0):
|
||||
data_start.record()
|
||||
|
||||
# 使用预取的数据
|
||||
if prefetch_batches:
|
||||
X, Y, loss_mask = prefetch_batches.pop(0)
|
||||
else:
|
||||
# 如果预取队列为空,直接加载
|
||||
X, Y, loss_mask = [t.to(args.device) for t in next(data_iter)]
|
||||
|
||||
# 异步预取下一批数据
|
||||
if step + prefetch_factor < len(train_loader):
|
||||
try:
|
||||
batch = next(data_iter)
|
||||
prefetch_batches.append([t.to(args.device, non_blocking=True) for t in batch])
|
||||
except StopIteration:
|
||||
pass
|
||||
|
||||
if args.profile and (not ddp or dist.get_rank() == 0):
|
||||
data_end.record()
|
||||
# 将数据加载到设备上
|
||||
X = X.to(args.device)
|
||||
Y = Y.to(args.device)
|
||||
loss_mask = loss_mask.to(args.device)
|
||||
|
||||
# 更新学习率
|
||||
lr = get_lr(epoch * iter_per_epoch + step, args.epochs * iter_per_epoch, args.learning_rate)
|
||||
for param_group in optimizer.param_groups:
|
||||
param_group['lr'] = lr
|
||||
|
||||
# 计时前向传播
|
||||
if args.profile and (not ddp or dist.get_rank() == 0):
|
||||
forward_start.record()
|
||||
|
||||
# 常规前向传播
|
||||
with ctx:
|
||||
res = model(X)
|
||||
loss = loss_fct(
|
||||
@ -127,13 +77,6 @@ def train_epoch(epoch, wandb):
|
||||
# 如果出错,不添加辅助损失
|
||||
loss = loss / args.accumulation_steps
|
||||
|
||||
# 反向传播
|
||||
scaler.scale(loss).backward()
|
||||
|
||||
if args.profile and (not ddp or dist.get_rank() == 0):
|
||||
forward_end.record()
|
||||
backward_start.record()
|
||||
|
||||
# Print data types for debugging
|
||||
if step == 0 and (not ddp or dist.get_rank() == 0): # Print only for the first step of the first epoch on the main process
|
||||
Logger("---- Data Type Check ----")
|
||||
@ -146,21 +89,9 @@ def train_epoch(epoch, wandb):
|
||||
Logger(f"loss.dtype: {loss.dtype}")
|
||||
Logger("-------------------------")
|
||||
|
||||
if args.profile and (not ddp or dist.get_rank() == 0):
|
||||
backward_end.record()
|
||||
scaler.scale(loss).backward()
|
||||
|
||||
# 在每一步都进行性能分析,而不仅仅是在梯度累积完成时
|
||||
if (step + 1) % args.profile_interval == 0:
|
||||
# 记录优化器时间(如果是梯度累积步骤)
|
||||
if (step + 1) % args.accumulation_steps == 0:
|
||||
optimizer_start.record()
|
||||
|
||||
# 优化器步骤
|
||||
if (step + 1) % args.accumulation_steps == 0:
|
||||
if args.profile and (not ddp or dist.get_rank() == 0):
|
||||
if (step + 1) % args.profile_interval != 0:
|
||||
optimizer_start.record()
|
||||
|
||||
scaler.unscale_(optimizer)
|
||||
torch.nn.utils.clip_grad_norm_(model.parameters(), args.grad_clip)
|
||||
|
||||
@ -169,40 +100,6 @@ def train_epoch(epoch, wandb):
|
||||
|
||||
optimizer.zero_grad(set_to_none=True)
|
||||
|
||||
if args.profile and (not ddp or dist.get_rank() == 0):
|
||||
optimizer_end.record()
|
||||
|
||||
# 性能分析输出(每profile_interval步)
|
||||
if args.profile and (not ddp or dist.get_rank() == 0) and (step + 1) % args.profile_interval == 0:
|
||||
# 同步CUDA事件以获取准确的计时
|
||||
torch.cuda.synchronize()
|
||||
|
||||
# 计算各阶段耗时
|
||||
data_time = data_start.elapsed_time(data_end)
|
||||
forward_time = forward_start.elapsed_time(forward_end)
|
||||
backward_time = backward_start.elapsed_time(backward_end)
|
||||
|
||||
# 只有在梯度累积步骤完成时才有优化器时间
|
||||
if (step + 1) % args.accumulation_steps == 0:
|
||||
optimizer_time = optimizer_start.elapsed_time(optimizer_end)
|
||||
total_compute_time = forward_time + backward_time + optimizer_time
|
||||
Logger(f"性能分析 - 步骤 {step+1}:")
|
||||
Logger(f" 数据加载时间: {data_time:.2f} ms")
|
||||
Logger(f" 前向传播时间: {forward_time:.2f} ms")
|
||||
Logger(f" 反向传播时间: {backward_time:.2f} ms")
|
||||
Logger(f" 优化器时间: {optimizer_time:.2f} ms")
|
||||
Logger(f" 总计算时间: {total_compute_time:.2f} ms")
|
||||
Logger(f" 计算/数据比例: {total_compute_time / data_time:.2f}")
|
||||
else:
|
||||
# 非梯度累积步骤,没有优化器时间
|
||||
total_compute_time = forward_time + backward_time
|
||||
Logger(f"性能分析 - 步骤 {step+1} (梯度累积中):")
|
||||
Logger(f" 数据加载时间: {data_time:.2f} ms")
|
||||
Logger(f" 前向传播时间: {forward_time:.2f} ms")
|
||||
Logger(f" 反向传播时间: {backward_time:.2f} ms")
|
||||
Logger(f" 总计算时间: {total_compute_time:.2f} ms")
|
||||
Logger(f" 计算/数据比例: {total_compute_time / data_time:.2f}")
|
||||
|
||||
# 打印日志
|
||||
if step % args.log_interval == 0:
|
||||
spend_time = time.time() - start_time
|
||||
@ -217,39 +114,9 @@ def train_epoch(epoch, wandb):
|
||||
spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60))
|
||||
|
||||
if (wandb is not None) and (not ddp or dist.get_rank() == 0):
|
||||
log_dict = {
|
||||
"loss": loss.item() * args.accumulation_steps,
|
||||
"lr": optimizer.param_groups[-1]['lr'],
|
||||
"epoch_Time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60
|
||||
}
|
||||
|
||||
# 如果启用了性能分析,也记录性能指标
|
||||
if args.profile and (step + 1) % args.profile_interval == 0:
|
||||
# 基本性能指标
|
||||
perf_dict = {
|
||||
"data_time_ms": data_time,
|
||||
"forward_time_ms": forward_time,
|
||||
"backward_time_ms": backward_time
|
||||
}
|
||||
|
||||
# 只有在梯度累积步骤完成时才有优化器时间
|
||||
if (step + 1) % args.accumulation_steps == 0:
|
||||
total_compute_time = forward_time + backward_time + optimizer_time
|
||||
perf_dict.update({
|
||||
"optimizer_time_ms": optimizer_time,
|
||||
"compute_time_ms": total_compute_time
|
||||
})
|
||||
else:
|
||||
total_compute_time = forward_time + backward_time
|
||||
perf_dict.update({
|
||||
"compute_time_ms": total_compute_time
|
||||
})
|
||||
|
||||
log_dict.update(perf_dict)
|
||||
|
||||
wandb.log(log_dict)
|
||||
|
||||
# 移除通信分析代码
|
||||
wandb.log({"loss": loss.item() * args.accumulation_steps,
|
||||
"lr": optimizer.param_groups[-1]['lr'],
|
||||
"epoch_Time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60})
|
||||
|
||||
# 保存模型
|
||||
if (step + 1) % args.save_interval == 0 and (not ddp or dist.get_rank() == 0):
|
||||
@ -309,9 +176,6 @@ def init_model(lm_config, pretrained_embedding_path: Optional[str] = None):
|
||||
return model, tokenizer
|
||||
|
||||
|
||||
# 移除通信分析函数
|
||||
|
||||
|
||||
def init_distributed_mode():
|
||||
if not ddp: return #如果没有启用分布式数据并行(DDP),直接返回,不执行任何操作。
|
||||
global ddp_local_rank, DEVICE #声明这两个变量为全局变量,以便在函数外部也能访问它们。
|
||||
@ -330,42 +194,35 @@ if __name__ == "__main__":
|
||||
parser.add_argument("--out_dir", type=str, default="out")
|
||||
# 若要以最快速度实现zero则epochs设置为1轮;否则应当利用有限的数据训练2~6个epochs。
|
||||
parser.add_argument("--epochs", type=int, default=3)
|
||||
parser.add_argument("--batch_size", type=int, default=24)
|
||||
parser.add_argument("--batch_size", type=int, default=8)
|
||||
parser.add_argument("--learning_rate", type=float, default=2e-4)
|
||||
parser.add_argument("--device", type=str, default="cuda:0" if torch.cuda.is_available() else "cpu") #如果GPU可用,则使用GPU,否则使用CPU。
|
||||
parser.add_argument("--dtype", type=str, default="bfloat16")
|
||||
parser.add_argument("--use_wandb", default=True, action="store_true")
|
||||
parser.add_argument("--wandb_project", type=str, default="MiniMind-Pretrain")
|
||||
parser.add_argument("--num_workers", type=int, default=48)
|
||||
parser.add_argument("--num_workers", type=int, default=8)
|
||||
parser.add_argument("--ddp", action="store_true")
|
||||
parser.add_argument("--accumulation_steps", type=int, default=32) #梯度累积步数,用于控制梯度更新频率。
|
||||
parser.add_argument("--accumulation_steps", type=int, default=64) #梯度累积步数,用于控制梯度更新频率。
|
||||
parser.add_argument("--grad_clip", type=float, default=1.0) #梯度裁剪阈值,用于防止梯度爆炸。
|
||||
parser.add_argument("--warmup_iters", type=int, default=0) #预热迭代次数,用于控制学习率预热过程。
|
||||
parser.add_argument("--log_interval", type=int, default=100) #日志打印间隔,用于控制日志打印的频率。
|
||||
parser.add_argument("--save_interval", type=int, default=10000) #模型保存间隔,用于控制模型保存的频率。
|
||||
parser.add_argument("--save_interval", type=int, default=100) #模型保存间隔,用于控制模型保存的频率。
|
||||
parser.add_argument('--local_rank', type=int, default=-1) #本地进程编号,用于分布式训练。
|
||||
parser.add_argument('--dim', default=1024, type=int) #模型维度,用于控制模型的大小。
|
||||
parser.add_argument('--dim', default=2048, type=int) #模型维度,用于控制模型的大小。
|
||||
parser.add_argument('--n_layers', default=32, type=int) #层数,用于控制模型层数。
|
||||
parser.add_argument('--max_seq_len', default=1024, type=int) #最大序列长度,用于控制输入序列的最大长度。
|
||||
parser.add_argument('--use_moe', default=False, type=bool) #是否使用MOE,用于控制是否使用MOE。
|
||||
parser.add_argument('--disable_db', action='store_true', help="禁用数据库功能,使用固定值1e-4替代") #禁用数据库功能,启用特殊模式
|
||||
parser.add_argument("--data_path", type=str, default="./dataset/pretrain_hq.jsonl") #数据路径,用于控制数据集的路径。
|
||||
parser.add_argument("--pretrained_embedding_path", type=str, default=None, help="Path to pretrained token embedding weights (.pth file)")
|
||||
# 性能分析相关参数
|
||||
parser.add_argument("--profile", action="store_true", default=True, help="启用性能分析")
|
||||
parser.add_argument("--profile_interval", type=int, default=10, help="性能分析打印间隔(步数)")
|
||||
parser.add_argument("--use_flash_attn", action="store_true", default=True, help="启用FlashAttention")
|
||||
args = parser.parse_args()
|
||||
print(args)
|
||||
|
||||
|
||||
lm_config = LMConfig(
|
||||
dim=args.dim,
|
||||
n_layers=args.n_layers,
|
||||
max_seq_len=args.max_seq_len,
|
||||
use_moe=args.use_moe,
|
||||
disable_db=args.disable_db, # 添加禁用数据库参数
|
||||
flash_attn=args.use_flash_attn # 添加FlashAttention支持
|
||||
disable_db=args.disable_db # 添加禁用数据库参数
|
||||
) #创建LMConfig对象,用于控制模型配置。
|
||||
args.save_dir = os.path.join(args.out_dir) #创建保存目录。
|
||||
os.makedirs(args.save_dir, exist_ok=True) #创建保存目录。
|
||||
@ -410,31 +267,24 @@ if __name__ == "__main__":
|
||||
model, tokenizer = init_model(lm_config, args.pretrained_embedding_path)
|
||||
train_ds = PretrainDataset(args.data_path, tokenizer, max_length=lm_config.max_seq_len)
|
||||
train_sampler = DistributedSampler(train_ds) if ddp else None
|
||||
# 优化DataLoader配置
|
||||
train_loader = DataLoader(
|
||||
train_ds,
|
||||
batch_size=args.batch_size,
|
||||
pin_memory=True,
|
||||
pin_memory_device=f"cuda:{ddp_local_rank}" if ddp else "cuda:0", # 指定pin_memory设备
|
||||
drop_last=False,
|
||||
shuffle=False,
|
||||
num_workers=args.num_workers,
|
||||
sampler=train_sampler,
|
||||
persistent_workers=True if args.num_workers > 0 else False, # 保持worker进程活跃
|
||||
prefetch_factor=2 if args.num_workers > 0 else None # 预取因子
|
||||
sampler=train_sampler
|
||||
)
|
||||
|
||||
# 只有在使用float16时才启用GradScaler,bfloat16不需要
|
||||
scaler = torch.cuda.amp.GradScaler(enabled=(args.dtype == 'float16'))
|
||||
scaler = torch.cuda.amp.GradScaler(enabled=(args.dtype in ['float16']))
|
||||
optimizer = optim.AdamW(model.parameters(), lr=args.learning_rate)
|
||||
|
||||
if ddp:
|
||||
model._ddp_params_and_buffers_to_ignore = {"pos_cis"}
|
||||
# 保留find_unused_parameters=True参数,因为模型中确实有未使用的参数
|
||||
# 添加find_unused_parameters=True参数,解决未使用参数的问题
|
||||
model = DistributedDataParallel(model, device_ids=[ddp_local_rank], find_unused_parameters=True)
|
||||
|
||||
# 暂时保留set_detect_anomaly以便调试
|
||||
# 训练稳定后可以注释掉这行来提高速度
|
||||
torch.autograd.set_detect_anomaly(True)
|
||||
iter_per_epoch = len(train_loader)
|
||||
for epoch in range(args.epochs):
|
||||
|
@ -1,398 +0,0 @@
|
||||
import os
# Set environment variables
os.environ["WANDB_MODE"] = "offline"  # or use "dryrun"
import platform
import argparse
import time
import math
import warnings
import pandas as pd
import torch
from torch import optim, nn
from torch.utils.data import DataLoader
from contextlib import nullcontext
from typing import Optional
import datetime  # Add datetime for time formatting
from accelerate import Accelerator
from accelerate.utils import set_seed
from accelerate.utils import DeepSpeedPlugin
from accelerate.utils import DistributedDataParallelKwargs
from transformers import AutoTokenizer, get_cosine_schedule_with_warmup

from model.model import MiniMindLM
from model.LMConfig import LMConfig
from model.dataset import PretrainDataset

warnings.filterwarnings('ignore')

# Logging function
def Logger(msg, accelerator=None):
    # Print on the main process only; always print when no accelerator is provided
    if accelerator is None or accelerator.is_main_process:
        print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] {msg}")

# Helper function to format seconds into HH:MM:SS
def format_time(seconds):
    return str(datetime.timedelta(seconds=int(seconds)))

# Learning-rate schedule
def get_lr(it, num_iters, learning_rate):
    # Cosine learning-rate decay
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * it / num_iters))
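A quick sanity check of this schedule (an editorial sketch, not part of the original file; it assumes the get_lr and math import defined above): the cosine factor goes from 1.0 at it=0 to 0.5 at the halfway point and 0.0 at it=num_iters, so the rate decays smoothly from learning_rate to zero with no warmup.

# Worked example with illustrative values learning_rate=2e-4, num_iters=1000:
#   get_lr(0,    1000, 2e-4) -> 2.0e-4   (cos(0)    =  1, factor 1.0)
#   get_lr(500,  1000, 2e-4) -> 1.0e-4   (cos(pi/2) =  0, factor 0.5)
#   get_lr(1000, 1000, 2e-4) -> 0.0      (cos(pi)   = -1, factor 0.0)
for it in (0, 500, 1000):
    print(it, get_lr(it, 1000, 2e-4))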

# Model initialization
def init_model(lm_config, pretrained_embedding_path=None):
    tokenizer = AutoTokenizer.from_pretrained('./model/minimind_tokenizer')
    model = MiniMindLM(lm_config)

    # Load pretrained embedding weights if a path was provided
    if pretrained_embedding_path:
        Logger(f"Loading pretrained token embeddings from {pretrained_embedding_path}")
        pretrained_embeddings = torch.load(pretrained_embedding_path)
        model.tok_embeddings.weight.data.copy_(pretrained_embeddings)
        model.output.weight.data.copy_(pretrained_embeddings)  # tied weights

    Logger(f'Total LLM parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.3f} million')
    return model, tokenizer

def train_epoch(epoch, accelerator, model, train_loader, optimizer, scheduler, args, ctx, overall_start_time):
    loss_fct = nn.CrossEntropyLoss(reduction='none')
    epoch_start_time = time.time()
    total_steps_in_epoch = len(train_loader)
    total_training_steps = args.epochs * total_steps_in_epoch
    moe_path = '_moe' if args.use_moe else ''

    # Add CUDA events to profile performance (main process only)
    if args.profile and accelerator.is_main_process:
        data_start = torch.cuda.Event(enable_timing=True)
        data_end = torch.cuda.Event(enable_timing=True)
        forward_start = torch.cuda.Event(enable_timing=True)
        forward_end = torch.cuda.Event(enable_timing=True)
        backward_start = torch.cuda.Event(enable_timing=True)
        backward_end = torch.cuda.Event(enable_timing=True)
        optimizer_start = torch.cuda.Event(enable_timing=True)
        optimizer_end = torch.cuda.Event(enable_timing=True)

    # Prefetch data
    prefetch_factor = 2  # number of batches to prefetch
    data_iter = iter(train_loader)
    prefetch_batches = []

    # Prefetch the initial batches
    for _ in range(min(prefetch_factor, len(train_loader))):
        try:
            batch = next(data_iter)
            prefetch_batches.append(batch)
        except StopIteration:
            break

    # Initialize variables needed for logging before entering the loop
    last_log_time = epoch_start_time

    for step in range(total_steps_in_epoch):
        try:
            # Time data loading (main process only)
            if args.profile and accelerator.is_main_process:
                data_start.record()

            # Use prefetched data
            if prefetch_batches:
                X, Y, loss_mask = prefetch_batches.pop(0)
            else:
                # If the prefetch queue is empty, load directly
                X, Y, loss_mask = next(data_iter)

            # Asynchronously prefetch the next batch
            if step + prefetch_factor < len(train_loader):
                try:
                    batch = next(data_iter)
                    prefetch_batches.append(batch)
                except StopIteration:
                    pass

            # Stop timing data loading (main process only)
            if args.profile and accelerator.is_main_process:
                data_end.record()

            # Update the learning rate
            if scheduler is not None:
                scheduler.step()

            # Time the forward pass (main process only)
            if args.profile and accelerator.is_main_process:
                forward_start.record()

            # Forward pass
            with ctx:
                res = model(X)
                loss = loss_fct(
                    res.logits.view(-1, res.logits.size(-1)),
                    Y.view(-1)
                ).view(Y.size())
                loss = (loss * loss_mask).sum() / loss_mask.sum()
                # Add the auxiliary loss, if present
                try:
                    aux_loss = sum(l.feed_forward.aux_loss for l in model.module.layers
                                   if hasattr(l.feed_forward, 'aux_loss'))
                    loss += aux_loss
                except Exception as e:
                    Logger(f"Warning: Could not add auxiliary loss: {e}")
                    # On error, do not add the auxiliary loss
                loss = loss / args.accumulation_steps

            # Stop timing the forward pass (main process only)
            if args.profile and accelerator.is_main_process:
                forward_end.record()

            # Time the backward pass (main process only)
            if args.profile and accelerator.is_main_process:
                backward_start.record()

            # Backward pass
            # When DeepSpeed is in use it handles gradient accumulation and clipping automatically
            accelerator.backward(loss)

            # Stop timing the backward pass (main process only)
            if args.profile and accelerator.is_main_process:
                backward_end.record()

            # Time the optimizer step (main process only)
            if args.profile and accelerator.is_main_process:
                optimizer_start.record()

            # Optimizer step - when DeepSpeed is in use it handles gradient accumulation and clipping automatically
            # The optimizer step only takes effect once the accumulation step count is reached
            # Note: DeepSpeed handles gradient accumulation itself, so there is no need to check step % accumulation_steps
            optimizer.step()

            # With DeepSpeed, zero_grad() is called automatically after step(),
            # but we still call it explicitly to be safe
            optimizer.zero_grad()

            # Stop timing the optimizer step (main process only)
            if args.profile and accelerator.is_main_process:
                optimizer_end.record()

            # Print training info (main process only)
            if (step + 1) % args.log_interval == 0 and accelerator.is_main_process:
                current_time = time.time()
                # Compute performance metrics
                if args.profile:
                    torch.cuda.synchronize()
                    # Use the time since the last log to compute the metrics, not the total time
                    data_time = data_start.elapsed_time(data_end)
                    forward_time = forward_start.elapsed_time(forward_end)
                    backward_time = backward_start.elapsed_time(backward_end)
                    optimizer_time = optimizer_start.elapsed_time(optimizer_end)
                    iter_time = (current_time - last_log_time) * 1000 / args.log_interval  # avg ms per iteration since last log
                    # total_time_ms = data_time + forward_time + backward_time + optimizer_time

                    # Print the profiling results
                    if (step + 1) % (args.log_interval * args.profile_interval) == 0:
                        Logger(f"Profiling (Avg/iter over last {args.log_interval} steps) - "
                               f"Data: {data_time/args.log_interval:.2f}ms, "
                               f"Fwd: {forward_time/args.log_interval:.2f}ms, "
                               f"Bwd: {backward_time/args.log_interval:.2f}ms, "
                               f"Optim: {optimizer_time/args.log_interval:.2f}ms, "
                               f"Iter Time: {iter_time:.2f}ms", accelerator)
                        # Reset the events so the next measurement starts from zero
                        data_start = torch.cuda.Event(enable_timing=True)
                        data_end = torch.cuda.Event(enable_timing=True)
                        forward_start = torch.cuda.Event(enable_timing=True)
                        forward_end = torch.cuda.Event(enable_timing=True)
                        backward_start = torch.cuda.Event(enable_timing=True)
                        backward_end = torch.cuda.Event(enable_timing=True)
                        optimizer_start = torch.cuda.Event(enable_timing=True)
                        optimizer_end = torch.cuda.Event(enable_timing=True)

                # Current learning rate
                current_lr = optimizer.param_groups[0]['lr']

                # Timing calculations
                epoch_elapsed_time = current_time - epoch_start_time
                epoch_steps_done = step + 1
                epoch_avg_step_time = epoch_elapsed_time / epoch_steps_done
                epoch_remaining_time = epoch_avg_step_time * (total_steps_in_epoch - epoch_steps_done)

                total_elapsed_time = current_time - overall_start_time
                total_steps_done = epoch * total_steps_in_epoch + epoch_steps_done
                total_avg_step_time = total_elapsed_time / total_steps_done if total_steps_done > 0 else 0
                total_remaining_time = total_avg_step_time * (total_training_steps - total_steps_done) if total_steps_done > 0 else 0

                # Training speed (based on the most recent log_interval)
                interval_elapsed_time = current_time - last_log_time
                tokens_processed_interval = args.log_interval * args.batch_size * args.max_seq_len
                tokens_per_sec = tokens_processed_interval / interval_elapsed_time if interval_elapsed_time > 0 else 0
                last_log_time = current_time  # update the last log time

                Logger(f"Epoch {epoch+1}/{args.epochs}, Step {step+1}/{total_steps_in_epoch}, "
                       f"Loss: {loss.item()*args.accumulation_steps:.4f}, "
                       f"LR: {current_lr:.6f}, "
                       f"Speed: {tokens_per_sec:.2f} tokens/sec | "
                       f"Epoch Time Left: {format_time(epoch_remaining_time)} | "
                       f"Total Time Left: {format_time(total_remaining_time)}", accelerator)

            # Save the model (main process only)
            if (step + 1) % args.save_interval == 0 and accelerator.is_main_process:
                # Use the moe_path variable defined at the start of the function
                ckp = f'{args.save_dir}/pretrain_{args.dim}{moe_path}.pth'

                # Get the unwrapped model
                unwrapped_model = accelerator.unwrap_model(model)

                # Save the model parameters
                accelerator.save(unwrapped_model.state_dict(), ckp)
                Logger(f"Model saved to {ckp}", accelerator)

        except Exception as e:
            Logger(f"Error in training step: {e}", accelerator)
            import traceback
            Logger(traceback.format_exc(), accelerator)

def main():
    parser = argparse.ArgumentParser(description="MiniMind Pretraining with Accelerate")
    parser.add_argument("--out_dir", type=str, default="out")
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--batch_size", type=int, default=24)
    parser.add_argument("--learning_rate", type=float, default=2e-4)
    parser.add_argument("--dtype", type=str, default="bfloat16")
    parser.add_argument("--use_wandb", default=True, action="store_true")
    parser.add_argument("--wandb_project", type=str, default="MiniMind-Pretrain")
    parser.add_argument("--num_workers", type=int, default=48)
    parser.add_argument("--accumulation_steps", type=int, default=32)
    parser.add_argument("--grad_clip", type=float, default=1.0)
    parser.add_argument("--warmup_iters", type=int, default=0)
    parser.add_argument("--log_interval", type=int, default=100)
    parser.add_argument("--save_interval", type=int, default=10000)
    parser.add_argument('--dim', default=1024, type=int)
    parser.add_argument('--n_layers', default=32, type=int)
    parser.add_argument('--max_seq_len', default=1024, type=int)
    parser.add_argument('--use_moe', default=False, type=bool)
    parser.add_argument('--disable_db', action='store_true', help="Disable the database feature and use a fixed value of 1e-4 instead")
    parser.add_argument("--data_path", type=str, default="./dataset/pretrain_hq.jsonl")
    parser.add_argument("--pretrained_embedding_path", type=str, default=None, help="Path to pretrained token embedding weights (.pth file)")
    parser.add_argument("--profile", action="store_true", default=True, help="Enable profiling")
    parser.add_argument("--profile_interval", type=int, default=10, help="Profiling print interval (in steps)")
    parser.add_argument("--use_flash_attn", action="store_true", default=True, help="Enable FlashAttention")
    parser.add_argument("--knowlwdge_num", type=int, default=64*64, help="Number of entries in the knowledge base")
    parser.add_argument("--knowlwdge_length", type=int, default=8, help="Sentence length of knowledge-base entries")
    args = parser.parse_args()

    # Initialize the accelerator
    # Set ddp_kwargs to handle unused parameters
    ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
    # Create the DeepSpeedPlugin
    ds_plugin = DeepSpeedPlugin(
        gradient_accumulation_steps=args.accumulation_steps,
        gradient_clipping=args.grad_clip,
        zero_stage=2,  # use ZeRO-2 optimization
        offload_optimizer_device="cpu",  # offload optimizer state to CPU
        offload_param_device="none",  # do not offload parameters to CPU
    )
    accelerator = Accelerator(
        kwargs_handlers=[ddp_kwargs],
        deepspeed_plugin=ds_plugin,
        mixed_precision="bf16" if args.dtype == "bfloat16" else "fp16" if args.dtype == "float16" else "no"
    )
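The chained conditional that picks mixed_precision can also be read as a lookup table; a minimal equivalent sketch, not part of the diff, with the variable dtype_flag standing in for what --dtype would supply:

# Mapping from the --dtype flag to Accelerate's mixed_precision mode.
# Any other value falls back to "no", matching the chained conditional above.
MIXED_PRECISION_BY_DTYPE = {"bfloat16": "bf16", "float16": "fp16", "float32": "no"}
dtype_flag = "bfloat16"
mixed_precision = MIXED_PRECISION_BY_DTYPE.get(dtype_flag, "no")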

    # Set the random seed
    set_seed(1337 + accelerator.process_index)

    # Configure the model
    lm_config = LMConfig(
        dim=args.dim,
        n_layers=args.n_layers,
        max_seq_len=args.max_seq_len,
        use_moe=args.use_moe,
        disable_db=args.disable_db,
        flash_attn=args.use_flash_attn,
        knowlwdge_num=args.knowlwdge_num,
        knowlwdge_length=args.knowlwdge_length
    )

    # Create the save directories
    args.save_dir = os.path.join(args.out_dir)
    if accelerator.is_main_process:
        os.makedirs(args.save_dir, exist_ok=True)
        os.makedirs(args.out_dir, exist_ok=True)

    # Tokens processed per iteration
    tokens_per_iter = args.batch_size * lm_config.max_seq_len
    Logger(f"tokens_per_iter: {tokens_per_iter}", accelerator)
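With the default arguments this works out as follows (a worked example per process, not output from the script):

# batch_size=24, max_seq_len=1024, accumulation_steps=32
batch_size, max_seq_len, accumulation_steps = 24, 1024, 32
tokens_per_iter = batch_size * max_seq_len                 # 24 * 1024 = 24,576 tokens per forward pass
tokens_per_update = tokens_per_iter * accumulation_steps   # 24,576 * 32 = 786,432 tokens per optimizer update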

    # Set the data type
    pt_dtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[args.dtype]

    # Set the wandb run name
    args.wandb_run_name = f"MiniMind-Pretrain-Epoch-{args.epochs}-BatchSize-{args.batch_size}-LearningRate-{args.learning_rate}"

    # Set up the automatic mixed-precision context
    ctx = nullcontext() if accelerator.device.type == "cpu" else torch.cuda.amp.autocast(dtype=pt_dtype)

    # Initialize the model and tokenizer
    model, tokenizer = init_model(lm_config, args.pretrained_embedding_path)
    # Pass the accelerator so the Logger call only prints on the main process
    Logger(f'Model initialization complete', accelerator)

    # Handle the positional-encoding tensor
    # The complex pos_cis has already been replaced by the real-valued pos_cis_real,
    # but to be safe it is still excluded from distributed training
    if hasattr(model, "pos_cis_real"):
        Logger(f'Detected real-valued pos_cis_real tensor; excluding it from distributed training', accelerator)
        # Set the model's _ddp_params_and_buffers_to_ignore attribute
        model._ddp_params_and_buffers_to_ignore = {"pos_cis_real"}
    # For backward compatibility, check whether pos_cis still exists
    elif hasattr(model, "pos_cis"):
        Logger(f'Detected complex pos_cis tensor; excluding it from distributed training', accelerator)
        # Set the model's _ddp_params_and_buffers_to_ignore attribute
        model._ddp_params_and_buffers_to_ignore = {"pos_cis"}

    # Create the dataset and data loader
    train_ds = PretrainDataset(args.data_path, tokenizer, max_length=lm_config.max_seq_len)
    train_loader = DataLoader(
        train_ds,
        batch_size=args.batch_size,
        pin_memory=True,
        drop_last=False,
        shuffle=True,
        num_workers=args.num_workers,
        persistent_workers=True if args.num_workers > 0 else False,
        prefetch_factor=2 if args.num_workers > 0 else None
    )

    # Create the optimizer
    optimizer = optim.AdamW(model.parameters(), lr=args.learning_rate)

    # Create the learning-rate scheduler
    total_steps = len(train_loader) * args.epochs
    warmup_steps = args.warmup_iters if args.warmup_iters > 0 else int(0.1 * total_steps)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=warmup_steps,
        num_training_steps=total_steps
    )
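Since --warmup_iters defaults to 0, the warmup falls back to 10% of the total step count; a worked example with hypothetical dataset sizes, not figures taken from the repository:

import math
# Hypothetical: 1,000,000 samples, batch_size=24, epochs=3
steps_per_epoch = math.ceil(1_000_000 / 24)   # 41,667 (drop_last=False keeps the short final batch)
total_steps = steps_per_epoch * 3             # 125,001
warmup_steps = int(0.1 * total_steps)         # 12,500 steps of linear warmup, then cosine decay to 0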

    # Prepare for training
    model, optimizer, train_loader, scheduler = accelerator.prepare(
        model, optimizer, train_loader, scheduler
    )

    # Initialize wandb
    if args.use_wandb and accelerator.is_main_process:
        import wandb
        wandb.init(project=args.wandb_project, name=args.wandb_run_name, config=args)
    else:
        wandb = None

    # Training loop
    overall_start_time = time.time()  # Record overall start time
    for epoch in range(args.epochs):
        train_epoch(epoch, accelerator, model, train_loader, optimizer, scheduler, args, ctx, overall_start_time)  # Pass overall start time

    # Close wandb
    if args.use_wandb and accelerator.is_main_process:
        wandb.finish()


if __name__ == "__main__":
    main()