Compare commits


No commits in common. "fc688ddde422bdba2b53a806188b82235c502433" and "decec67b78020abd5d7c5bf22788ee02c04aad2d" have entirely different histories.

34 changed files with 1907 additions and 1201 deletions

128
CODE_OF_CONDUCT.md Normal file

@ -0,0 +1,128 @@
# Contributor Covenant Code of Conduct
## Our Pledge
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our
community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or
advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
.
All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series
of actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within
the community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see the FAQ at
https://www.contributor-covenant.org/faq. Translations are available at
https://www.contributor-covenant.org/translations.

199
README.md Normal file

@ -0,0 +1,199 @@
<div align="center">
![logo](./images/logo.png)
</div>
<div align="center">
![visitors](https://visitor-badge.laobi.icu/badge?page_id=jingyaogong/minimind)
[![GitHub Repo stars](https://img.shields.io/github/stars/jingyaogong/minimind?style=social)](https://github.com/jingyaogong/minimind/stargazers)
[![GitHub Code License](https://img.shields.io/github/license/jingyaogong/minimind)](LICENSE)
[![GitHub last commit](https://img.shields.io/github/last-commit/jingyaogong/minimind)](https://github.com/jingyaogong/minimind/commits/master)
[![GitHub pull request](https://img.shields.io/badge/PRs-welcome-blue)](https://github.com/jingyaogong/minimind/pulls)
[![Collection](https://img.shields.io/badge/🤗-MiniMind%20%20Collection-blue)](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)
</div>
# 📌 Data Introduction
## Ⅰ Tokenizer
A tokenizer maps words of natural language to numbers such as `0, 1, 36` via a "dictionary"; you can think of each number as the page on which a word appears in that dictionary.
You can build your own vocabulary and train a "dictionary" yourself; the code is in `./scripts/train_tokenizer.py` (for learning purposes only; unless really necessary there is no need to retrain, since MiniMind ships with its own tokenizer).
Alternatively, you can adopt the tokenizer of a well-known open-source LLM.
This is like using the Xinhua or Oxford dictionary directly: token compression is excellent, but the page count is huge (often hundreds of thousands of words and phrases).
A self-trained tokenizer gives you full control over the vocabulary's size and content; the downside is a low compression rate (e.g. "hello" might be split into the five separate tokens "h e l l o"), and rare words are hard to cover.
The choice of "dictionary" matters: the output of an LLM is essentially an N-way classification over the dictionary (a softmax over its N entries), which is then decoded back to natural language through the "dictionary".
Because MiniMind's size must be strictly controlled, and to avoid a top-heavy model (the embedding layer taking too large a share of the LLM's parameters), the shorter the vocabulary, the better.
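As a quick illustration, the bundled tokenizer can be loaded with `transformers` (the path below matches the one used by the training scripts; the sample text is arbitrary):
```python
from transformers import AutoTokenizer

# Load MiniMind's bundled tokenizer (vocab size 6,400)
tokenizer = AutoTokenizer.from_pretrained('./model/minimind_tokenizer')

ids = tokenizer.encode("hello")  # text -> token ids ("page numbers" in the dictionary)
text = tokenizer.decode(ids)     # token ids -> text
print(len(tokenizer), ids, text)
```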
<details style="color:rgb(128,128,128)">
<summary>Tokenizer comparison</summary>
The tokenizer vocabulary sizes of strong third-party open-source models (e.g. Yi, qwen, chatglm, mistral, Llama3) are as follows:
<table>
<tr><th>Tokenizer</th><th>Vocab size</th><th>Source</th></tr>
<tr><td>yi tokenizer</td><td>64,000</td><td>01.AI (China)</td></tr>
<tr><td>qwen2 tokenizer</td><td>151,643</td><td>Alibaba Cloud (China)</td></tr>
<tr><td>glm tokenizer</td><td>151,329</td><td>Zhipu AI (China)</td></tr>
<tr><td>mistral tokenizer</td><td>32,000</td><td>Mistral AI (France)</td></tr>
<tr><td>llama3 tokenizer</td><td>128,000</td><td>Meta (USA)</td></tr>
<tr><td>minimind tokenizer</td><td>6,400</td><td>Custom</td></tr>
</table>
> 👉 Updated 2024-09-17: to avoid ambiguity with past versions & to control size, all MiniMind models now use the minimind_tokenizer; all mistral_tokenizer versions are deprecated.
```
# Some thinking out loud
> Although the minimind_tokenizer vocabulary is small and its encoding/decoding efficiency is weaker than Chinese-friendly tokenizers such as qwen2 or glm,
> the MiniMind models use the self-trained minimind_tokenizer to keep the overall parameter count light and avoid an imbalance between the embedding layer and the compute layers (top-heaviness), since MiniMind's vocabulary size is only 6,400.
> In practical tests MiniMind has never failed to decode a rare word; the results are good.
> Because the custom vocabulary is compressed to 6,400 entries, the total parameter count of the LLM can be as low as 25.8M.
> The training data `tokenizer_train.jsonl` comes entirely from the "匠数大模型数据集" (Jiangshu dataset). This data is relatively secondary; feel free to pick any dataset if you want to train a tokenizer.
```
</details>
## Ⅱ Pretrain Data
After the painful lesson of MiniMind-V1, where low-quality pretraining data led to nonsensical output, on `2025-02-05` I decided to stop using large-scale unsupervised datasets for pretraining.
Instead, I extracted the Chinese portion of the [匠数大模型数据集](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data),
cleaned out roughly 1.6GB of corpus whose entries are `<512` characters long, and concatenated it directly into the pretraining file `pretrain_hq.jsonl` ("hq" for high
quality; of course it is not truly high yet, and improving data quality never ends).
The data format of `pretrain_hq.jsonl` is:
```json
{"text": "如何才能摆脱拖延症? 治愈拖延症并不容易,但以下建议可能有所帮助..."}
```
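For reference, a minimal sketch of iterating over this file (assuming it has been placed under `./dataset/` as described below):
```python
import json

# Stream the pretraining corpus line by line; each line is one JSON object
with open('./dataset/pretrain_hq.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        sample = json.loads(line)
        text = sample['text']  # the raw text used for next-token prediction
```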
## Ⅲ SFT Data
The [匠数大模型SFT数据集](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data)
"is a complete, uniformly formatted, and safe resource for large-model training and research.
It collects and curates a large amount of open-source data from public web sources, unifies the format and cleans the data,
and contains a Chinese dataset with 10M entries and an English dataset with 2M entries."
That is the official description. After downloading, the total volume is about 4B tokens, which certainly makes it suitable SFT data for a Chinese LLM.
However, the officially provided format is quite messy, and using all of it for SFT would be too costly.
I cleaned the official dataset a second time, removing entries polluted by symbols or noise; as before, only content with total length `<512`
was kept. The goal at this stage is to supplement, through large amounts of dialogue, the knowledge that pretraining lacks.
The exported file is `sft_512.jsonl` (~7.5GB).
The [Magpie-SFT dataset](https://www.modelscope.cn/organization/Magpie-Align)
collects ~1M high-quality conversations from Qwen2/2.5. I cleaned this data further and exported the portion with total length `<2048` as `sft_2048.jsonl` (~9GB),
and the portion with length `<1024` as `sft_1024.jsonl` (~5.5GB). Using large-model conversation data directly for SFT falls under "black-box distillation".
The data from the two SFT steps above was cleaned once more (keeping only content with a high share of Chinese characters) and filtered to conversations of length `<512`, giving `sft_mini_512.jsonl` (~1.2GB).
All SFT files `sft_X.jsonl` share the format:
```text
{
"conversations": [
{"role": "user", "content": "你好"},
{"role": "assistant", "content": "你好!"},
{"role": "user", "content": "再见"},
{"role": "assistant", "content": "再见!"}
]
}
```
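For training, each `conversations` list has to be rendered into a single token sequence. A minimal sketch, assuming the bundled tokenizer ships with a chat template:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('./model/minimind_tokenizer')

conversations = [
    {"role": "user", "content": "你好"},
    {"role": "assistant", "content": "你好!"},
]
# Render the conversation into one training string (requires a chat template)
prompt = tokenizer.apply_chat_template(conversations, tokenize=False)
```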
## Ⅳ RLHF Data
From the [Magpie-DPO dataset](https://www.modelscope.cn/datasets/Magpie-Align/MagpieLM-DPO-Data-v0.1):
roughly 200k preference pairs, all in English, generated from Llama3.1-70B/8B. They can be used to train a reward model and to steer response quality toward human preferences.
Here, entries with total length `<3000` were repackaged into `dpo.jsonl` (~0.9GB), containing the two fields `chosen` and `rejected`: `chosen`
is the preferred response, `rejected` the dispreferred one.
The data format of `dpo.jsonl` is:
```text
{
"chosen": [
{"content": "Q", "role": "user"},
{"content": "good answer", "role": "assistant"}
],
"rejected": [
{"content": "Q", "role": "user"},
{"content": "bad answer", "role": "assistant"}
]
}
```
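For reference, these pairs feed the standard DPO objective, which pushes the policy $\pi_\theta$ to prefer `chosen` ($y_w$) over `rejected` ($y_l$) relative to a frozen reference model $\pi_{\text{ref}}$:

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$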
## Ⅴ Reason Data
It has to be said: in February 2025, who could possibly outshine DeepSeek...
It also sparked my strong interest in RL-driven reasoning models; I have already reproduced R1-Zero with Qwen2.5.
If I find the time, and if it works (though the base model's capability is 99% likely insufficient), I will later update MiniMind with a reasoning model trained via RL rather than distillation.
Time is limited, and the fastest low-cost option is still direct (black-box) distillation.
Unable to resist R1's popularity, within just a few days several R1 distillation datasets have appeared: [R1-Llama-70B](https://www.modelscope.cn/datasets/Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B), [R1-Distill-SFT](https://www.modelscope.cn/datasets/AI-ModelScope/R1-Distill-SFT),
[Alpaca-Distill-R1](https://huggingface.co/datasets/shareAI/Alpaca-Distill-R1-ZH),
[deepseek_r1_zh](https://huggingface.co/datasets/jinliuxi/deepseek_r1_zh), etc.; purely Chinese data may be relatively scarce.
In the end I merged them and exported `r1_mix_1024.jsonl`, whose format matches `sft_X.jsonl`.
## Ⅵ More Datasets
[HqWu-HITCS/Awesome-Chinese-LLM](https://github.com/HqWu-HITCS/Awesome-Chinese-LLM)
is already collecting and organizing open-source models, applications, datasets, and tutorials related to Chinese LLMs, and it keeps tracking the latest progress in this area. Comprehensive and professional. Respect!
---
## Ⅶ Dataset Download
> [!NOTE]
> As of 2025-02-05, all datasets used for MiniMind's final training are open-sourced, so there is no need to preprocess large-scale data yourself; repeated data-processing work can be avoided.
MiniMind training datasets ([ModelScope](https://www.modelscope.cn/datasets/gongjy/minimind-dataset/files) | [HuggingFace](https://huggingface.co/datasets/jingyaogong))
> No need to clone everything; download only the files you need.
Place the downloaded dataset files in the `./dataset/` directory (✨ marks the recommended must-haves):
```bash
./dataset/
├── dpo.jsonl (909MB)
├── lora_identity.jsonl (22.8KB)
├── lora_medical.jsonl (34MB)
├── pretrain_hq.jsonl (1.6GB, ✨)
├── r1_mix_1024.jsonl (340MB)
├── sft_1024.jsonl (5.6GB)
├── sft_2048.jsonl (9GB)
├── sft_512.jsonl (7.5GB)
├── sft_mini_512.jsonl (1.2GB, ✨)
└── tokenizer_train.jsonl (1GB)
```
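A minimal download sketch using `huggingface_hub` (the `repo_id` below is an assumption; check the HuggingFace link above for the exact dataset id):
```python
from huggingface_hub import hf_hub_download

# Fetch a single file from the dataset repo into ./dataset/
hf_hub_download(
    repo_id="jingyaogong/minimind_dataset",  # assumed repo id, verify before use
    filename="pretrain_hq.jsonl",
    repo_type="dataset",
    local_dir="./dataset",
)
```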
<details style="color:rgb(128,128,128)">
<summary>Note: dataset overview</summary>
* `dpo.jsonl` --RLHF-stage dataset
* `lora_identity.jsonl` --self-identity dataset (e.g. "Who are you?" "I am minimind..."), recommended for LoRA training (also usable for full-parameter SFT; don't be limited by the name)
* `lora_medical.jsonl` --medical Q&A dataset, recommended for LoRA training (also usable for full-parameter SFT; don't be limited by the name)
* `pretrain_hq.jsonl`✨ --pretraining dataset, compiled from the 匠数 (Jiangshu) dataset
* `r1_mix_1024.jsonl` --DeepSeek-R1-1.5B distillation data; each entry is at most 1024 characters, so set max_seq_len=1024 for training
* `sft_1024.jsonl` --compiled from Qwen2.5 distillation data (a subset of sft_2048); each entry is at most 1024 characters, so set max_seq_len=1024 for training
* `sft_2048.jsonl` --compiled from Qwen2.5 distillation data; each entry is at most 2048 characters, so set max_seq_len=2048 for training
* `sft_512.jsonl` --compiled from the 匠数 SFT data; each entry is at most 512 characters, so set max_seq_len=512 for training
* `sft_mini_512.jsonl`✨ --minimal compilation of 匠数 SFT data + Qwen2.5 distillation data, for quickly training a Zero model; each entry is at most 512 characters, so set max_seq_len=512 for training
* `tokenizer_train.jsonl` --comes entirely from the "匠数大模型数据集"; relatively secondary, and retraining the tokenizer yourself is not recommended (see the reasons above), but if you do, feel free to pick any dataset.
</details>
![dataset](./images/dataset.jpg)
<details style="color:rgb(128,128,128)">
<summary>Notes & recommended training plans</summary>
* The MiniMind2 series was trained on roughly 20GB of corpus in total, about 4B tokens, corresponding to the data combination above (cost: 💰💰💰💰💰💰💰💰, result: 😊😊😊😊😊😊).
* For the fastest possible from-scratch Zero model, the `pretrain_hq.jsonl` + `sft_mini_512.jsonl` combination is recommended; costs and results are in the table below (cost: 💰, result: 😊😊).
* If you have some compute budget or care more about quality, consider the former (fully reproducing MiniMind2); if you only have a single GPU or want a quick reproduction in a short time, the latter is strongly recommended!
* [Middle ground] You can also freely mix medium-scale data such as `sft_mini_512.jsonl` and `sft_1024.jsonl` (cost: 💰💰💰, result: 😊😊😊😊).
</details>


@ -1,126 +0,0 @@
# Distributed Training with Accelerate + DeepSpeed
This document describes how to run distributed training of the MiniMind model using Accelerate and DeepSpeed.
## Environment Setup
First, make sure the required dependencies are installed:
```bash
pip install accelerate deepspeed
```
## Configuration Files
### 1. DeepSpeed configuration file (ds_config.json)
The DeepSpeed configuration defines the optimizer, learning-rate scheduler, and ZeRO optimization parameters. The main settings are:
- **ZeRO optimization**: ZeRO-2 is used, which reduces GPU memory usage
- **Optimizer settings**: the AdamW optimizer is used
- **Mixed-precision training**: FP16 and BF16 are supported
- **Gradient accumulation**: set to "auto" so it stays consistent with the training-script arguments
### 2. Accelerate configuration file (accelerate_config.yaml)
The Accelerate configuration defines the basic distributed-training setup, including:
- **Distributed type**: DeepSpeed
- **Mixed precision**: BF16
- **Number of processes**: 4 (adjust to your GPU count)
- **DeepSpeed config**: points to the ds_config.json file
## Training Script
The new training script `train_pretrain_accelerate.py` is adapted from the original `train_pretrain.py`; the main changes are:
1. Uses Accelerator instead of PyTorch's native distributed functionality
2. Removes the torchrun-related distributed initialization code
3. Uses the Accelerator API to prepare the model, optimizer, and data loader
4. Uses the Accelerator API for backpropagation and gradient clipping
5. Handles the positional-encoding and unused-parameter issues (see the sketch below)
## Launching Training
There are two ways to launch training:
### Method 1: use the pre-built accelerate configuration file
```bash
accelerate launch --config_file accelerate_config.yaml train_pretrain_accelerate.py \
--epochs 3 \
--batch_size 24 \
--learning_rate 2e-4 \
--dtype bfloat16 \
--accumulation_steps 32 \
--grad_clip 1.0 \
--log_interval 100 \
--save_interval 10000 \
--dim 1024 \
--n_layers 32 \
--max_seq_len 1024 \
--use_flash_attn \
--profile \
--profile_interval 10
```
### Method 2: configure accelerate directly via command-line arguments
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
--multi_gpu \
--num_processes=4 \
--mixed_precision=bf16 \
--main_process_port=29500 \
--deepspeed_config_file ds_config.json \
train_pretrain_accelerate.py \
--epochs 3 \
--batch_size 24 \
--learning_rate 2e-4 \
--dtype bfloat16 \
--accumulation_steps 32 \
--grad_clip 1.0 \
--log_interval 100 \
--save_interval 10000 \
--dim 1024 \
--n_layers 32 \
--max_seq_len 1024 \
--use_flash_attn \
--profile \
--profile_interval 10
```
You can also simply use the provided script:
```bash
bash run_accelerate.sh
```
## How the Accelerate and DeepSpeed Configurations Relate
1. **Accelerate** is a high-level API that simplifies setting up and launching distributed training; it can be used with several distributed backends (DeepSpeed, FSDP, etc.).
2. **DeepSpeed** is an optimization library focused on memory optimization and performance for large-scale model training, providing features such as ZeRO.
3. **How the configurations relate**:
- The Accelerate configuration (YAML) chooses the distributed backend and the basic distributed settings
- The DeepSpeed configuration (JSON) defines DeepSpeed-specific optimization parameters
- Accelerate references the DeepSpeed configuration through the `deepspeed_config_file` field
## Notes
1. **Positional-encoding handling**:
- In the model, `pos_cis` is a complex-valued tensor that needs special handling in distributed training
- The new training script handles this through the Accelerator API; `_ddp_params_and_buffers_to_ignore` is no longer needed
2. **Unused parameters**:
- The original code used `find_unused_parameters=True` to handle unused parameters
- The new training script uses the Accelerator API directly, which handles this automatically
3. **Mixed-precision training**:
- `fp16` and `bf16` in the DeepSpeed configuration are set to `"auto"`
- The precision actually used is decided by Accelerate's `--mixed_precision` argument
4. **Gradient accumulation**:
- `gradient_accumulation_steps` in the DeepSpeed configuration is set to `"auto"`
- The actual number of accumulation steps is decided by the training script's `--accumulation_steps` argument

1509
README_en.md Normal file

File diff suppressed because it is too large.

accelerate_config.yaml

@ -1,17 +0,0 @@
compute_environment: LOCAL_MACHINE
deepspeed_config:
deepspeed_config_file: ds_config.json
zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

ds_config.json

@ -1,49 +0,0 @@
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients": true
},
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"steps_per_print": 100,
"wall_clock_breakdown": false
}

New binary files (images):
* images/1-wiki.png (136 KiB)
* images/2-wiki.png (73 KiB)
* images/3-wiki.png (230 KiB)
* images/4-wiki.png (104 KiB)
* images/5-wiki.png (239 KiB)
* (filename not shown) (121 KiB)
* images/LLM-structure.png (372 KiB, executable)
* images/and_huggingface.png (178 KiB)
* images/and_modelscope.png (150 KiB)
* images/compare_radar.png (519 KiB)
* images/dataset.jpg (146 KiB)
* images/gpt3_config.png (66 KiB)
* images/logo.png (495 KiB)
* images/logo2.png (615 KiB)
* images/minimind2.gif (3.8 MiB)
* images/pre_512_loss.png (559 KiB)
* images/pre_768_loss.png (531 KiB)
* images/sft_512_loss.png (1006 KiB)
* images/sft_768_loss.png (943 KiB)

model/LMConfig.py

@ -36,9 +36,6 @@ class LMConfig(PretrainedConfig):
aux_loss_alpha: float = 0.1,
seq_aux: bool = True,
norm_topk_prob: bool = True,
####################################################
knowlwdge_num: int = 64*64,
knowlwdge_length: int = 8,
**kwargs,
):
self.dim = dim
@ -69,7 +66,4 @@ class LMConfig(PretrainedConfig):
self.aux_loss_alpha = aux_loss_alpha # alpha coefficient of the auxiliary loss
self.seq_aux = seq_aux # whether to compute the auxiliary loss at the sequence level
self.norm_topk_prob = norm_topk_prob # whether to normalize the top-k probabilities
####################################################
self.knowlwdge_num = knowlwdge_num
self.knowlwdge_length = knowlwdge_length
super().__init__(**kwargs)

model/dataset.py

@ -10,7 +10,7 @@ from sklearn.model_selection import train_test_split
import os
import ast
os.environ["TOKENIZERS_PARALLELISM"] = "true"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
class PretrainDataset(Dataset):

model/model.py

@ -31,7 +31,7 @@ class RMSNorm(torch.nn.Module):
def forward(self, x):
return self.weight * self._norm(x.float()).type_as(x)
# precompute_pos_cis precomputes the positional encodings (complex-valued version)
# precompute_pos_cis precomputes the positional encodings
def precompute_pos_cis(dim: int, end: int = int(32 * 1024), theta: float = 1e6):
freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
t = torch.arange(end, device=freqs.device) # type: ignore
@ -39,7 +39,7 @@ def precompute_pos_cis(dim: int, end: int = int(32 * 1024), theta: float = 1e6):
pos_cis = torch.polar(torch.ones_like(freqs), freqs) # complex64
return pos_cis
# apply_rotary_emb applies the rotary positional encoding (complex-valued version)
# apply_rotary_emb applies the rotary positional encoding
def apply_rotary_emb(xq, xk, pos_cis):
def unite_shape(pos_cis, x):
ndim = x.ndim
@ -55,92 +55,6 @@ def apply_rotary_emb(xq, xk, pos_cis):
xk_out = torch.view_as_real(xk_ * pos_cis).flatten(3)
return xq_out.type_as(xq), xk_out.type_as(xk)
# precompute_pos_cis_real precomputes the positional encodings (real-valued version).
def precompute_pos_cis_real(dim: int, end: int = int(32 * 1024), theta: float = 1e6):
"""Implement the positional encoding with real tensors, avoiding complex tensors
This function is fully equivalent to precompute_pos_cis but uses real rather than complex tensors
The original function produces a complex tensor of shape [seq_len, dim//2] with unit magnitude and the rotation angle as its phase
This function produces a real tensor of shape [seq_len, dim] with cos(angle) at even indices and sin(angle) at odd indices
"""
# make sure dim is even
if dim % 2 != 0:
raise ValueError(f"dim must be even, but got {dim}")
# replicate the frequency computation of the original function
freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
t = torch.arange(end, device=freqs.device)
freqs = torch.outer(t, freqs).float()
# compute the cos and sin values
# in the complex version, pos_cis = torch.polar(torch.ones_like(freqs), freqs)
# which equals cos(freqs) + i*sin(freqs)
cos = torch.cos(freqs)
sin = torch.sin(freqs)
# build the real tensor, interleaving cos and sin
pos_emb = torch.zeros((end, dim), device=freqs.device)
pos_emb[:, 0::2] = cos # cos at even indices
pos_emb[:, 1::2] = sin # sin at odd indices
return pos_emb
# apply_rotary_emb_real applies the rotary positional encoding (real-valued version).
def apply_rotary_emb_real(xq, xk, pos_emb):
"""Apply the rotary positional encoding with real tensors, avoiding complex tensors
This function is fully equivalent to apply_rotary_emb but uses real rather than complex tensors
The original converts the inputs to complex form, multiplies by the positional encoding, and converts back to real
This function performs the same rotation directly with real arithmetic
"""
# shape information
bsz, seq_len, n_heads, head_dim = xq.shape
# make sure pos_emb has the right shape
assert pos_emb.shape[0] >= seq_len, f"positional encoding length {pos_emb.shape[0]} is smaller than sequence length {seq_len}"
assert pos_emb.shape[1] == head_dim, f"positional encoding dim {pos_emb.shape[1]} does not match head dim {head_dim}"
# slice the positional encodings to the needed length
pos_emb = pos_emb[:seq_len]
# reshape pos_emb for broadcasting: [1, seq_len, 1, head_dim]
pos_emb = pos_emb.unsqueeze(0).unsqueeze(2)
# split head_dim in half
half_head_dim = head_dim // 2
# extract cos and sin (even indices are cos, odd indices are sin)
cos = pos_emb[..., 0::2]
sin = pos_emb[..., 1::2]
# rearrange xq and xk for the rotation
# in the original complex version, xq and xk are reshaped into complex tensors with interleaved real and imaginary parts
# in the real version we handle even and odd indices separately
# separate the even and odd indices
xq_even = xq[..., 0::2] # even indices, the "real part"
xq_odd = xq[..., 1::2] # odd indices, the "imaginary part"
xk_even = xk[..., 0::2]
xk_odd = xk[..., 1::2]
# apply the rotation (equivalent to complex multiplication)
# (a + bi)(cos + sin*i) = (a*cos - b*sin) + (a*sin + b*cos)i
# where a is the even index and b the odd index
xq_out_even = xq_even * cos - xq_odd * sin # new even indices (real part)
xq_out_odd = xq_even * sin + xq_odd * cos # new odd indices (imaginary part)
xk_out_even = xk_even * cos - xk_odd * sin
xk_out_odd = xk_even * sin + xk_odd * cos
# recombine the even and odd indices
xq_out = torch.zeros_like(xq)
xk_out = torch.zeros_like(xk)
xq_out[..., 0::2] = xq_out_even
xq_out[..., 1::2] = xq_out_odd
xk_out[..., 0::2] = xk_out_even
xk_out[..., 1::2] = xk_out_odd
return xq_out.type_as(xq), xk_out.type_as(xk)
# repeat_kv repeats key-value pairs.
def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
"""torch.repeat_interleave(x, dim=2, repeats=n_rep)"""
@ -179,6 +93,8 @@ class Attention(nn.Module):
def forward(self,
x: torch.Tensor,
pos_cis: torch.Tensor,
past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
use_cache=True,
db_value=None):
bsz, seq_len, _ = x.shape # bsz: batch size, seq_len: sequence length, _: hidden dim
xq, xk, xv = self.wq(x), self.wk(x), self.wv(x) # project x through the linear layers wq, wk, wv to get queries, keys, and values
@ -186,13 +102,13 @@ class Attention(nn.Module):
xk = xk.view(bsz, seq_len, self.n_local_kv_heads, self.head_dim) # reshape xk to (bsz, seq_len, n_local_kv_heads, head_dim)
xv = xv.view(bsz, seq_len, self.n_local_kv_heads, self.head_dim) # reshape xv to (bsz, seq_len, n_local_kv_heads, head_dim)
# apply rotary positional encoding (real-valued version)
xq, xk = apply_rotary_emb_real(xq, xk, pos_cis)
# kv_cache implementation REMOVED
# if past_key_value is not None:
# xk = torch.cat([past_key_value[0], xk], dim=1)
# xv = torch.cat([past_key_value[1], xv], dim=1)
# past_kv = (xk, xv) if use_cache else None
# apply rotary positional encoding
xq, xk = apply_rotary_emb(xq, xk, pos_cis)
# kv_cache implementation
if past_key_value is not None:
xk = torch.cat([past_key_value[0], xk], dim=1)
xv = torch.cat([past_key_value[1], xv], dim=1)
past_kv = (xk, xv) if use_cache else None
# repeat key-value pairs
xq, xk, xv = (
@ -245,7 +161,7 @@ class Attention(nn.Module):
output = output.transpose(1, 2).reshape(bsz, seq_len, -1)
output = self.resid_dropout(self.wo(output))
return output
return output, past_kv
@ -457,7 +373,7 @@ class MiniMindBlock(nn.Module):
# self.product_key_topk = min(16, self.num_keys) # make sure it does not exceed num_keys
# self.num_experts_per_head_topk = 1 # number of experts finally selected per head
def forward(self, x, db_value, pos_cis):
def forward(self, x, db_value, pos_cis, past_key_value=None, use_cache=True):
# import pdb;pdb.set_trace()
# db_value = None
@ -502,9 +418,11 @@ class MiniMindBlock(nn.Module):
# attention computation
h_attn = self.attention(
h_attn, past_kv = self.attention(
self.attention_norm(x),
pos_cis,
past_key_value=past_key_value,
use_cache=use_cache,
db_value=db_value
)
@ -515,7 +433,7 @@ class MiniMindBlock(nn.Module):
# feed-forward network
out = h + self.feed_forward(self.ffn_norm(h))
return out
return out, past_kv
class ExtractDB(nn.Module):
def __init__(self,params):
@ -524,15 +442,15 @@ class ExtractDB(nn.Module):
self.batch_size = None
self.dim = params.dim
self.dim_key = self.dim // 2
self.knowlwdge_num = params.knowlwdge_num # 100 experts; make sure it is a perfect square
self.num_experts = 10 * 10 # 100 experts; make sure it is a perfect square
# set knowledge_dim equal to head_dim so it can be used directly in attention
self.head_dim = params.dim // params.n_heads
self.knowledge_length = params.knowlwdge_length*params.dim
self.knowledge_dim = 8*params.dim
# use register_buffer instead of nn.Parameter to avoid gradient issues
self.register_buffer('weight_down_embed', torch.randn(self.knowlwdge_num, self.knowledge_length) * 0.02)
self.register_buffer('weight_down_embed', torch.randn(self.num_experts, self.knowledge_dim) * 0.02)
self.num_keys = int(math.sqrt(self.knowlwdge_num)) if self.knowlwdge_num > 0 else 0
self.num_keys = int(math.sqrt(self.num_experts)) if self.num_experts > 0 else 0
self.product_key_topk = min(16, self.num_keys)
self.keys = nn.Parameter(torch.randn(self.num_keys, 2, self.dim_key) * 0.02)
self.num_experts_per_head_topk = 1
@ -630,19 +548,22 @@ class MiniMindLM(PreTrainedModel):
self.downsample_q_specific = nn.Sequential(
nn.Conv1d(128*8, 512, kernel_size=1, padding='same')
)
# use real-valued positional encodings to avoid segfaults that complex tensors can cause
self.register_buffer("pos_cis_real",
precompute_pos_cis_real(dim=params.dim // params.n_heads, theta=params.rope_theta),
self.register_buffer("pos_cis",
precompute_pos_cis(dim=params.dim // params.n_heads, theta=params.rope_theta),
persistent=False)
self.params = params
def forward(self,
input_ids: Optional[torch.Tensor] = None,
past_key_values: Optional[List[Tuple[torch.Tensor, torch.Tensor]]] = None,
use_cache: bool = False,
logits_to_keep: Union[int, torch.Tensor] = 0,
**args):
past_key_values = past_key_values or [None] * len(self.layers)
start_pos = args.get('start_pos', 0)
h = self.dropout(self.tok_embeddings(input_ids))
pos_cis_real = self.pos_cis_real[start_pos:start_pos + input_ids.size(1)]
pos_cis = self.pos_cis[start_pos:start_pos + input_ids.size(1)]
past_kvs = []
h_list = []
for l, layer in enumerate(self.layers):
@ -657,10 +578,13 @@ class MiniMindLM(PreTrainedModel):
index = self.extract_db.q_to_k(h)
db_value = self.extract_db.get_data(index)
h = layer(
h, db_value, pos_cis_real
h, past_kv = layer(
h, db_value, pos_cis,
past_key_value=past_key_values[l],
use_cache=use_cache
)
past_kvs.append(past_kv)
h_list.append(h.unsqueeze(0))
h_tensor = torch.cat(h_list, dim=0).permute(1, 0, 2, 3)
@ -687,6 +611,7 @@ class MiniMindLM(PreTrainedModel):
# further simplified; keep only the necessary parameters
output = CausalLMOutputWithPast(
logits=logits,
past_key_values=past_kvs,
)
output.hidden_states = h
@ -702,17 +627,17 @@ class MiniMindLM(PreTrainedModel):
@torch.inference_mode()
def generate(self, input_ids, eos_token_id=2, max_new_tokens=1024, temperature=0.75, top_p=0.90,
stream=False, rp=1., pad_token_id=0, num_return_sequences=1, **args):
stream=False, rp=1., use_cache=True, pad_token_id=0, num_return_sequences=1, **args):
# streaming generation
if stream:
return self._stream(input_ids, eos_token_id, max_new_tokens, temperature, top_p, rp, **args)
return self._stream(input_ids, eos_token_id, max_new_tokens, temperature, top_p, rp, use_cache, **args)
# direct generation
generated = []
for i in range(input_ids.size(0)):
non_pad = input_ids[i][input_ids[i] != pad_token_id].unsqueeze(0)
for _ in range(num_return_sequences):
out = self._stream(non_pad, eos_token_id, max_new_tokens, temperature, top_p, rp, **args)
out = self._stream(non_pad, eos_token_id, max_new_tokens, temperature, top_p, rp, use_cache, **args)
tokens_list = [tokens[:, -1:] for tokens in out]
gen = torch.cat(tokens_list, dim=-1) if tokens_list else non_pad
full_sequence = torch.cat([non_pad, gen], dim=-1)
@ -729,14 +654,15 @@ class MiniMindLM(PreTrainedModel):
res = output.view(input_ids.size(0) * num_return_sequences, -1)
return res
def _stream(self, input_ids, eos_token_id, max_new_tokens, temperature, top_p, rp, **args):
start, first_seq = input_ids.shape[1], True
def _stream(self, input_ids, eos_token_id, max_new_tokens, temperature, top_p, rp, use_cache, **args):
start, first_seq, past_kvs = input_ids.shape[1], True, None
while input_ids.shape[1] < max_new_tokens - 1:
if first_seq:
out, first_seq = self(input_ids, **args), False
if first_seq or not use_cache:
out, first_seq = self(input_ids, past_key_values=past_kvs, use_cache=use_cache, **args), False
else:
out = self(input_ids[:, -1:], start_pos=input_ids.shape[1] - 1, **args)
logits = out.logits[:, -1, :]
out = self(input_ids[:, -1:], past_key_values=past_kvs, use_cache=use_cache,
start_pos=input_ids.shape[1] - 1, **args)
logits, past_kvs = out.logits[:, -1, :], out.past_key_values
logits[:, list(set(input_ids.tolist()[0]))] /= rp
logits /= (temperature + 1e-9)
if top_p is not None and top_p < 1.0:

requirements.txt

@ -1,147 +1,30 @@
accelerate==1.6.0
aiohappyeyeballs==2.6.1
aiohttp==3.11.17
aiosignal==1.3.2
altair==5.5.0
annotated-types==0.7.0
anyio==4.9.0
async-timeout==5.0.1
attrs==25.3.0
blinker==1.9.0
cachetools==5.5.2
certifi==2025.1.31
charset-normalizer==3.4.1
click==8.1.8
contourpy==1.3.2
cycler==0.12.1
datasets==2.21.0
datasketch==1.6.4
deepspeed==0.16.7
dill==0.3.8
distro==1.9.0
docker-pycreds==0.4.0
einops==0.8.1
exceptiongroup==1.2.2
filelock==3.18.0
Flask==3.0.3
Flask-Cors==4.0.0
fonttools==4.57.0
frozenlist==1.6.0
fsspec==2024.6.1
gitdb==4.0.12
GitPython==3.1.44
h11==0.14.0
hjson==3.1.0
httpcore==1.0.8
httpx==0.28.1
huggingface-hub==0.30.2
idna==3.10
importlib_metadata==7.2.1
itsdangerous==2.2.0
Flask_Cors==4.0.0
jieba==0.42.1
Jinja2==3.1.2
jiter==0.9.0
joblib==1.4.2
jsonlines==4.0.0
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
kiwisolver==1.4.8
markdown-it-py==3.0.0
MarkupSafe==3.0.2
marshmallow==3.22.0
matplotlib==3.10.0
mdurl==0.1.2
modelscope==1.25.0
mpmath==1.3.0
msgpack==1.1.0
multidict==6.4.3
multiprocess==0.70.16
narwhals==1.35.0
networkx==3.4.2
ngrok==1.4.0
ninja==1.11.1.4
nltk==3.8
numpy==1.26.4
nvidia-cublas-cu11==11.11.3.6
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu11==11.8.87
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu11==11.8.89
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu11==11.8.89
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu11==9.1.0.70
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu11==10.9.0.58
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu11==10.3.0.86
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu11==11.4.1.48
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu11==11.7.5.86
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu11==2.21.5
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.8.93
nvidia-nvtx-cu11==11.8.86
nvidia-nvtx-cu12==12.1.105
openai==1.59.6
packaging==23.2
pandas==1.5.3
peft==0.7.1
pillow==10.4.0
platformdirs==4.3.7
propcache==0.3.1
protobuf==4.25.6
psutil==5.9.8
py-cpuinfo==9.0.0
pyarrow==19.0.1
pydantic==2.8.2
pydantic_core==2.20.1
pydeck==0.9.1
Pygments==2.19.1
pyparsing==3.2.3
python-dateutil==2.9.0.post0
pytz==2025.2
PyYAML==6.0.2
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rich==13.7.1
rpds-py==0.24.0
safetensors==0.5.3
scikit-learn==1.5.1
scipy==1.15.2
sentence-transformers==2.3.1
sentencepiece==0.2.0
sentry-sdk==2.26.1
setproctitle==1.3.5
scikit_learn==1.5.1
sentence_transformers==2.3.1
simhash==2.1.2
six==1.17.0
smmap==5.0.2
sniffio==1.3.1
streamlit==1.30.0
sympy==1.13.3
tenacity==8.5.0
threadpoolctl==3.6.0
tiktoken==0.5.1
tokenizers==0.21.1
toml==0.10.2
torch==2.7.0+cu118
torchvision==0.22.0+cu118
tornado==6.4.2
tqdm==4.67.1
transformers==4.48.0
triton==3.3.0
jinja2==3.1.2
jsonlines==4.0.0
trl==0.13.0
typing_extensions==4.13.2
tzlocal==5.3.1
ujson==5.1.0
urllib3==2.4.0
validators==0.34.0
wandb==0.18.3
watchdog==6.0.0
Werkzeug==3.1.3
xxhash==3.5.0
yarl==1.20.0
zipp==3.21.0
streamlit==1.30.0
torch==2.2.2
torchvision==0.17.2


@ -1,48 +0,0 @@
#!/bin/bash
# activate the conda environment
source $(conda info --base)/etc/profile.d/conda.sh
conda activate ycz_accelerate
# set environment variables to help with debugging
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
# Method 1: use the pre-built accelerate config file
# accelerate launch --config_file accelerate_config.yaml train_pretrain_accelerate.py \
# --epochs 3 \
# --batch_size 24 \
# --learning_rate 2e-4 \
# --dtype bfloat16 \
# --accumulation_steps 32 \
# --grad_clip 1.0 \
# --log_interval 100 \
# --save_interval 10000 \
# --dim 1024 \
# --n_layers 32 \
# --max_seq_len 1024 \
# --use_flash_attn \
# --profile \
# --profile_interval 10
# Method 2: configure accelerate directly via command-line arguments
CUDA_VISIBLE_DEVICES=0 accelerate launch \
--multi_gpu \
--num_processes=4 \
--mixed_precision=bf16 \
--main_process_port=29500 \
train_pretrain_accelerate.py \
--epochs 3 \
--batch_size 24 \
--learning_rate 2e-4 \
--dtype bfloat16 \
--accumulation_steps 32 \
--grad_clip 1.0 \
--log_interval 100 \
--save_interval 10000 \
--dim 512 \
--n_layers 12 \
--max_seq_len 512 \
--use_flash_attn \
--profile \
--profile_interval 10

run_accelerate.sh

@ -1,48 +0,0 @@
#!/bin/bash
# activate the conda environment
source $(conda info --base)/etc/profile.d/conda.sh
conda activate ycz_accelerate
# set environment variables to help with debugging
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
# Method 1: use the pre-built accelerate config file
# accelerate launch --config_file accelerate_config.yaml train_pretrain_accelerate.py \
# --epochs 3 \
# --batch_size 24 \
# --learning_rate 2e-4 \
# --dtype bfloat16 \
# --accumulation_steps 32 \
# --grad_clip 1.0 \
# --log_interval 100 \
# --save_interval 10000 \
# --dim 1024 \
# --n_layers 32 \
# --max_seq_len 1024 \
# --use_flash_attn \
# --profile \
# --profile_interval 10
# Method 2: configure accelerate directly via command-line arguments
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
--multi_gpu \
--num_processes=4 \
--mixed_precision=bf16 \
--main_process_port=29500 \
train_pretrain_accelerate.py \
--epochs 3 \
--batch_size 24 \
--learning_rate 2e-4 \
--dtype bfloat16 \
--accumulation_steps 32 \
--grad_clip 1.0 \
--log_interval 100 \
--save_interval 10000 \
--dim 1024 \
--n_layers 32 \
--max_seq_len 1024 \
--use_flash_attn \
--profile \
--profile_interval 10


@ -1,97 +0,0 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Tests for the real-valued positional encoding
"""
import torch
from model.model import precompute_pos_cis, precompute_pos_cis_real, apply_rotary_emb, apply_rotary_emb_real
from model.LMConfig import LMConfig
from model.model import MiniMindLM
def test_pos_encoding_equivalence():
"""Check that the complex and real versions of the positional encoding are equivalent"""
print("Testing positional-encoding equivalence...")
# parameter settings
dim = 64
seq_len = 10
# generate the complex-valued positional encodings
pos_cis = precompute_pos_cis(dim=dim, end=seq_len)
# generate the real-valued positional encodings
pos_cis_real = precompute_pos_cis_real(dim=dim, end=seq_len)
# create random queries and keys
batch_size = 2
n_heads = 4
head_dim = dim
xq = torch.randn(batch_size, seq_len, n_heads, head_dim)
xk = torch.randn(batch_size, seq_len, n_heads, head_dim)
# apply the complex version of the rotary positional encoding
xq_complex, xk_complex = apply_rotary_emb(xq, xk, pos_cis)
# apply the real version of the rotary positional encoding
xq_real, xk_real = apply_rotary_emb_real(xq, xk, pos_cis_real)
# compute the differences
q_diff = torch.abs(xq_complex - xq_real).mean().item()
k_diff = torch.abs(xk_complex - xk_real).mean().item()
print(f"query diff: {q_diff:.6f}")
print(f"key diff: {k_diff:.6f}")
# check that the differences are within tolerance
tolerance = 1e-5
if q_diff < tolerance and k_diff < tolerance:
print("✅ Test passed: the complex and real positional encodings are numerically equivalent")
else:
print("❌ Test failed: the complex and real positional encodings differ significantly")
def test_model_forward():
"""Test the model forward pass"""
print("\nTesting the model forward pass...")
# build the model config
config = LMConfig(
dim=128,
n_layers=2,
n_heads=4,
n_kv_heads=4, # make sure n_kv_heads is set and n_heads is divisible by n_kv_heads
vocab_size=1000,
max_seq_len=128,
disable_db=True # disable the database feature to avoid extra complexity
)
# build the model
try:
model = MiniMindLM(config)
print(f"✅ Model initialized successfully")
except Exception as e:
print(f"❌ Model initialization failed: {str(e)}")
return
# build the inputs
batch_size = 2
seq_len = 10
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
# forward pass
try:
with torch.no_grad():
outputs = model(input_ids)
print(f"✅ Forward pass succeeded")
print(f"output shape: {outputs.logits.shape}")
except Exception as e:
print(f"❌ Forward pass failed: {str(e)}")
if __name__ == "__main__":
# test positional-encoding equivalence
test_pos_encoding_equivalence()
# test the model forward pass
test_model_forward()

train_pretrain.py

@ -13,7 +13,6 @@ from torch import optim, nn
from torch.nn.parallel import DistributedDataParallel
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader, DistributedSampler
# removed the communication-profiling imports
from contextlib import nullcontext
from typing import Optional
@ -43,67 +42,18 @@ def train_epoch(epoch, wandb):
start_time = time.time()
# define moe_path at the start of the function to avoid referencing an undefined variable in exception handling
moe_path = '_moe' if lm_config.use_moe else ''
# add CUDA events for performance profiling
if args.profile and (not ddp or dist.get_rank() == 0):
data_start = torch.cuda.Event(enable_timing=True)
data_end = torch.cuda.Event(enable_timing=True)
forward_start = torch.cuda.Event(enable_timing=True)
forward_end = torch.cuda.Event(enable_timing=True)
backward_start = torch.cuda.Event(enable_timing=True)
backward_end = torch.cuda.Event(enable_timing=True)
optimizer_start = torch.cuda.Event(enable_timing=True)
optimizer_end = torch.cuda.Event(enable_timing=True)
# removed the CUDA-graph optimization code
# prefetch data
prefetch_factor = 2 # number of batches to prefetch
data_iter = iter(train_loader)
prefetch_batches = []
# prefetch the initial batches
for _ in range(min(prefetch_factor, len(train_loader))):
for step, (X, Y, loss_mask) in enumerate(train_loader):
try:
batch = next(data_iter)
prefetch_batches.append([t.to(args.device, non_blocking=True) for t in batch])
except StopIteration:
break
for step in range(len(train_loader)):
try:
# time the data loading
if args.profile and (not ddp or dist.get_rank() == 0):
data_start.record()
# use prefetched data
if prefetch_batches:
X, Y, loss_mask = prefetch_batches.pop(0)
else:
# if the prefetch queue is empty, load directly
X, Y, loss_mask = [t.to(args.device) for t in next(data_iter)]
# asynchronously prefetch the next batch
if step + prefetch_factor < len(train_loader):
try:
batch = next(data_iter)
prefetch_batches.append([t.to(args.device, non_blocking=True) for t in batch])
except StopIteration:
pass
if args.profile and (not ddp or dist.get_rank() == 0):
data_end.record()
# move the data to the device
X = X.to(args.device)
Y = Y.to(args.device)
loss_mask = loss_mask.to(args.device)
# update the learning rate
lr = get_lr(epoch * iter_per_epoch + step, args.epochs * iter_per_epoch, args.learning_rate)
for param_group in optimizer.param_groups:
param_group['lr'] = lr
# time the forward pass
if args.profile and (not ddp or dist.get_rank() == 0):
forward_start.record()
# regular forward pass
with ctx:
res = model(X)
loss = loss_fct(
@ -127,13 +77,6 @@ def train_epoch(epoch, wandb):
# on error, skip the auxiliary loss
loss = loss / args.accumulation_steps
# backward pass
scaler.scale(loss).backward()
if args.profile and (not ddp or dist.get_rank() == 0):
forward_end.record()
backward_start.record()
# Print data types for debugging
if step == 0 and (not ddp or dist.get_rank() == 0): # Print only for the first step of the first epoch on the main process
Logger("---- Data Type Check ----")
@ -146,21 +89,9 @@ def train_epoch(epoch, wandb):
Logger(f"loss.dtype: {loss.dtype}")
Logger("-------------------------")
if args.profile and (not ddp or dist.get_rank() == 0):
backward_end.record()
scaler.scale(loss).backward()
# profile every step, not only when gradient accumulation completes
if (step + 1) % args.profile_interval == 0:
# record the optimizer time (if this is a gradient-accumulation step)
if (step + 1) % args.accumulation_steps == 0:
optimizer_start.record()
# optimizer step
if (step + 1) % args.accumulation_steps == 0:
if args.profile and (not ddp or dist.get_rank() == 0):
if (step + 1) % args.profile_interval != 0:
optimizer_start.record()
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), args.grad_clip)
@ -169,40 +100,6 @@ def train_epoch(epoch, wandb):
optimizer.zero_grad(set_to_none=True)
if args.profile and (not ddp or dist.get_rank() == 0):
optimizer_end.record()
# profiling output (every profile_interval steps)
if args.profile and (not ddp or dist.get_rank() == 0) and (step + 1) % args.profile_interval == 0:
# synchronize the CUDA events for accurate timing
torch.cuda.synchronize()
# compute the time spent in each stage
data_time = data_start.elapsed_time(data_end)
forward_time = forward_start.elapsed_time(forward_end)
backward_time = backward_start.elapsed_time(backward_end)
# optimizer time only exists once a gradient-accumulation step has completed
if (step + 1) % args.accumulation_steps == 0:
optimizer_time = optimizer_start.elapsed_time(optimizer_end)
total_compute_time = forward_time + backward_time + optimizer_time
Logger(f"Profiling - step {step+1}:")
Logger(f"  data loading time: {data_time:.2f} ms")
Logger(f"  forward time: {forward_time:.2f} ms")
Logger(f"  backward time: {backward_time:.2f} ms")
Logger(f"  optimizer time: {optimizer_time:.2f} ms")
Logger(f"  total compute time: {total_compute_time:.2f} ms")
Logger(f"  compute/data ratio: {total_compute_time / data_time:.2f}")
else:
# not a gradient-accumulation step, so no optimizer time
total_compute_time = forward_time + backward_time
Logger(f"Profiling - step {step+1} (accumulating gradients):")
Logger(f"  data loading time: {data_time:.2f} ms")
Logger(f"  forward time: {forward_time:.2f} ms")
Logger(f"  backward time: {backward_time:.2f} ms")
Logger(f"  total compute time: {total_compute_time:.2f} ms")
Logger(f"  compute/data ratio: {total_compute_time / data_time:.2f}")
# logging
if step % args.log_interval == 0:
spend_time = time.time() - start_time
@ -217,39 +114,9 @@ def train_epoch(epoch, wandb):
spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60))
if (wandb is not None) and (not ddp or dist.get_rank() == 0):
log_dict = {
"loss": loss.item() * args.accumulation_steps,
"lr": optimizer.param_groups[-1]['lr'],
"epoch_Time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60
}
# if profiling is enabled, also log the performance metrics
if args.profile and (step + 1) % args.profile_interval == 0:
# basic performance metrics
perf_dict = {
"data_time_ms": data_time,
"forward_time_ms": forward_time,
"backward_time_ms": backward_time
}
# optimizer time only exists once a gradient-accumulation step has completed
if (step + 1) % args.accumulation_steps == 0:
total_compute_time = forward_time + backward_time + optimizer_time
perf_dict.update({
"optimizer_time_ms": optimizer_time,
"compute_time_ms": total_compute_time
})
else:
total_compute_time = forward_time + backward_time
perf_dict.update({
"compute_time_ms": total_compute_time
})
log_dict.update(perf_dict)
wandb.log(log_dict)
# removed the communication-profiling code
wandb.log({"loss": loss.item() * args.accumulation_steps,
"lr": optimizer.param_groups[-1]['lr'],
"epoch_Time": spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60})
# save the model
if (step + 1) % args.save_interval == 0 and (not ddp or dist.get_rank() == 0):
@ -309,9 +176,6 @@ def init_model(lm_config, pretrained_embedding_path: Optional[str] = None):
return model, tokenizer
# removed the communication-profiling function
def init_distributed_mode():
if not ddp: return # if distributed data parallel (DDP) is not enabled, return immediately without doing anything
global ddp_local_rank, DEVICE # declare these two as global variables so they can be accessed outside the function
@ -330,42 +194,35 @@ if __name__ == "__main__":
parser.add_argument("--out_dir", type=str, default="out")
# to get a Zero model as fast as possible, set epochs to 1; otherwise, make use of the limited data and train for 2-6 epochs.
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--batch_size", type=int, default=24)
parser.add_argument("--batch_size", type=int, default=8)
parser.add_argument("--learning_rate", type=float, default=2e-4)
parser.add_argument("--device", type=str, default="cuda:0" if torch.cuda.is_available() else "cpu") #如果GPU可用则使用GPU否则使用CPU。
parser.add_argument("--dtype", type=str, default="bfloat16")
parser.add_argument("--use_wandb", default=True, action="store_true")
parser.add_argument("--wandb_project", type=str, default="MiniMind-Pretrain")
parser.add_argument("--num_workers", type=int, default=48)
parser.add_argument("--num_workers", type=int, default=8)
parser.add_argument("--ddp", action="store_true")
parser.add_argument("--accumulation_steps", type=int, default=32) #梯度累积步数,用于控制梯度更新频率。
parser.add_argument("--accumulation_steps", type=int, default=64) #梯度累积步数,用于控制梯度更新频率。
parser.add_argument("--grad_clip", type=float, default=1.0) #梯度裁剪阈值,用于防止梯度爆炸。
parser.add_argument("--warmup_iters", type=int, default=0) #预热迭代次数,用于控制学习率预热过程。
parser.add_argument("--log_interval", type=int, default=100) #日志打印间隔,用于控制日志打印的频率。
parser.add_argument("--save_interval", type=int, default=10000) #模型保存间隔,用于控制模型保存的频率。
parser.add_argument("--save_interval", type=int, default=100) #模型保存间隔,用于控制模型保存的频率。
parser.add_argument('--local_rank', type=int, default=-1) # local process rank for distributed training
parser.add_argument('--dim', default=1024, type=int) # model width, controls the model size
parser.add_argument('--dim', default=2048, type=int) # model width, controls the model size
parser.add_argument('--n_layers', default=32, type=int) # number of layers
parser.add_argument('--max_seq_len', default=1024, type=int) # maximum input sequence length
parser.add_argument('--use_moe', default=False, type=bool) # whether to use MoE
parser.add_argument('--disable_db', action='store_true', help="disable the database feature and use a fixed value 1e-4 instead") # disables the database feature, enabling a special mode
parser.add_argument("--data_path", type=str, default="./dataset/pretrain_hq.jsonl") # dataset path
parser.add_argument("--pretrained_embedding_path", type=str, default=None, help="Path to pretrained token embedding weights (.pth file)")
# profiling-related arguments
parser.add_argument("--profile", action="store_true", default=True, help="enable profiling")
parser.add_argument("--profile_interval", type=int, default=10, help="profiling print interval (steps)")
parser.add_argument("--use_flash_attn", action="store_true", default=True, help="enable FlashAttention")
args = parser.parse_args()
print(args)
lm_config = LMConfig(
dim=args.dim,
n_layers=args.n_layers,
max_seq_len=args.max_seq_len,
use_moe=args.use_moe,
disable_db=args.disable_db, # pass the disable-db flag
flash_attn=args.use_flash_attn # add FlashAttention support
disable_db=args.disable_db # pass the disable-db flag
) # create the LMConfig object that controls the model configuration
args.save_dir = os.path.join(args.out_dir) # create the save directory
os.makedirs(args.save_dir, exist_ok=True) # create the save directory
@ -410,31 +267,24 @@ if __name__ == "__main__":
model, tokenizer = init_model(lm_config, args.pretrained_embedding_path)
train_ds = PretrainDataset(args.data_path, tokenizer, max_length=lm_config.max_seq_len)
train_sampler = DistributedSampler(train_ds) if ddp else None
# optimized DataLoader configuration
train_loader = DataLoader(
train_ds,
batch_size=args.batch_size,
pin_memory=True,
pin_memory_device=f"cuda:{ddp_local_rank}" if ddp else "cuda:0", # pin-memory target device
drop_last=False,
shuffle=False,
num_workers=args.num_workers,
sampler=train_sampler,
persistent_workers=True if args.num_workers > 0 else False, # keep worker processes alive
prefetch_factor=2 if args.num_workers > 0 else None # prefetch factor
sampler=train_sampler
)
# enable GradScaler only for float16 (bfloat16 does not need it)
scaler = torch.cuda.amp.GradScaler(enabled=(args.dtype == 'float16'))
scaler = torch.cuda.amp.GradScaler(enabled=(args.dtype in ['float16']))
optimizer = optim.AdamW(model.parameters(), lr=args.learning_rate)
if ddp:
model._ddp_params_and_buffers_to_ignore = {"pos_cis"}
# keep find_unused_parameters=True, since the model really does have unused parameters
# add find_unused_parameters=True to resolve the unused-parameter issue
model = DistributedDataParallel(model, device_ids=[ddp_local_rank], find_unused_parameters=True)
# keep set_detect_anomaly for now to aid debugging
# once training is stable, this line can be commented out for speed
torch.autograd.set_detect_anomaly(True)
iter_per_epoch = len(train_loader)
for epoch in range(args.epochs):

train_pretrain_accelerate.py

@ -1,398 +0,0 @@
import os
# set environment variables
os.environ["WANDB_MODE"] = "offline" # or use "dryrun"
import platform
import argparse
import time
import math
import warnings
import pandas as pd
import torch
from torch import optim, nn
from torch.utils.data import DataLoader
from contextlib import nullcontext
from typing import Optional
import datetime # Add datetime for time formatting
from accelerate import Accelerator
from accelerate.utils import set_seed
from accelerate.utils import DeepSpeedPlugin
from accelerate.utils import DistributedDataParallelKwargs
from transformers import AutoTokenizer, get_cosine_schedule_with_warmup
from model.model import MiniMindLM
from model.LMConfig import LMConfig
from model.dataset import PretrainDataset
warnings.filterwarnings('ignore')
# logging helper
def Logger(msg, accelerator=None):
# print only on the main process (or unconditionally when no accelerator is provided)
if accelerator is None or accelerator.is_main_process:
print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] {msg}")
# Helper function to format seconds into HH:MM:SS
def format_time(seconds):
return str(datetime.timedelta(seconds=int(seconds)))
# learning-rate schedule
def get_lr(it, num_iters, learning_rate):
# cosine learning-rate decay
return learning_rate * 0.5 * (1.0 + math.cos(math.pi * it / num_iters))
# model initialization
def init_model(lm_config, pretrained_embedding_path=None):
tokenizer = AutoTokenizer.from_pretrained('./model/minimind_tokenizer')
model = MiniMindLM(lm_config)
# load pretrained embedding weights if provided
if pretrained_embedding_path:
Logger(f"Loading pretrained token embeddings from {pretrained_embedding_path}")
pretrained_embeddings = torch.load(pretrained_embedding_path)
model.tok_embeddings.weight.data.copy_(pretrained_embeddings)
model.output.weight.data.copy_(pretrained_embeddings) # tied weights
Logger(f'Total LLM parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.3f} million')
return model, tokenizer
def train_epoch(epoch, accelerator, model, train_loader, optimizer, scheduler, args, ctx, overall_start_time):
loss_fct = nn.CrossEntropyLoss(reduction='none')
epoch_start_time = time.time()
total_steps_in_epoch = len(train_loader)
total_training_steps = args.epochs * total_steps_in_epoch
moe_path = '_moe' if args.use_moe else ''
# add CUDA events for performance profiling (main process only)
if args.profile and accelerator.is_main_process:
data_start = torch.cuda.Event(enable_timing=True)
data_end = torch.cuda.Event(enable_timing=True)
forward_start = torch.cuda.Event(enable_timing=True)
forward_end = torch.cuda.Event(enable_timing=True)
backward_start = torch.cuda.Event(enable_timing=True)
backward_end = torch.cuda.Event(enable_timing=True)
optimizer_start = torch.cuda.Event(enable_timing=True)
optimizer_end = torch.cuda.Event(enable_timing=True)
# prefetch data
prefetch_factor = 2 # number of batches to prefetch
data_iter = iter(train_loader)
prefetch_batches = []
# prefetch the initial batches
for _ in range(min(prefetch_factor, len(train_loader))):
try:
batch = next(data_iter)
prefetch_batches.append(batch)
except StopIteration:
break
# initialize the variables needed for logging before the loop starts
last_log_time = epoch_start_time
for step in range(total_steps_in_epoch):
try:
# time the data loading (main process only)
if args.profile and accelerator.is_main_process:
data_start.record()
# use prefetched data
if prefetch_batches:
X, Y, loss_mask = prefetch_batches.pop(0)
else:
# if the prefetch queue is empty, load directly
X, Y, loss_mask = next(data_iter)
# asynchronously prefetch the next batch
if step + prefetch_factor < len(train_loader):
try:
batch = next(data_iter)
prefetch_batches.append(batch)
except StopIteration:
pass
# end of data-loading timing (main process only)
if args.profile and accelerator.is_main_process:
data_end.record()
# update the learning rate
if scheduler is not None:
scheduler.step()
# time the forward pass (main process only)
if args.profile and accelerator.is_main_process:
forward_start.record()
# forward pass
with ctx:
res = model(X)
loss = loss_fct(
res.logits.view(-1, res.logits.size(-1)),
Y.view(-1)
).view(Y.size())
loss = (loss * loss_mask).sum() / loss_mask.sum()
# add the auxiliary loss if present
try:
aux_loss = sum(l.feed_forward.aux_loss for l in model.module.layers
if hasattr(l.feed_forward, 'aux_loss'))
loss += aux_loss
except Exception as e:
Logger(f"Warning: Could not add auxiliary loss: {e}")
# on error, skip the auxiliary loss
loss = loss / args.accumulation_steps
# end of forward-pass timing (main process only)
if args.profile and accelerator.is_main_process:
forward_end.record()
# time the backward pass (main process only)
if args.profile and accelerator.is_main_process:
backward_start.record()
# backward pass
# with DeepSpeed, gradient accumulation and gradient clipping are handled automatically
accelerator.backward(loss)
# end of backward-pass timing (main process only)
if args.profile and accelerator.is_main_process:
backward_end.record()
# time the optimizer step (main process only)
if args.profile and accelerator.is_main_process:
optimizer_start.record()
# optimizer step - with DeepSpeed, gradient accumulation and clipping are handled automatically
# the optimizer step only runs once the accumulation count is reached
# note: DeepSpeed handles gradient accumulation itself, so we do not need to check step % accumulation_steps
optimizer.step()
# with DeepSpeed, zero_grad() is called automatically after step()
# but we still call it explicitly to be safe
optimizer.zero_grad()
# end of optimizer-step timing (main process only)
if args.profile and accelerator.is_main_process:
optimizer_end.record()
# print training info (main process only)
if (step + 1) % args.log_interval == 0 and accelerator.is_main_process:
current_time = time.time()
# compute performance metrics
if args.profile:
torch.cuda.synchronize()
# compute the metrics from the time since the last log, not the total time
data_time = data_start.elapsed_time(data_end)
forward_time = forward_start.elapsed_time(forward_end)
backward_time = backward_start.elapsed_time(backward_end)
optimizer_time = optimizer_start.elapsed_time(optimizer_end)
iter_time = (current_time - last_log_time) * 1000 / args.log_interval # avg ms per iteration since last log
# total_time_ms = data_time + forward_time + backward_time + optimizer_time
# print the profiling results
if (step + 1) % (args.log_interval * args.profile_interval) == 0:
Logger(f"Profiling (Avg/iter over last {args.log_interval} steps) - "
f"Data: {data_time/args.log_interval:.2f}ms, "
f"Fwd: {forward_time/args.log_interval:.2f}ms, "
f"Bwd: {backward_time/args.log_interval:.2f}ms, "
f"Optim: {optimizer_time/args.log_interval:.2f}ms, "
f"Iter Time: {iter_time:.2f}ms", accelerator)
# reset the events so the next measurement starts from zero
data_start = torch.cuda.Event(enable_timing=True)
data_end = torch.cuda.Event(enable_timing=True)
forward_start = torch.cuda.Event(enable_timing=True)
forward_end = torch.cuda.Event(enable_timing=True)
backward_start = torch.cuda.Event(enable_timing=True)
backward_end = torch.cuda.Event(enable_timing=True)
optimizer_start = torch.cuda.Event(enable_timing=True)
optimizer_end = torch.cuda.Event(enable_timing=True)
# current learning rate
current_lr = optimizer.param_groups[0]['lr']
# timing calculations
epoch_elapsed_time = current_time - epoch_start_time
epoch_steps_done = step + 1
epoch_avg_step_time = epoch_elapsed_time / epoch_steps_done
epoch_remaining_time = epoch_avg_step_time * (total_steps_in_epoch - epoch_steps_done)
total_elapsed_time = current_time - overall_start_time
total_steps_done = epoch * total_steps_in_epoch + epoch_steps_done
total_avg_step_time = total_elapsed_time / total_steps_done if total_steps_done > 0 else 0
total_remaining_time = total_avg_step_time * (total_training_steps - total_steps_done) if total_steps_done > 0 else 0
# training speed (based on the most recent log_interval)
interval_elapsed_time = current_time - last_log_time
tokens_processed_interval = args.log_interval * args.batch_size * args.max_seq_len
tokens_per_sec = tokens_processed_interval / interval_elapsed_time if interval_elapsed_time > 0 else 0
last_log_time = current_time # update the last-log timestamp
Logger(f"Epoch {epoch+1}/{args.epochs}, Step {step+1}/{total_steps_in_epoch}, "
f"Loss: {loss.item()*args.accumulation_steps:.4f}, "
f"LR: {current_lr:.6f}, "
f"Speed: {tokens_per_sec:.2f} tokens/sec | "
f"Epoch Time Left: {format_time(epoch_remaining_time)} | "
f"Total Time Left: {format_time(total_remaining_time)}", accelerator)
# save the model (main process only)
if (step + 1) % args.save_interval == 0 and accelerator.is_main_process:
# use the moe_path variable defined at the start of the function
ckp = f'{args.save_dir}/pretrain_{args.dim}{moe_path}.pth'
# get the unwrapped model
unwrapped_model = accelerator.unwrap_model(model)
# save the model parameters
accelerator.save(unwrapped_model.state_dict(), ckp)
Logger(f"Model saved to {ckp}", accelerator)
except Exception as e:
Logger(f"Error in training step: {e}", accelerator)
import traceback
Logger(traceback.format_exc(), accelerator)
def main():
parser = argparse.ArgumentParser(description="MiniMind Pretraining with Accelerate")
parser.add_argument("--out_dir", type=str, default="out")
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--batch_size", type=int, default=24)
parser.add_argument("--learning_rate", type=float, default=2e-4)
parser.add_argument("--dtype", type=str, default="bfloat16")
parser.add_argument("--use_wandb", default=True, action="store_true")
parser.add_argument("--wandb_project", type=str, default="MiniMind-Pretrain")
parser.add_argument("--num_workers", type=int, default=48)
parser.add_argument("--accumulation_steps", type=int, default=32)
parser.add_argument("--grad_clip", type=float, default=1.0)
parser.add_argument("--warmup_iters", type=int, default=0)
parser.add_argument("--log_interval", type=int, default=100)
parser.add_argument("--save_interval", type=int, default=10000)
parser.add_argument('--dim', default=1024, type=int)
parser.add_argument('--n_layers', default=32, type=int)
parser.add_argument('--max_seq_len', default=1024, type=int)
parser.add_argument('--use_moe', default=False, type=bool)
parser.add_argument('--disable_db', action='store_true', help="disable the database feature and use a fixed value 1e-4 instead")
parser.add_argument("--data_path", type=str, default="./dataset/pretrain_hq.jsonl")
parser.add_argument("--pretrained_embedding_path", type=str, default=None, help="Path to pretrained token embedding weights (.pth file)")
parser.add_argument("--profile", action="store_true", default=True, help="启用性能分析")
parser.add_argument("--profile_interval", type=int, default=10, help="性能分析打印间隔(步数)")
parser.add_argument("--use_flash_attn", action="store_true", default=True, help="启用FlashAttention")
parser.add_argument("--knowlwdge_num", type=int, default=64*64,help="知识库的数据数目")
parser.add_argument("--knowlwdge_length", type=int, default=8,help="知识库的句子长度")
args = parser.parse_args()
# initialize the accelerator
# set ddp_kwargs to handle unused parameters
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
# create the DeepSpeedPlugin object
ds_plugin = DeepSpeedPlugin(
gradient_accumulation_steps=args.accumulation_steps,
gradient_clipping=args.grad_clip,
zero_stage=2, # use ZeRO-2 optimization
offload_optimizer_device="cpu", # offload optimizer state to the CPU
offload_param_device="none", # do not offload parameters to the CPU
)
accelerator = Accelerator(
kwargs_handlers=[ddp_kwargs],
deepspeed_plugin=ds_plugin,
mixed_precision="bf16" if args.dtype == "bfloat16" else "fp16" if args.dtype == "float16" else "no"
)
# set the random seed
set_seed(1337 + accelerator.process_index)
# configure the model
lm_config = LMConfig(
dim=args.dim,
n_layers=args.n_layers,
max_seq_len=args.max_seq_len,
use_moe=args.use_moe,
disable_db=args.disable_db,
flash_attn=args.use_flash_attn,
knowlwdge_num=args.knowlwdge_num,
knowlwdge_length=args.knowlwdge_length
)
# create the save directories
args.save_dir = os.path.join(args.out_dir)
if accelerator.is_main_process:
os.makedirs(args.save_dir, exist_ok=True)
os.makedirs(args.out_dir, exist_ok=True)
# compute the number of tokens per iteration
tokens_per_iter = args.batch_size * lm_config.max_seq_len
Logger(f"tokens_per_iter: {tokens_per_iter}", accelerator)
# set the data type
pt_dtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[args.dtype]
# set the wandb run name
args.wandb_run_name = f"MiniMind-Pretrain-Epoch-{args.epochs}-BatchSize-{args.batch_size}-LearningRate-{args.learning_rate}"
# set the autocast context for mixed precision
ctx = nullcontext() if accelerator.device.type == "cpu" else torch.cuda.amp.autocast(dtype=pt_dtype)
# initialize the model and tokenizer
model, tokenizer = init_model(lm_config, args.pretrained_embedding_path)
# pass the accelerator to the Logger calls inside init_model
Logger(f'Model initialization complete', accelerator)
# handle the positional-encoding tensor issue
# the complex-valued pos_cis has already been replaced by the real-valued pos_cis_real
# but to be safe, we still exclude it from distributed training
if hasattr(model, "pos_cis_real"):
Logger(f'Detected pos_cis_real (real-valued tensor); excluding it from distributed training', accelerator)
# set the model's _ddp_params_and_buffers_to_ignore attribute
model._ddp_params_and_buffers_to_ignore = {"pos_cis_real"}
# backward compatibility: check whether pos_cis still exists
elif hasattr(model, "pos_cis"):
Logger(f'Detected pos_cis (complex-valued tensor); excluding it from distributed training', accelerator)
# set the model's _ddp_params_and_buffers_to_ignore attribute
model._ddp_params_and_buffers_to_ignore = {"pos_cis"}
# create the dataset and data loader
train_ds = PretrainDataset(args.data_path, tokenizer, max_length=lm_config.max_seq_len)
train_loader = DataLoader(
train_ds,
batch_size=args.batch_size,
pin_memory=True,
drop_last=False,
shuffle=True,
num_workers=args.num_workers,
persistent_workers=True if args.num_workers > 0 else False,
prefetch_factor=2 if args.num_workers > 0 else None
)
# create the optimizer
optimizer = optim.AdamW(model.parameters(), lr=args.learning_rate)
# create the learning-rate scheduler
total_steps = len(train_loader) * args.epochs
warmup_steps = args.warmup_iters if args.warmup_iters > 0 else int(0.1 * total_steps)
scheduler = get_cosine_schedule_with_warmup(
optimizer,
num_warmup_steps=warmup_steps,
num_training_steps=total_steps
)
# prepare everything for training
model, optimizer, train_loader, scheduler = accelerator.prepare(
model, optimizer, train_loader, scheduler
)
# initialize wandb
if args.use_wandb and accelerator.is_main_process:
import wandb
wandb.init(project=args.wandb_project, name=args.wandb_run_name, config=args)
else:
wandb = None
# training loop
overall_start_time = time.time() # Record overall start time
for epoch in range(args.epochs):
train_epoch(epoch, accelerator, model, train_loader, optimizer, scheduler, args, ctx, overall_start_time) # Pass overall start time
# finish wandb
if args.use_wandb and accelerator.is_main_process:
wandb.finish()
if __name__ == "__main__":
main()