update readme

gongjy 2025-02-10 11:14:52 +08:00
parent b60fd11f85
commit c2e67c25f7
3 changed files with 391 additions and 255 deletions

README.md

@@ -34,7 +34,8 @@

* This is not only a full-stage open-source reproduction of a large language model, but also a tutorial for getting started with LLMs.
* We hope this project can serve as a modest starting example for everyone, sharing the joy of creation and driving progress in the wider AI community!

> To avoid misunderstanding: the "2 hours" is measured on a single NVIDIA 3090 GPU, and the "3 RMB" refers to the GPU server rental cost; detailed specifications are given below.

---
@@ -72,7 +73,7 @@

> [!NOTE]
> As of 2025-02-07, the MiniMind series has completed pretraining of several models; the smallest needs only 25.8M (0.02B) parameters to hold a fluent conversation!

<details style="color:rgb(128,128,128)">
<summary>Models List</summary>

| Model (Size) | Inference Memory (approx.) | Release |

@@ -172,19 +173,29 @@

# 📌 Quick Start
<div align="center" style="font-size: 1.5em; font-weight: bold;"> ---
<img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" alt="Hugging Face Logo" style="vertical-align: middle; height: 30px;" />
Hugging Face
[MiniMind (HuggingFace)](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)
<img src="https://g.alicdn.com/sail-web/maas/1.15.0/static/modelscopeIcon.cd89353f.svg" alt="Hugging Face Logo" style="vertical-align: middle; height: 30px;" />
[MiniMind (ModelScope)](https://www.modelscope.cn/profile/gongjy)
<div align="center" style="font-size: 1.5em; font-weight: bold; display: flex; align-items: center;">
<a href="https://jingyaogong.github.io/minimind" style="text-decoration: none; display: flex; align-items: center; margin-right: 20px;">
<img src="./images/logo2.png" alt="MiniMind Logo" style="vertical-align: middle; height: 30px;" />
<span style="margin-left: 10px;color:#0877e5;font-style: italic">MiniMind Series</span>
</a>
<span style="color: #aaa; font-size: 1.2em; margin: 0 10px;">×</span>
&nbsp;
<a href="https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5" style="text-decoration: none; display: flex; align-items: center; margin-right: 20px;">
<img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" alt="Hugging Face Logo" style="vertical-align: middle; height: 30px;" />
<span style="margin-left: 10px;color:#d56100;font-style: italic">Hugging Face</span>
</a>
<span style="color: #aaa; font-size: 1.2em; margin: 0 10px;">&</span>
<a href="https://www.modelscope.cn/profile/gongjy" style="text-decoration: none; display: flex; align-items: center; margin-left: 20px;">
<img src="https://g.alicdn.com/sail-web/maas/1.15.0/static/modelscopeIcon.cd89353f.svg" alt="ModelScope Logo" style="vertical-align: middle; height: 30px;" />
</a>
</div> </div>
---

<details style="color:rgb(128,128,128)">
<summary>Sharing My Hardware and Software Configuration (For Reference Only)</summary>

* CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
@@ -222,9 +233,8 @@ python eval_model.py --load 1

### 3. Or Launch the WebUI

```bash
# May require `python>=3.10`; install with `pip install streamlit`
# cd scripts
streamlit run web_demo.py
```
@@ -233,63 +243,88 @@ streamlit run web_demo.py

### 1. Environment Setup

```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
<details style="color:rgb(128,128,128)">
<summary>Note: check in advance whether Torch can use CUDA</summary>

```python
import torch
print(torch.cuda.is_available())  # should print True if CUDA is usable
```

If it is not available, download the appropriate whl file from [torch_stable](https://download.pytorch.org/whl/torch_stable.html)
and install it manually. See this [guide](https://blog.csdn.net/weixin_45456738/article/details/141029610?ops_request_misc=&request_id=&biz_id=102&utm_term=%E5%AE%89%E8%A3%85torch&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-2-141029610.nonecase&spm=1018.2226.3001.4187) for reference.

</details>
### 2. Download the Data

Download the required data files from the [dataset download link](https://www.modelscope.cn/datasets/gongjy/minimind-dataset/files)
and place them in the `./dataset` directory.

<details style="color:rgb(128,128,128)">
<summary>Note: about the datasets</summary>

By default, downloading `pretrain_hq.jsonl` + `sft_mini_512.jsonl` is recommended for the fastest reproduction of the Zero chat model.

The data files can be chosen freely; several combinations are suggested below and can be mixed according to your training needs and GPU resources.

</details>
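Before launching a run, it can be worth sanity-checking what a downloaded `.jsonl` file actually contains. Below is a minimal sketch; it only assumes one JSON object per line (which is what the `jsonl` extension implies), and the field names printed are whatever the file provides:

```python
import json

# Peek at the first record of a downloaded dataset file to see its fields.
path = "./dataset/pretrain_hq.jsonl"  # or any other file listed below
with open(path, "r", encoding="utf-8") as f:
    sample = json.loads(f.readline())

print("fields:", list(sample.keys()))
for key, value in sample.items():
    text = str(value)
    print(f"{key}: {text[:120]}{'...' if len(text) > 120 else ''}")
```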
### 3. Start Training

**3.1 Pretraining (learning knowledge)**

```bash
python train_pretrain.py
```

<span style="color:rgb(128,128,128)">
> Pretraining produces `pretrain_*.pth` as the pretrained output weights (where * is the model dimension, 512 by default).
</span>

**3.2 Supervised Fine-Tuning (learning how to converse)**

```bash
python train_full_sft.py
```

<span style="color:rgb(128,128,128);">
> Supervised fine-tuning produces `full_sft_*.pth` as the instruction-tuned output weights (where `full` means full-parameter fine-tuning).
</span>
---

<details style="color:rgb(128,128,128)">
<summary>Note: about training</summary>

All training scripts save the parameters to `./out/***.pth` every 100 steps by default (each save overwrites the previous weight file).

For simplicity, only these two training stages are described here. For other training methods (LoRA, distillation, reinforcement learning, reasoning fine-tuning, etc.), see the detailed notes in the [Experiments] section below.

</details>
### 4. Test the Model

Make sure the `*.pth` file of the model you want to test is in the `./out/` directory.
Alternatively, download and use my trained `*.pth` files from [here](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch/files).

```bash
python eval_model.py --model_mode 1 # default 0: test the pretrain model; set to 1 to test the full_sft model
```

<details style="color:rgb(128,128,128)">
<summary>Note: about testing</summary>

For details, see the `eval_model.py` script. The model_mode options are 0: pretrained model, 1: SFT-Chat model, 2: RLHF-Chat model, 3: Reason model.

</details>
---

@@ -302,8 +337,8 @@ streamlit run web_demo.py

torchrun --nproc_per_node 3 train_xxx.py
```

<details style="color:rgb(128,128,128)">
<summary>Note: other options</summary>

* Launch training on a single machine with N GPUs (DeepSpeed)

```bash
@@ -334,8 +369,9 @@ torchrun --nproc_per_node 3 train_xxx.py

five separate tokens, and rare words are hard to cover.
The choice of "vocabulary" certainly matters: an LLM's output is essentially a SoftMax multi-class problem over the N words in the vocabulary, which is then decoded back into natural language.
Because MiniMind's size must be strictly controlled, the vocabulary is kept short to avoid a top-heavy model in which the embedding layer takes up too large a share of the LLM's parameters.
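To make the trade-off concrete, here is a rough back-of-the-envelope sketch; the hidden size, layer count, and comparison vocabulary sizes are illustrative assumptions, not the exact MiniMind configuration:

```python
# Rough share of total parameters consumed by the token-embedding table.
def embedding_share(vocab_size: int, hidden: int = 512, n_layers: int = 8) -> float:
    embed = vocab_size * hidden                 # embedding parameters
    per_block = 12 * hidden * hidden            # ~attention + FFN parameters per decoder block
    return embed / (embed + n_layers * per_block)

for vocab in (6400, 32000, 152000):             # small vocab vs. typical large-model vocabularies
    print(f"vocab={vocab:>6}: embedding ≈ {embedding_share(vocab):.0%} of parameters")
```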
<details style="color:rgb(128,128,128)">
<summary>Tokenizer Details</summary>

The tokenizer vocabulary sizes of strong third-party open-source models (e.g. Yi, qwen, chatglm, mistral, Llama3) are as follows:
@@ -451,11 +487,9 @@ quality (though still not truly "high"; improving data quality is a never-ending task)

> [!NOTE]
> After 2025-02-05, all datasets used for MiniMind's final training are open-sourced, so there is no need to preprocess large-scale datasets yourself, avoiding repetitive data-processing work.

MiniMind training datasets ([ModelScope](https://www.modelscope.cn/datasets/gongjy/minimind-dataset/files) | [HuggingFace](https://huggingface.co/datasets/jingyaogong))

<span style="color:rgb(128,128,128)">> No need to clone everything; download only the files you need</span>

Place the downloaded dataset files in the `./dataset/` directory (✨ marks the recommended required files).
@@ -473,8 +507,8 @@ MiniMind training datasets

└── tokenizer_train.jsonl (1GB)
```

<details style="color:rgb(128,128,128)">
<summary>Note: dataset descriptions</summary>

* `dpo.jsonl` -- RLHF-stage dataset
* `lora_identity.jsonl` -- self-identity dataset (e.g. "Who are you?" "I am minimind..."), recommended for LoRA training (also usable for full-parameter SFT; don't be limited by the name)
@@ -492,7 +526,8 @@ MiniMind training datasets

![dataset](./images/dataset.jpg)

<details style="color:rgb(128,128,128)">
<summary>Explanation & recommended training plans</summary>

* The MiniMind2 series was trained on roughly 20GB of corpus in total, about 4B tokens, i.e. the data combination above (cost: 💰💰💰💰💰💰💰💰, effect: 😊😊😊😊😊😊).

@@ -502,6 +537,8 @@ MiniMind training datasets

* [Compromise plan] You can also freely combine medium-sized data such as `sft_mini_512.jsonl` and `sft_1024.jsonl` for training (cost: 💰💰💰, effect: 😊😊😊😊).

</details>
# 📌 Model Structure

Like [Llama3.1](https://ai.meta.com/blog/meta-llama-3-1/), MiniMind-Dense uses the Transformer Decoder-Only structure. Its differences from GPT-3 are:

@@ -541,7 +578,7 @@ MiniMind's overall structure is the same, differing only in the RoPE computation, inference function, and FFN layer

## Training Cost

- **Time unit**: hours (h).
- **Cost unit**: RMB (¥); 7¥ ≈ 1 USD.
- **3090 rental price**: ≈ 1.3¥/h (check current market rates yourself).
- **Reference standard**: the table lists measured training times only for the `pretrain` and `sft_mini_512` datasets; the other times are estimated from dataset size (there may be slight discrepancies).
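As a rough sanity check on how these rates combine: the fastest run quoted below takes about 2.1 h on one rented 3090, i.e. roughly 2.1 h × 1.3 ¥/h ≈ 2.73 ¥ (≈ 0.39 $ at 7¥ ≈ 1$).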
@@ -554,28 +591,40 @@ MiniMind's overall structure is the same, differing only in the RoPE computation, inference function, and FFN layer

---
<details style="color:rgb(128,128,128)">
<summary>Training cost summary & prediction</summary>

> MiniMind2-Small parameters
>> `pretrain_hq`+`sft_mini_512` datasets
<br/>single 3090 (1 epoch) + 2.1 hours + a cost of 2.73 RMB
<br/>is enough to train the MiniMind-Zero-0.025B model from scratch!!!

> MiniMind2-Small parameters
>> `pretrain_hq`+`sft_512`+`sft_2048`+`dpo` datasets
<br/>single 3090 (2 epochs) + roughly 38.16 hours + a cost of 49.61 RMB
<br/>is enough to train the MiniMind2-Small-0.025B model from scratch!!!

> MiniMind2 parameters
>> `pretrain_hq`+`sft_512`+`sft_2048`+`dpo` datasets
<br/>single 3090 (2 epochs) + roughly 122 hours + a cost of 158.6 RMB
<br/>is enough to train the MiniMind2-0.1B model from scratch!!!

</details>

<span style="color:rgb(0,147,64)">
With a single NVIDIA 3090, `MiniMind-Zero` can be trained from scratch in just `2 hours` + `3 RMB`, achieving ChatBot functionality!
</span>

✨PS: With 8× 4090 GPUs, the total training time can even be compressed to under 10 minutes! (Because the time is shorter, the cost is still about 3 RMB, comparable to the single-GPU cost.)

✨Achieving "large-model freedom" that everyone can afford, at an extremely low barrier to entry: this is exactly why the MiniMind series was created!

<span style="color:rgb(0,147,64)">
The `MiniMind-Zero` that costs only `3 RMB` is not a gimmick. Chat test:
</span>
```textmate
👶: Please introduce yourself.

@@ -591,10 +640,12 @@ MiniMind's overall structure is the same, differing only in the RoPE computation, inference function, and FFN layer

🤖️: You mentioned "Introok's the believeations of theument." This name originates from the ancient Chinese "groty of of the change."
```

<span style="color:rgb(0,147,64)">
Extremely fast and already showing initial results; the training data could even be compressed further into a smaller, higher-quality set.
</span>

The Zero model weights are saved as `full_sft_512_zero.pth` (see the MiniMind model file links below); feel free to download it and verify this model's performance.

---
## Ⅱ Main Training Steps

@@ -612,7 +663,10 @@ torchrun --nproc_per_node 1 train_pretrain.py # 1 means single-GPU training; adjust according to

python train_pretrain.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as `pretrain_*.pth` (* is the model dimension; each new save overwrites the old file).
</span>
### **2. Supervised Fine-Tuning (SFT)**:

@@ -630,7 +684,8 @@ torchrun --nproc_per_node 1 train_full_sft.py

python train_full_sft.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as `full_sft_*.pth` (* is the model dimension; each new save overwrites the old file).
</span>

## Ⅲ Other Training Steps
@@ -643,7 +698,10 @@ python train_full_sft.py

Unlike RL algorithms such as PPO (Proximal Policy Optimization), which need a reward model and a value model,
DPO derives a closed-form solution for PPO's reward model, replacing the online reward model with offline data; the Ref model's outputs can be saved in advance.
DPO keeps performance nearly unchanged while only needing to run the actor_model and ref_model, which greatly reduces GPU memory overhead and improves training stability.
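For orientation, the core objective can be written in a few lines. The sketch below is a generic DPO loss over per-sample log-probabilities, not the literal code in `train_dpo.py`; the β value and tensor names are illustrative:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Generic DPO loss; each argument is a [batch] tensor of summed token log-probs."""
    # Log-ratios of the trainable actor model against the frozen reference model
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Encourage a positive margin between chosen and rejected responses
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```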
<span style="color:rgb(128,128,128)">
The RLHF training step is **not required**. It rarely improves the model's "intelligence" and is usually used only to improve its "politeness", with benefits (preference alignment, less harmful content) as well as drawbacks (costly sample collection, feedback bias, loss of diversity).
</span>

```bash
torchrun --nproc_per_node 1 train_dpo.py

@@ -651,7 +709,8 @@ torchrun --nproc_per_node 1 train_dpo.py

python train_dpo.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as `rlhf_*.pth` (* is the model dimension; each new save overwrites the old file).
</span>
### **4. Knowledge Distillation (KD)**

@@ -678,7 +737,8 @@ torchrun --nproc_per_node 1 train_full_sft.py

python train_full_sft.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are likewise saved every `100 steps` by default as `full_sft_*.pth` (* is the model dimension; each new save overwrites the old file).
</span>

What deserves emphasis here is MiniMind's white-box distillation code `train_distillation.py`. Since the MiniMind series has no stronger teacher model of its own, the white-box distillation code serves only as a learning reference.
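For reference, white-box (logit) distillation usually reduces to a soft-label KL term like the sketch below; the temperature and weighting are assumed values, and this is not the literal contents of `train_distillation.py`:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between temperature-softened teacher and student token distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 keeps gradient magnitudes comparable to the hard-label cross-entropy term
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```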
@@ -701,7 +761,10 @@ torchrun --nproc_per_node 1 train_lora.py

python train_lora.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as `lora_xxx_*.pth` (* is the model dimension; each new save overwrites the old file).
</span>
Many people wonder: how can the model learn knowledge of my private domain? How should the dataset be prepared? How do I turn a general-purpose model into a domain-specific one?
Here are a few examples. A general-purpose model may lack medical-domain knowledge; you can try injecting domain knowledge on top of the existing model to get better performance.
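As a reminder of what LoRA actually adds, here is a minimal sketch of a low-rank adapter wrapped around a frozen linear layer; the rank, scaling, and class name are illustrative and not the exact implementation in `train_lora.py`:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: y = Wx + scale * B(Ax)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # only the adapter trains
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale
```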
@@ -801,7 +864,8 @@ torchrun --nproc_per_node 1 train_distill_reason.py

python train_distill_reason.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as `reason_*.pth` (* is the model dimension; each new save overwrites the old file).
</span>

Give it a try:

@@ -858,16 +922,20 @@ MobileLLM argues that depth matters more than width; a "deep and narrow" ("slim"
## Training Results

<span style="color:rgb(128,128,128)">
Loss curves of the MiniMind2 models (the datasets were updated and re-cleaned several times after training, so the loss values are for reference only)
</span>

| models | pretrain (length-512) | sft (length-512) |
|-----------------|----------------------------------------------------|----------------------------------------------------|
| MiniMind2-Small | <img src="./images/pre_512_loss.png" width="100%"> | <img src="./images/sft_512_loss.png" width="100%"> |
| MiniMind2 | <img src="./images/pre_768_loss.png" width="100%"> | <img src="./images/sft_768_loss.png" width="100%"> |

### Training Complete - Model Collection

<span style="color:rgb(128,128,128)">
> Since many people reported that Baidu Netdisk is slow, MiniMind2 and all later models are hosted on ModelScope/HuggingFace.
</span>
#### Native PyTorch Models

@@ -875,17 +943,12 @@ MobileLLM argues that depth matters more than width; a "deep and narrow" ("slim"

* [MiniMind-V1 series (Baidu Netdisk)](https://pan.baidu.com/s/1KUfSzEkSXYbCCBj0Pw-9fA?pwd=6666)

<details style="color:rgb(128,128,128)">
<summary>Torch file naming reference</summary>

| Model Name | params | pretrain_model | sft_model | rl_model | reason_model | lora_model |
|-----------------|--------|------------------------|------------------------|--------------------|------------------|--------------------|
| MiniMind2-small | 26M | `pretrain_512.pth` | `full_sft_512.pth` | `rlhf_512.pth` | `reason_512.pth` | `lora_xxx_512.pth` |
| MiniMind2-MoE | 145M | `pretrain_640_moe.pth` | `full_sft_640_moe.pth` | `rlhf_640_moe.pth` | - | - |
| MiniMind2 | 104M | `pretrain_768.pth` | `full_sft_768.pth` | `rlhf_768.pth` | `reason_768.pth` | `lora_xxx_768.pth` |

@@ -897,6 +960,10 @@ MobileLLM argues that depth matters more than width; a "deep and narrow" ("slim"

</details>

#### Transformers Models

* MiniMind series ([ModelScope](https://www.modelscope.cn/profile/gongjy)
  | [HuggingFace](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5))

---
@@ -904,13 +971,14 @@ MobileLLM argues that depth matters more than width; a "deep and narrow" ("slim"

## RLHF vs SFT Comparison

The comparison is based on the `full_sft` and `rlhf` models of the `MiniMind2 series`; the random seed is fixed.

<details style="color:rgb(128,128,128)">
<summary>Q&A test</summary>

[A0] MiniMind2: full_sft_640

[A1] MiniMind2: rlhf_640

```text
[Q]: What is your name?
@@ -942,8 +1010,10 @@ The difference between DPO and online PPO is that both reject and chosen are prepared offline, which for minimind

## Ⅱ Subjective Sample Evaluation

<span style="color:rgb(128,128,128)">
🏃The following tests were completed on 2025-02-09; new models released after this date will not be added to the tests unless specifically needed.
</span>

[A] [MiniMind2 (0.1B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)<br/>
[B] [MiniMind2-MoE (0.15B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)<br/>

@@ -1032,13 +1102,17 @@ The difference between DPO and online PPO is that both reject and chosen are prepared offline, which for minimind

[H]: I am an AI model with very few parameters 🥺; my knowledge base is limited and I cannot answer your question directly. Try another question 👋
```
---

<span style="color:rgb(128,128,128)">
🙋Hand all of the questions above, together with the models' answers, directly to DeepSeek-R1 and let it comment, rank, and score them:
</span>

---

<details style="color:rgb(128,128,128)">
<summary>Detailed comments</summary>

### Scoring Criteria:

@@ -1116,11 +1190,16 @@ The difference between DPO and online PPO is that both reject and chosen are prepared offline, which for minimind
### 👉 Subjective Effect Summary

<span style="color:rgb(229,111,8)">
My personal subjective evaluation largely agrees with DeepSeek-R1; in particular:
</span>

* The ranking of the MiniMind series matches intuition very well: more parameters plus more thorough training data give higher scores, and hallucinations and errors are visibly rarer than in the smaller models.
* Model H's answers look decent to the naked eye, although there is some hallucinated, made-up content.
* Model G's training data may be incomplete; the released weights performed poorly in testing.
* Reciting the evergreen Scaling Law once more: the more parameters and the more training data, the stronger the model's performance.

## Ⅲ Objective Benchmark

@@ -1130,8 +1209,8 @@ The difference between DPO and online PPO is that both reject and chosen are prepared offline, which for minimind

The chosen test sets are C-Eval, CMMLU, A-CLUE, and TMMLU+, all pure-Chinese-language leaderboards.

<details style="color:rgb(128,128,128)">
<summary>Evaluation framework</summary>

The evaluation framework chosen is [lm-evaluation](https://github.com/EleutherAI/lm-evaluation-harness),
which makes launching an evaluation very convenient after installation:

English README

@@ -58,35 +58,28 @@

# 📌 Introduction

The emergence of Large Language Models (LLMs) has sparked unprecedented global attention on AI. Whether it's ChatGPT,
DeepSeek, or Qwen, their stunning performance leaves people in awe. However, the massive scale of hundreds of billions
of parameters makes it not only difficult to train them on personal devices, but also almost impossible to deploy them.
Opening the "black box" of large models and exploring their inner workings is exhilarating!

Sadly, 99% of explorations can only stop at fine-tuning existing large models with techniques like LoRA, learning a few
new commands or tasks. It's like teaching Newton how to use a 21st-century smartphone; though interesting, it completely
deviates from the original goal of understanding the essence of physics. Meanwhile, third-party large model frameworks
and toolkits, such as transformers+trl, almost only expose highly abstract interfaces. With just 10 lines of code, you
can complete the entire training process of "loading model + loading dataset + inference + reinforcement learning".
While this efficient encapsulation is convenient, it's like a high-speed spaceship, isolating us from the underlying
implementation and hindering our opportunity to dive deep into the core code of LLMs. However, "building a plane with
Legos is far more exciting than flying first-class!" What's worse, the internet is flooded with paid courses and
marketing accounts, selling AI tutorials with flawed and half-understood content.

Therefore, the goal of this project is to lower the learning threshold for LLMs, allowing everyone to start by
understanding each line of code, and to train a very small language model from scratch, not just performing
**inference**! With server costs of less than 3 RMB, you can experience the entire process of building a language model
from 0 to 1. Let's enjoy the fun of creation together!
> [!NOTE]
> (As of 2025-02-07) The MiniMind series has completed pretraining for multiple models, with the smallest one being only
> 25.8M (0.02B) and capable of smooth conversation!

<details style="color:rgb(128,128,128)">
<summary>Models List</summary>

| Model (Size) | Inference Usage (Approx.) | Release |

@@ -120,7 +113,7 @@ Let's enjoy the fun of creation together!
We hope this open-source project can help LLM beginners quickly get started!

### 👉**Update log**

<details close>
<summary> <b>2025-02-09 (newest 🎉🎉🎉)</b> </summary>

@@ -191,19 +184,28 @@ We hope this open-source project can help LLM beginners quickly get started!
# 📌 Quick Start

---

<div align="center" style="font-size: 1.5em; font-weight: bold; display: flex; align-items: center;">
  <a href="https://jingyaogong.github.io/minimind" style="text-decoration: none; display: flex; align-items: center; margin-right: 20px;">
    <img src="./images/logo2.png" alt="MiniMind Logo" style="vertical-align: middle; height: 30px;" />
    <span style="margin-left: 10px;color:#0877e5;font-style: italic">MiniMind Series</span>
  </a>
  <span style="color: #aaa; font-size: 1.2em; margin: 0 10px;">×</span>
  &nbsp;
  <a href="https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5" style="text-decoration: none; display: flex; align-items: center; margin-right: 20px;">
    <img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" alt="Hugging Face Logo" style="vertical-align: middle; height: 30px;" />
    <span style="margin-left: 10px;color:#d56100;font-style: italic">Hugging Face</span>
  </a>
  <span style="color: #aaa; font-size: 1.2em; margin: 0 10px;">&</span>
  <a href="https://www.modelscope.cn/profile/gongjy" style="text-decoration: none; display: flex; align-items: center; margin-left: 20px;">
    <img src="https://g.alicdn.com/sail-web/maas/1.15.0/static/modelscopeIcon.cd89353f.svg" alt="ModelScope Logo" style="vertical-align: middle; height: 30px;" />
  </a>
</div>
---

<details style="color:rgb(128,128,128)">
<summary>Sharing My Hardware and Software Configuration (For Reference Only)</summary>

* CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz

@@ -241,84 +243,105 @@ python eval_model.py --load 1
### 3. Or Start the WebUI

```bash
# You may need `python>=3.10`; install with `pip install streamlit`
# cd scripts
streamlit run web_demo.py
```
## Ⅱ Training from Scratch

### 1. Environment Setup

```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
<details style="color:rgb(128,128,128)">
<summary>Note: Test if Torch can use CUDA</summary>

```python
import torch
print(torch.cuda.is_available())  # should print True if CUDA is usable
```

If CUDA is not available, please download the `.whl` file
from [torch_stable](https://download.pytorch.org/whl/torch_stable.html) and install it. Refer to
this [link](https://blog.csdn.net/weixin_45456738/article/details/141029610?ops_request_misc=&request_id=&biz_id=102&utm_term=%E5%AE%89%E8%A3%85torch&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-2-141029610.nonecase&spm=1018.2226.3001.4187)
for guidance.

</details>
### 2. Data Download

Download the required data files from
the [dataset download link](https://www.modelscope.cn/datasets/gongjy/minimind-dataset/files) and place them in
the `./dataset` directory.

<details style="color:rgb(128,128,128)">
<summary>Note: Dataset Information</summary>

By default, it is recommended to download `pretrain_hq.jsonl` + `sft_mini_512.jsonl` for the fastest Zero-chat model
reproduction.

You can freely choose data files. Various combinations are provided below, and you can select according to your
training needs and GPU resources.

</details>
### 3. Start Training

**3.1 Pretraining (Learning Knowledge)**

```bash
python train_pretrain.py
```

<span style="color:rgb(128,128,128)">
> Execute pretraining to get `pretrain_*.pth` as the output weights for pretraining (where * represents the model dimension, default is 512).
</span>

**3.2 Supervised Fine-Tuning (Learning Dialogue Style)**

```bash
python train_full_sft.py
```

<span style="color:rgb(128,128,128);">
> Execute supervised fine-tuning to get `full_sft_*.pth` as the output weights for instruction fine-tuning (where `full` represents full-parameter fine-tuning).
</span>
---

<details style="color:rgb(128,128,128)">
<summary>Note: Training Information</summary>

By default, the model parameters are saved every 100 steps during training to `./out/***.pth` (each save overwrites
the old weight file).

For simplicity, only the two training stages are listed here. For other training methods (LoRA, distillation,
reinforcement learning, fine-tuning for reasoning, etc.), refer to the detailed explanation in the [Experiments]
section below.

</details>
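If you want to check what a saved checkpoint contains before evaluating it, here is a minimal sketch; it assumes the `.pth` file holds a plain PyTorch state dict, as the save notes above suggest, and the path is just an example:

```python
import torch

# Inspect a saved checkpoint without constructing the full model class.
state = torch.load("./out/full_sft_512.pth", map_location="cpu")

total_params = sum(t.numel() for t in state.values())
print(f"{len(state)} tensors, ~{total_params / 1e6:.1f}M parameters")
for name, tensor in list(state.items())[:5]:   # first few entries
    print(name, tuple(tensor.shape))
```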
### 4. Testing Model Performance

Ensure that the model `*.pth` file you want to test is located in the `./out/` directory.
Alternatively, you can download and use the `*.pth` files I trained
from [here](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch/files).

```bash
python eval_model.py --model_mode 1 # Default is 0: test the pretrain model; set to 1: test the full_sft model
```

<details style="color:rgb(128,128,128)">
<summary>Note: Testing Information</summary>

For more details, check the `eval_model.py` script. The model_mode options are 0: pretrained model, 1: SFT-Chat model,
2: RLHF-Chat model, 3: Reason model.

</details>
---

@@ -332,8 +355,8 @@ Start training with N GPUs on a single machine (DDP, supports multi-node, multi-

torchrun --nproc_per_node 3 train_xxx.py
```

<details style="color:rgb(128,128,128)">
<summary>Note: Others</summary>

* Start training with N GPUs on a single machine (DeepSpeed):
  ```bash
@@ -373,8 +396,8 @@ vocabulary, with the model decoding the final output into natural language. Since

strictly controlled, the vocabulary length should be kept short to avoid the embedding layer dominating the model's
overall parameters. Thus, a smaller vocabulary size is beneficial.

<details style="color:rgb(128,128,128)">
<summary>Tokenizer Details</summary>

Here are the vocabulary sizes of several popular open-source models:
@@ -518,8 +541,8 @@ Place the downloaded dataset files in the `./dataset/` directory (✨ required files)

└── tokenizer_train.jsonl (1GB)
```

<details style="color:rgb(128,128,128)">
<summary>Dataset Descriptions</summary>

* `dpo.jsonl` -- RLHF dataset
* `lora_identity.jsonl` -- Self-identity dataset (e.g., "Who are you? I'm MiniMind..."), recommended for LoRA training
  (also usable for full-parameter SFT; don't be limited by the name)
@@ -538,21 +561,24 @@ Place the downloaded dataset files in the `./dataset/` directory (✨ required files)

![dataset](./images/dataset.jpg)

<details style="color:rgb(128,128,128)">
<summary>Explanation & Recommended Training Plans</summary>

* The MiniMind2 Series has been trained on approximately 20GB of corpus, or about 4B tokens, corresponding to the data
  combination results above (Cost: 💰💰💰💰💰💰💰💰, Effect: 😊😊😊😊😊😊).
* For the fastest Zero-model implementation from scratch, it is recommended to use the data combination
  of `pretrain_hq.jsonl` + `sft_mini_512.jsonl`. The specific costs and effects can be seen in the table below (Cost:
  💰, Effect: 😊😊).
* For those with sufficient computational resources or more focus on results, it is advisable to fully reproduce
  MiniMind2 with the first option; if you only have a single GPU or prefer a quick reproduction within a short time,
  the second option is strongly recommended.
* [Compromise Plan] You can also freely combine medium-sized data like `sft_mini_512.jsonl`, `sft_1024.jsonl` for
  training (Cost: 💰💰💰, Effect: 😊😊😊😊).

</details>
# 📌 Model Structure

@@ -599,7 +625,7 @@ Reference model parameter versions are shown in the table below:

## Training Cost

- **Time Unit**: Hours (h).
- **Cost Unit**: RMB (¥); 7¥ ≈ 1 USD.
- **3090 Rental Unit Price**: ≈ 1.3¥/h (subject to real-time market rates).
- **Reference Standard**: The table only shows the actual training time for the `pretrain` and `sft_mini_512` datasets.
  Other times are estimated based on dataset size (there may be some discrepancies).
@@ -613,50 +639,60 @@ Reference model parameter versions are shown in the table below:

---

<details style="color:rgb(128,128,128)">
<summary>Training Cost Summary & Prediction</summary>

> MiniMind2-Small Parameters
>> `pretrain_hq` + `sft_mini_512` dataset
<br/>Single 3090 GPU (1 epoch) + 2.1 hours + Cost: 2.73 RMB
<br/>You can train the MiniMind-Zero-0.025B model from scratch!!!

> MiniMind2-Small Parameters
>> `pretrain_hq` + `sft_512` + `sft_2048` + `dpo` dataset
<br/>Single 3090 GPU (2 epochs) + approximately 38.16 hours + Cost: 49.61 RMB
<br/>You can train the MiniMind2-Small-0.025B model from scratch!!!

> MiniMind2 Parameters
>> `pretrain_hq` + `sft_512` + `sft_2048` + `dpo` dataset
<br/>Single 3090 GPU (2 epochs) + approximately 122 hours + Cost: 158.6 RMB
<br/>You can train the MiniMind2-0.1B model from scratch!!!

</details>

<span style="color:rgb(0,147,64)">
With a single NVIDIA 3090 GPU, you can train `MiniMind-Zero` from scratch in just `2 hours` and for a cost of
only `3 RMB`, achieving ChatBot functionality!
</span>

✨PS: If training on 8× 4090 GPUs, the total time can be compressed to under 10 minutes! (Despite the shorter time,
the cost is still around 3 RMB, comparable to the single-GPU cost.)

✨This enables an ultra-low barrier to entry, making it possible for everyone to experiment with large models! This is
the original purpose behind the creation of the MiniMind series!

<span style="color:rgb(0,147,64)">
The `MiniMind-Zero` model, which costs only `3 RMB`, is not a gimmick! Chat test results:
</span>
```textmate
👶: Please introduce yourself.

🤖️: As an artificial intelligence, I don't have a real life or self-awareness, so I don't have my own life. I am designed to help users answer questions, provide information, engage in conversations, etc. My design and functionality are built using computer science, machine learning, artificial intelligence technologies, and algorithms, which I implement through programming and algorithms.

👶: Recommend some local specialties from Hangzhou.

🤖️: Hangzhou has many delicious local specialties, such as chicken skewers, eel heads, water-boiled fish, tofu pudding, etc. These foods are all very tasty, with unique flavors and textures, definitely worth trying.

👶: Please explain the concept of "large language models".

🤖️: A large language model is an AI technology that uses large amounts of text data to train the model, and then continuously optimizes the model through iterations. Its goal is to generate text based on the input data to meet specific needs, and the text can be about any topic you are interested in.

👶: Introduce the history of the United States, please.

🤖️: You mentioned "Introok's the believeations of theument." This name originates from the ancient Chinese "groty of of the change."
```

<span style="color:rgb(0,147,64)">
Fast and effective; it is still possible to compress the training process further with smaller, higher-quality datasets.
</span>

The Zero model weights are saved as `full_sft_512_zero.pth` (see the MiniMind model file link below). Feel free to download and test the model's performance.
## Ⅱ Main Training Steps

@@ -677,8 +713,10 @@ torchrun --nproc_per_node 1 train_pretrain.py # 1 represents single-card training

python train_pretrain.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as: `pretrain_*.pth` (the * represents the specific
model dimension, and each new save will overwrite the previous one).
</span>

### **2. Supervised Fine-Tuning (SFT)**:

@@ -703,8 +741,10 @@ torchrun --nproc_per_node 1 train_full_sft.py

python train_full_sft.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as: `full_sft_*.pth` (the * represents the specific
model dimension, and each new save will overwrite the previous one).
</span>
## Ⅲ Other Training Steps

@@ -732,8 +772,10 @@ torchrun --nproc_per_node 1 train_dpo.py

python train_dpo.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as: `rlhf_*.pth` (the * represents the specific model
dimension, and each new save will overwrite the previous one).
</span>

### **4. Knowledge Distillation (KD)**

@@ -768,8 +810,10 @@ torchrun --nproc_per_node 1 train_full_sft.py

python train_full_sft.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as: `full_sft_*.pth` (the * represents the specific
model dimension, and each new save will overwrite the previous one).
</span>

This section emphasizes MiniMind's white-box distillation code `train_distillation.py`. Since MiniMind doesn't have a
powerful teacher model within the same series, the white-box distillation code serves as a learning reference.
@@ -796,8 +840,10 @@ torchrun --nproc_per_node 1 train_lora.py

python train_lora.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as: `lora_xxx_*.pth` (the * represents the specific
model dimension, and each new save will overwrite the previous one).
</span>

Many people are puzzled: how can a model learn private domain knowledge? How should datasets be prepared? How to
transfer general models into specialized domain models?
@@ -918,8 +964,10 @@ torchrun --nproc_per_node 1 train_distill_reason.py

python train_distill_reason.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as: `reason_*.pth` (* being the specific dimension of
the model; each time a new file is saved, it will overwrite the old one).
</span>

Test it:

@@ -993,31 +1041,28 @@ For reference, the parameter settings for GPT-3 are shown in the table below:

| MiniMind2-Small | <img src="./images/pre_512_loss.png" width="100%"> | <img src="./images/sft_512_loss.png" width="100%"> |
| MiniMind2 | <img src="./images/pre_768_loss.png" width="100%"> | <img src="./images/sft_768_loss.png" width="100%"> |
### Training Completed - Model Collection

<span style="color:rgb(128,128,128)">
> Considering that many people have reported slow speeds with Baidu Netdisk, all MiniMind2 models and beyond are hosted on ModelScope/HuggingFace.
</span>

#### Native PyTorch Models

* [MiniMind2 Series (ModelScope)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)
* [MiniMind-V1 Series (Baidu Netdisk)](https://pan.baidu.com/s/1KUfSzEkSXYbCCBj0Pw-9fA?pwd=6666)
<details style="color:rgb(128,128,128)">
<summary>Torch File Naming Reference</summary>

| Model Name | params | pretrain_model | sft_model | rl_model | reason_model | lora_model |
|-----------------|--------|------------------------|------------------------|--------------------|------------------|--------------------|
| MiniMind2-small | 26M | `pretrain_512.pth` | `full_sft_512.pth` | `rlhf_512.pth` | `reason_512.pth` | `lora_xxx_512.pth` |
| MiniMind2-MoE | 145M | `pretrain_640_moe.pth` | `full_sft_640_moe.pth` | `rlhf_640_moe.pth` | - | - |
| MiniMind2 | 104M | `pretrain_768.pth` | `full_sft_768.pth` | `rlhf_768.pth` | `reason_768.pth` | `lora_xxx_768.pth` |

| Model Name | params | pretrain_model | Single-turn Chat sft | Multi-turn Chat sft | rl_model |
|-------------------|--------|------------------------|------------------------------------|-----------------------------------|--------------|
| minimind-v1-small | 26M | `pretrain_512.pth` | `single_chat/full_sft_512.pth` | `multi_chat/full_sft_512.pth` | `rl_512.pth` |
| minimind-v1-moe | 4×26M | `pretrain_512_moe.pth` | `single_chat/full_sft_512_moe.pth` | `multi_chat/full_sft_512_moe.pth` | - |

@@ -1025,19 +1070,25 @@ For reference, the parameter settings for GPT-3 are shown in the table below:

</details>

#### Transformers Models

* MiniMind
  Series ([ModelScope](https://www.modelscope.cn/profile/gongjy) | [HuggingFace](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5))

---
# 📌 Eval

## RLHF vs SFT Comparison

The test is based on comparing `MiniMind2 series` models trained with "full_sft" and "rlhf". The random seed is fixed.

<details style="color:rgb(128,128,128)">
<summary>Q&A Test</summary>

[A0] MiniMind2: full_sft_640

[A1] MiniMind2: rlhf_640

```text
[Q]: What is your name?
@@ -1090,9 +1141,10 @@ rather than using the PPO method where the reward model acts as a "coach" to correct

## Ⅱ Subjective Sample Evaluation

<span style="color:rgb(128,128,128)">
🏃The following tests were completed on February 9, 2025. New models released after this date will not be included in the tests unless there is a special need.
</span>

[A] [MiniMind2 (0.1B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)<br/>

@@ -1176,13 +1228,16 @@ rather than using the PPO method where the reward model acts as a "coach" to correct

[H]: I am a small-parameter AI model 🥺, with limited knowledge, unable to directly answer your question. Try another question 👋
```
---

<span style="color:rgb(128,128,128)">
🙋Give all the questions and the models' answers above directly to DeepSeek-R1, and let it comment, rank, and score them:
</span>

---

<details style="color:rgb(128,128,128)">
<summary>Specific comments</summary>

### Scoring Criteria:

@@ -1268,7 +1323,9 @@ rather than using the PPO method where the reward model acts as a "coach" to correct
### 👉 Subjective Effect Summary

<span style="color:rgb(229,111,8)">
My personal evaluation largely aligns with DeepSeek-R1's results; in particular:
</span>

* The ranking of the MiniMind series is very intuitive. The larger the parameters and the more training data, the higher
  the score, and hallucinations and errors are less noticeable than with smaller models.
@@ -1284,8 +1341,8 @@ Chinese models.

Instead, we'll focus on a selection of <1B micro-models for a comparative analysis.
The test sets chosen include C-Eval, CMMLU, A-CLUE, and TMMLU+, which are pure Chinese language leaderboards.

<details style="color:rgb(128,128,128)">
<summary>Evaluation Framework</summary>

The evaluation framework chosen is [lm-evaluation](https://github.com/EleutherAI/lm-evaluation-harness),
which is very easy to set up and run after installation:

images/logo2.png (new binary file, 615 KiB; not shown)