update readme

gongjy 2025-02-10 11:14:52 +08:00
parent b60fd11f85
commit c2e67c25f7
3 changed files with 391 additions and 255 deletions

README.md

@@ -34,7 +34,8 @@

* This is not only a full-stage open-source reproduction of a large language model, but also a tutorial for getting started with LLMs.
* We hope this project can serve as a modest starting example for everyone, sharing the joy of creation and driving progress in the wider AI community!

> To avoid misunderstanding: the "2 hours" is measured on a single NVIDIA 3090 GPU, and the "3 RMB" refers to the GPU server rental cost; detailed specifications are given below.

---
@@ -72,7 +73,7 @@

> [!NOTE]
> As of 2025-02-07, the MiniMind series has completed pretraining of several models; the smallest needs only 25.8M (0.02B) parameters to hold a fluent conversation!

<details style="color:rgb(128,128,128)">
<summary>Models List</summary>

| Model (Size) | Inference Memory (approx.) | Release |

@@ -172,19 +173,29 @@

# 📌 Quick Start
<div align="center" style="font-size: 1.5em; font-weight: bold;"> ---
<img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" alt="Hugging Face Logo" style="vertical-align: middle; height: 30px;" />
Hugging Face
[MiniMind (HuggingFace)](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)
<img src="https://g.alicdn.com/sail-web/maas/1.15.0/static/modelscopeIcon.cd89353f.svg" alt="Hugging Face Logo" style="vertical-align: middle; height: 30px;" />
[MiniMind (ModelScope)](https://www.modelscope.cn/profile/gongjy)
<div align="center" style="font-size: 1.5em; font-weight: bold; display: flex; align-items: center;">
<a href="https://jingyaogong.github.io/minimind" style="text-decoration: none; display: flex; align-items: center; margin-right: 20px;">
<img src="./images/logo2.png" alt="MiniMind Logo" style="vertical-align: middle; height: 30px;" />
<span style="margin-left: 10px;color:#0877e5;font-style: italic">MiniMind Series</span>
</a>
<span style="color: #aaa; font-size: 1.2em; margin: 0 10px;">×</span>
&nbsp;
<a href="https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5" style="text-decoration: none; display: flex; align-items: center; margin-right: 20px;">
<img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" alt="Hugging Face Logo" style="vertical-align: middle; height: 30px;" />
<span style="margin-left: 10px;color:#d56100;font-style: italic">Hugging Face</span>
</a>
<span style="color: #aaa; font-size: 1.2em; margin: 0 10px;">&</span>
<a href="https://www.modelscope.cn/profile/gongjy" style="text-decoration: none; display: flex; align-items: center; margin-left: 20px;">
<img src="https://g.alicdn.com/sail-web/maas/1.15.0/static/modelscopeIcon.cd89353f.svg" alt="ModelScope Logo" style="vertical-align: middle; height: 30px;" />
</a>
</div> </div>
---

<details style="color:rgb(128,128,128)">
<summary>Sharing My Hardware and Software Configuration (For Reference Only)</summary>

* CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
@@ -222,9 +233,8 @@ python eval_model.py --load 1

### 3. Or Launch the WebUI

```bash
# May require `python>=3.10`; install with `pip install streamlit`
# cd scripts
streamlit run web_demo.py
```
@@ -233,63 +243,88 @@ streamlit run web_demo.py

### 1. Environment Setup

```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
<details style="color:rgb(128,128,128)">
<summary>Note: check in advance whether Torch can use CUDA</summary>

```python
import torch
print(torch.cuda.is_available())  # should print True if CUDA is usable
```

If it is not available, download the appropriate whl file from [torch_stable](https://download.pytorch.org/whl/torch_stable.html)
and install it manually. See this [guide](https://blog.csdn.net/weixin_45456738/article/details/141029610?ops_request_misc=&request_id=&biz_id=102&utm_term=%E5%AE%89%E8%A3%85torch&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-2-141029610.nonecase&spm=1018.2226.3001.4187) for reference.

</details>
### 2. Download the Data

Download the required data files from the [dataset download link](https://www.modelscope.cn/datasets/gongjy/minimind-dataset/files)
and place them in the `./dataset` directory.

<details style="color:rgb(128,128,128)">
<summary>Note: about the datasets</summary>

By default, downloading `pretrain_hq.jsonl` + `sft_mini_512.jsonl` is recommended for the fastest reproduction of the Zero chat model.

The data files can be chosen freely; several combinations are suggested below and can be mixed according to your training needs and GPU resources.

</details>
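Before launching a run, it can be worth sanity-checking what a downloaded `.jsonl` file actually contains. Below is a minimal sketch; it only assumes one JSON object per line (which is what the `jsonl` extension implies), and the field names printed are whatever the file provides:

```python
import json

# Peek at the first record of a downloaded dataset file to see its fields.
path = "./dataset/pretrain_hq.jsonl"  # or any other file listed below
with open(path, "r", encoding="utf-8") as f:
    sample = json.loads(f.readline())

print("fields:", list(sample.keys()))
for key, value in sample.items():
    text = str(value)
    print(f"{key}: {text[:120]}{'...' if len(text) > 120 else ''}")
```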
### 3. Start Training

**3.1 Pretraining (learning knowledge)**

```bash
python train_pretrain.py
```

<span style="color:rgb(128,128,128)">
> Pretraining produces `pretrain_*.pth` as the pretrained output weights (where * is the model dimension, 512 by default).
</span>

**3.2 Supervised Fine-Tuning (learning how to converse)**

```bash
python train_full_sft.py
```

<span style="color:rgb(128,128,128);">
> Supervised fine-tuning produces `full_sft_*.pth` as the instruction-tuned output weights (where `full` means full-parameter fine-tuning).
</span>
---

<details style="color:rgb(128,128,128)">
<summary>Note: about training</summary>

All training scripts save the parameters to `./out/***.pth` every 100 steps by default (each save overwrites the previous weight file).

For simplicity, only these two training stages are described here. For other training methods (LoRA, distillation, reinforcement learning, reasoning fine-tuning, etc.), see the detailed notes in the [Experiments] section below.

</details>
### 4. Test the Model

Make sure the `*.pth` file of the model you want to test is in the `./out/` directory.
Alternatively, download and use my trained `*.pth` files from [here](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch/files).

```bash
python eval_model.py --model_mode 1 # default 0: test the pretrain model; set to 1 to test the full_sft model
```

<details style="color:rgb(128,128,128)">
<summary>Note: about testing</summary>

For details, see the `eval_model.py` script. The model_mode options are 0: pretrained model, 1: SFT-Chat model, 2: RLHF-Chat model, 3: Reason model.

</details>
---

@@ -302,8 +337,8 @@ streamlit run web_demo.py

torchrun --nproc_per_node 3 train_xxx.py
```

<details style="color:rgb(128,128,128)">
<summary>Note: other options</summary>

* Launch training on a single machine with N GPUs (DeepSpeed)

```bash
@@ -334,8 +369,9 @@ torchrun --nproc_per_node 3 train_xxx.py

five separate tokens, and rare words are hard to cover.
The choice of "vocabulary" certainly matters: an LLM's output is essentially a SoftMax multi-class problem over the N words in the vocabulary, which is then decoded back into natural language.
Because MiniMind's size must be strictly controlled, the vocabulary is kept short to avoid a top-heavy model in which the embedding layer takes up too large a share of the LLM's parameters.
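To make the trade-off concrete, here is a rough back-of-the-envelope sketch; the hidden size, layer count, and comparison vocabulary sizes are illustrative assumptions, not the exact MiniMind configuration:

```python
# Rough share of total parameters consumed by the token-embedding table.
def embedding_share(vocab_size: int, hidden: int = 512, n_layers: int = 8) -> float:
    embed = vocab_size * hidden                 # embedding parameters
    per_block = 12 * hidden * hidden            # ~attention + FFN parameters per decoder block
    return embed / (embed + n_layers * per_block)

for vocab in (6400, 32000, 152000):             # small vocab vs. typical large-model vocabularies
    print(f"vocab={vocab:>6}: embedding ≈ {embedding_share(vocab):.0%} of parameters")
```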
<details style="color:rgb(128,128,128)">
<summary>Tokenizer Details</summary>

The tokenizer vocabulary sizes of strong third-party open-source models (e.g. Yi, qwen, chatglm, mistral, Llama3) are as follows:
@@ -451,11 +487,9 @@ quality (though still not truly "high"; improving data quality is a never-ending task)

> [!NOTE]
> After 2025-02-05, all datasets used for MiniMind's final training are open-sourced, so there is no need to preprocess large-scale datasets yourself, avoiding repetitive data-processing work.

MiniMind training datasets ([ModelScope](https://www.modelscope.cn/datasets/gongjy/minimind-dataset/files) | [HuggingFace](https://huggingface.co/datasets/jingyaogong))

<span style="color:rgb(128,128,128)">> No need to clone everything; download only the files you need</span>

Place the downloaded dataset files in the `./dataset/` directory (✨ marks the recommended required files).
@@ -473,8 +507,8 @@ MiniMind training datasets

└── tokenizer_train.jsonl (1GB)
```

<details style="color:rgb(128,128,128)">
<summary>Note: dataset descriptions</summary>

* `dpo.jsonl` -- RLHF-stage dataset
* `lora_identity.jsonl` -- self-identity dataset (e.g. "Who are you?" "I am minimind..."), recommended for LoRA training (also usable for full-parameter SFT; don't be limited by the name)
@@ -492,7 +526,8 @@ MiniMind training datasets

![dataset](./images/dataset.jpg)

<details style="color:rgb(128,128,128)">
<summary>Explanation & recommended training plans</summary>

* The MiniMind2 series was trained on roughly 20GB of corpus in total, about 4B tokens, i.e. the data combination above (cost: 💰💰💰💰💰💰💰💰, effect: 😊😊😊😊😊😊).

@@ -502,6 +537,8 @@ MiniMind training datasets

* [Compromise plan] You can also freely combine medium-sized data such as `sft_mini_512.jsonl` and `sft_1024.jsonl` for training (cost: 💰💰💰, effect: 😊😊😊😊).

</details>
# 📌 Model Structure

Like [Llama3.1](https://ai.meta.com/blog/meta-llama-3-1/), MiniMind-Dense uses the Transformer Decoder-Only structure. Its differences from GPT-3 are:

@@ -541,7 +578,7 @@ MiniMind's overall structure is the same, differing only in the RoPE computation, inference function, and FFN layer

## Training Cost

- **Time unit**: hours (h).
- **Cost unit**: RMB (¥); 7¥ ≈ 1 USD.
- **3090 rental price**: ≈ 1.3¥/h (check current market rates yourself).
- **Reference standard**: the table lists measured training times only for the `pretrain` and `sft_mini_512` datasets; the other times are estimated from dataset size (there may be slight discrepancies).
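As a rough sanity check on how these rates combine: the fastest run quoted below takes about 2.1 h on one rented 3090, i.e. roughly 2.1 h × 1.3 ¥/h ≈ 2.73 ¥ (≈ 0.39 $ at 7¥ ≈ 1$).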
@@ -554,28 +591,40 @@ MiniMind's overall structure is the same, differing only in the RoPE computation, inference function, and FFN layer

---
<details style="color:rgb(128,128,128)">
<summary>Training cost summary & prediction</summary>

> MiniMind2-Small parameters
>> `pretrain_hq`+`sft_mini_512` datasets
<br/>single 3090 (1 epoch) + 2.1 hours + a cost of 2.73 RMB
<br/>is enough to train the MiniMind-Zero-0.025B model from scratch!!!

> MiniMind2-Small parameters
>> `pretrain_hq`+`sft_512`+`sft_2048`+`dpo` datasets
<br/>single 3090 (2 epochs) + roughly 38.16 hours + a cost of 49.61 RMB
<br/>is enough to train the MiniMind2-Small-0.025B model from scratch!!!

> MiniMind2 parameters
>> `pretrain_hq`+`sft_512`+`sft_2048`+`dpo` datasets
<br/>single 3090 (2 epochs) + roughly 122 hours + a cost of 158.6 RMB
<br/>is enough to train the MiniMind2-0.1B model from scratch!!!

</details>

<span style="color:rgb(0,147,64)">
With a single NVIDIA 3090, `MiniMind-Zero` can be trained from scratch in just `2 hours` + `3 RMB`, achieving ChatBot functionality!
</span>

✨PS: With 8× 4090 GPUs, the total training time can even be compressed to under 10 minutes! (Because the time is shorter, the cost is still about 3 RMB, comparable to the single-GPU cost.)

✨Achieving "large-model freedom" that everyone can afford, at an extremely low barrier to entry: this is exactly why the MiniMind series was created!

<span style="color:rgb(0,147,64)">
The `MiniMind-Zero` that costs only `3 RMB` is not a gimmick. Chat test:
</span>
```textmate
👶: Please introduce yourself.

@@ -591,10 +640,12 @@ MiniMind's overall structure is the same, differing only in the RoPE computation, inference function, and FFN layer

🤖️: You mentioned "Introok's the believeations of theument." This name originates from the ancient Chinese "groty of of the change."
```

<span style="color:rgb(0,147,64)">
Extremely fast and already showing initial results; the training data could even be compressed further into a smaller, higher-quality set.
</span>

The Zero model weights are saved as `full_sft_512_zero.pth` (see the MiniMind model file links below); feel free to download it and verify this model's performance.

---
## Ⅱ Main Training Steps

@@ -612,7 +663,10 @@ torchrun --nproc_per_node 1 train_pretrain.py # 1 means single-GPU training; adjust according to

python train_pretrain.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as `pretrain_*.pth` (* is the model dimension; each new save overwrites the old file).
</span>
### **2. Supervised Fine-Tuning (SFT)**:

@@ -630,7 +684,8 @@ torchrun --nproc_per_node 1 train_full_sft.py

python train_full_sft.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as `full_sft_*.pth` (* is the model dimension; each new save overwrites the old file).
</span>

## Ⅲ Other Training Steps
@@ -643,7 +698,10 @@ python train_full_sft.py

Unlike RL algorithms such as PPO (Proximal Policy Optimization), which need a reward model and a value model,
DPO derives a closed-form solution for PPO's reward model, replacing the online reward model with offline data; the Ref model's outputs can be saved in advance.
DPO keeps performance nearly unchanged while only needing to run the actor_model and ref_model, which greatly reduces GPU memory overhead and improves training stability.
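For orientation, the core objective can be written in a few lines. The sketch below is a generic DPO loss over per-sample log-probabilities, not the literal code in `train_dpo.py`; the β value and tensor names are illustrative:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Generic DPO loss; each argument is a [batch] tensor of summed token log-probs."""
    # Log-ratios of the trainable actor model against the frozen reference model
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Encourage a positive margin between chosen and rejected responses
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```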
<span style="color:rgb(128,128,128)">
The RLHF training step is **not required**. It rarely improves the model's "intelligence" and is usually used only to improve its "politeness", with benefits (preference alignment, less harmful content) as well as drawbacks (costly sample collection, feedback bias, loss of diversity).
</span>

```bash
torchrun --nproc_per_node 1 train_dpo.py

@@ -651,7 +709,8 @@ torchrun --nproc_per_node 1 train_dpo.py

python train_dpo.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as `rlhf_*.pth` (* is the model dimension; each new save overwrites the old file).
</span>
### **4. Knowledge Distillation (KD)**

@@ -678,7 +737,8 @@ torchrun --nproc_per_node 1 train_full_sft.py

python train_full_sft.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are likewise saved every `100 steps` by default as `full_sft_*.pth` (* is the model dimension; each new save overwrites the old file).
</span>

What deserves emphasis here is MiniMind's white-box distillation code `train_distillation.py`. Since the MiniMind series has no stronger teacher model of its own, the white-box distillation code serves only as a learning reference.
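For reference, white-box (logit) distillation usually reduces to a soft-label KL term like the sketch below; the temperature and weighting are assumed values, and this is not the literal contents of `train_distillation.py`:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between temperature-softened teacher and student token distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 keeps gradient magnitudes comparable to the hard-label cross-entropy term
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```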
@@ -701,7 +761,10 @@ torchrun --nproc_per_node 1 train_lora.py

python train_lora.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as `lora_xxx_*.pth` (* is the model dimension; each new save overwrites the old file).
</span>
Many people wonder: how can the model learn knowledge of my private domain? How should the dataset be prepared? How do I turn a general-purpose model into a domain-specific one?
Here are a few examples. A general-purpose model may lack medical-domain knowledge; you can try injecting domain knowledge on top of the existing model to get better performance.
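As a reminder of what LoRA actually adds, here is a minimal sketch of a low-rank adapter wrapped around a frozen linear layer; the rank, scaling, and class name are illustrative and not the exact implementation in `train_lora.py`:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: y = Wx + scale * B(Ax)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # only the adapter trains
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale
```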
@@ -801,7 +864,8 @@ torchrun --nproc_per_node 1 train_distill_reason.py

python train_distill_reason.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as `reason_*.pth` (* is the model dimension; each new save overwrites the old file).
</span>

Give it a try:

@@ -858,16 +922,20 @@ MobileLLM argues that depth matters more than width; a "deep and narrow" ("slim"
## Training Results

<span style="color:rgb(128,128,128)">
Loss curves of the MiniMind2 models (the datasets were updated and re-cleaned several times after training, so the loss values are for reference only)
</span>

| models | pretrain (length-512) | sft (length-512) |
|-----------------|----------------------------------------------------|----------------------------------------------------|
| MiniMind2-Small | <img src="./images/pre_512_loss.png" width="100%"> | <img src="./images/sft_512_loss.png" width="100%"> |
| MiniMind2 | <img src="./images/pre_768_loss.png" width="100%"> | <img src="./images/sft_768_loss.png" width="100%"> |

### Training Complete - Model Collection

<span style="color:rgb(128,128,128)">
> Since many people reported that Baidu Netdisk is slow, MiniMind2 and all later models are hosted on ModelScope/HuggingFace.
</span>
#### Native PyTorch Models

@@ -875,17 +943,12 @@ MobileLLM argues that depth matters more than width; a "deep and narrow" ("slim"

* [MiniMind-V1 series (Baidu Netdisk)](https://pan.baidu.com/s/1KUfSzEkSXYbCCBj0Pw-9fA?pwd=6666)

<details style="color:rgb(128,128,128)">
<summary>Torch file naming reference</summary>

| Model Name | params | pretrain_model | sft_model | rl_model | reason_model | lora_model |
|-----------------|--------|------------------------|------------------------|--------------------|------------------|--------------------|
| MiniMind2-small | 26M | `pretrain_512.pth` | `full_sft_512.pth` | `rlhf_512.pth` | `reason_512.pth` | `lora_xxx_512.pth` |
| MiniMind2-MoE | 145M | `pretrain_640_moe.pth` | `full_sft_640_moe.pth` | `rlhf_640_moe.pth` | - | - |
| MiniMind2 | 104M | `pretrain_768.pth` | `full_sft_768.pth` | `rlhf_768.pth` | `reason_768.pth` | `lora_xxx_768.pth` |

@@ -897,6 +960,10 @@ MobileLLM argues that depth matters more than width; a "deep and narrow" ("slim"

</details>

#### Transformers Models

* MiniMind series ([ModelScope](https://www.modelscope.cn/profile/gongjy)
  | [HuggingFace](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5))

---
@@ -904,13 +971,14 @@ MobileLLM argues that depth matters more than width; a "deep and narrow" ("slim"

## RLHF vs SFT Comparison

The comparison is based on the `full_sft` and `rlhf` models of the `MiniMind2 series`; the random seed is fixed.

<details style="color:rgb(128,128,128)">
<summary>Q&A test</summary>

[A0] MiniMind2: full_sft_640

[A1] MiniMind2: rlhf_640

```text
[Q]: What is your name?
@@ -942,8 +1010,10 @@ The difference between DPO and online PPO is that both reject and chosen are prepared offline, which for minimind

## Ⅱ Subjective Sample Evaluation

<span style="color:rgb(128,128,128)">
🏃The following tests were completed on 2025-02-09; new models released after this date will not be added to the tests unless specifically needed.
</span>

[A] [MiniMind2 (0.1B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)<br/>
[B] [MiniMind2-MoE (0.15B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)<br/>

@@ -1032,13 +1102,17 @@ The difference between DPO and online PPO is that both reject and chosen are prepared offline, which for minimind

[H]: I am an AI model with very few parameters 🥺; my knowledge base is limited and I cannot answer your question directly. Try another question 👋
```
---

<span style="color:rgb(128,128,128)">
🙋Hand all of the questions above, together with the models' answers, directly to DeepSeek-R1 and let it comment, rank, and score them:
</span>

---

<details style="color:rgb(128,128,128)">
<summary>Detailed comments</summary>

### Scoring Criteria:

@@ -1116,11 +1190,16 @@ The difference between DPO and online PPO is that both reject and chosen are prepared offline, which for minimind
### 👉 Subjective Effect Summary

<span style="color:rgb(229,111,8)">
My personal subjective evaluation largely agrees with DeepSeek-R1; in particular:
</span>

* The ranking of the MiniMind series matches intuition very well: more parameters plus more thorough training data give higher scores, and hallucinations and errors are visibly rarer than in the smaller models.
* Model H's answers look decent to the naked eye, although there is some hallucinated, made-up content.
* Model G's training data may be incomplete; the released weights performed poorly in testing.
* Reciting the evergreen Scaling Law once more: the more parameters and the more training data, the stronger the model's performance.

## Ⅲ Objective Benchmark

@@ -1130,8 +1209,8 @@ The difference between DPO and online PPO is that both reject and chosen are prepared offline, which for minimind

The chosen test sets are C-Eval, CMMLU, A-CLUE, and TMMLU+, all pure-Chinese-language leaderboards.

<details style="color:rgb(128,128,128)">
<summary>Evaluation framework</summary>

The evaluation framework chosen is [lm-evaluation](https://github.com/EleutherAI/lm-evaluation-harness),
which makes launching an evaluation very convenient after installation:

English README

@@ -58,35 +58,28 @@

# 📌 Introduction

The emergence of Large Language Models (LLMs) has sparked unprecedented global attention on AI. Whether it's ChatGPT,
DeepSeek, or Qwen, their stunning performance leaves people in awe. However, the massive scale of hundreds of billions
of parameters makes it not only difficult to train them on personal devices, but also almost impossible to deploy them.
Opening the "black box" of large models and exploring their inner workings is exhilarating!

Sadly, 99% of explorations can only stop at fine-tuning existing large models with techniques like LoRA, learning a few
new commands or tasks. It's like teaching Newton how to use a 21st-century smartphone; though interesting, it completely
deviates from the original goal of understanding the essence of physics. Meanwhile, third-party large model frameworks
and toolkits, such as transformers+trl, almost only expose highly abstract interfaces. With just 10 lines of code, you
can complete the entire training process of "loading model + loading dataset + inference + reinforcement learning".
While this efficient encapsulation is convenient, it's like a high-speed spaceship, isolating us from the underlying
implementation and hindering our opportunity to dive deep into the core code of LLMs. However, "building a plane with
Legos is far more exciting than flying first-class!" What's worse, the internet is flooded with paid courses and
marketing accounts, selling AI tutorials with flawed and half-understood content.

Therefore, the goal of this project is to lower the learning threshold for LLMs, allowing everyone to start by
understanding each line of code, and to train a very small language model from scratch, not just performing
**inference**! With server costs of less than 3 RMB, you can experience the entire process of building a language model
from 0 to 1. Let's enjoy the fun of creation together!
> [!NOTE]
> (As of 2025-02-07) The MiniMind series has completed pretraining for multiple models, with the smallest one being only
> 25.8M (0.02B) and capable of smooth conversation!

<details style="color:rgb(128,128,128)">
<summary>Models List</summary>

| Model (Size) | Inference Usage (Approx.) | Release |

@@ -120,7 +113,7 @@ Let's enjoy the fun of creation together!
We hope this open-source project can help LLM beginners quickly get started!

### 👉**Update log**

<details close>
<summary> <b>2025-02-09 (newest 🎉🎉🎉)</b> </summary>

@@ -191,19 +184,28 @@ We hope this open-source project can help LLM beginners quickly get started!
# 📌 Quick Start

---

<div align="center" style="font-size: 1.5em; font-weight: bold; display: flex; align-items: center;">
  <a href="https://jingyaogong.github.io/minimind" style="text-decoration: none; display: flex; align-items: center; margin-right: 20px;">
    <img src="./images/logo2.png" alt="MiniMind Logo" style="vertical-align: middle; height: 30px;" />
    <span style="margin-left: 10px;color:#0877e5;font-style: italic">MiniMind Series</span>
  </a>
  <span style="color: #aaa; font-size: 1.2em; margin: 0 10px;">×</span>
  &nbsp;
  <a href="https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5" style="text-decoration: none; display: flex; align-items: center; margin-right: 20px;">
    <img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" alt="Hugging Face Logo" style="vertical-align: middle; height: 30px;" />
    <span style="margin-left: 10px;color:#d56100;font-style: italic">Hugging Face</span>
  </a>
  <span style="color: #aaa; font-size: 1.2em; margin: 0 10px;">&</span>
  <a href="https://www.modelscope.cn/profile/gongjy" style="text-decoration: none; display: flex; align-items: center; margin-left: 20px;">
    <img src="https://g.alicdn.com/sail-web/maas/1.15.0/static/modelscopeIcon.cd89353f.svg" alt="ModelScope Logo" style="vertical-align: middle; height: 30px;" />
  </a>
</div>
---

<details style="color:rgb(128,128,128)">
<summary>Sharing My Hardware and Software Configuration (For Reference Only)</summary>

* CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz

@@ -241,84 +243,105 @@ python eval_model.py --load 1
### 3. Or Start the WebUI

```bash
# You may need `python>=3.10`; install with `pip install streamlit`
# cd scripts
streamlit run web_demo.py
```
## Ⅱ Training from Scratch

### 1. Environment Setup

```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
<details style="color:rgb(128,128,128)">
<summary>Note: Test if Torch can use CUDA</summary>

```python
import torch
print(torch.cuda.is_available())  # should print True if CUDA is usable
```

If CUDA is not available, please download the `.whl` file
from [torch_stable](https://download.pytorch.org/whl/torch_stable.html) and install it. Refer to
this [link](https://blog.csdn.net/weixin_45456738/article/details/141029610?ops_request_misc=&request_id=&biz_id=102&utm_term=%E5%AE%89%E8%A3%85torch&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-2-141029610.nonecase&spm=1018.2226.3001.4187)
for guidance.

</details>
### 2. Data Download

Download the required data files from
the [dataset download link](https://www.modelscope.cn/datasets/gongjy/minimind-dataset/files) and place them in
the `./dataset` directory.

<details style="color:rgb(128,128,128)">
<summary>Note: Dataset Information</summary>

By default, it is recommended to download `pretrain_hq.jsonl` + `sft_mini_512.jsonl` for the fastest Zero-chat model
reproduction.

You can freely choose data files. Various combinations are provided below, and you can select according to your
training needs and GPU resources.

</details>
### 3. Start Training

**3.1 Pretraining (Learning Knowledge)**

```bash
python train_pretrain.py
```

<span style="color:rgb(128,128,128)">
> Execute pretraining to get `pretrain_*.pth` as the output weights for pretraining (where * represents the model dimension, default is 512).
</span>

**3.2 Supervised Fine-Tuning (Learning Dialogue Style)**

```bash
python train_full_sft.py
```

<span style="color:rgb(128,128,128);">
> Execute supervised fine-tuning to get `full_sft_*.pth` as the output weights for instruction fine-tuning (where `full` represents full-parameter fine-tuning).
</span>
---

<details style="color:rgb(128,128,128)">
<summary>Note: Training Information</summary>

By default, the model parameters are saved every 100 steps during training to `./out/***.pth` (each save overwrites
the old weight file).

For simplicity, only the two training stages are listed here. For other training methods (LoRA, distillation,
reinforcement learning, fine-tuning for reasoning, etc.), refer to the detailed explanation in the [Experiments]
section below.

</details>
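If you want to check what a saved checkpoint contains before evaluating it, here is a minimal sketch; it assumes the `.pth` file holds a plain PyTorch state dict, as the save notes above suggest, and the path is just an example:

```python
import torch

# Inspect a saved checkpoint without constructing the full model class.
state = torch.load("./out/full_sft_512.pth", map_location="cpu")

total_params = sum(t.numel() for t in state.values())
print(f"{len(state)} tensors, ~{total_params / 1e6:.1f}M parameters")
for name, tensor in list(state.items())[:5]:   # first few entries
    print(name, tuple(tensor.shape))
```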
### 4. Testing Model Performance

Ensure that the model `*.pth` file you want to test is located in the `./out/` directory.
Alternatively, you can download and use the `*.pth` files I trained
from [here](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch/files).

```bash
python eval_model.py --model_mode 1 # Default is 0: test the pretrain model; set to 1: test the full_sft model
```

<details style="color:rgb(128,128,128)">
<summary>Note: Testing Information</summary>

For more details, check the `eval_model.py` script. The model_mode options are 0: pretrained model, 1: SFT-Chat model,
2: RLHF-Chat model, 3: Reason model.

</details>
---

@@ -332,8 +355,8 @@ Start training with N GPUs on a single machine (DDP, supports multi-node, multi-

torchrun --nproc_per_node 3 train_xxx.py
```

<details style="color:rgb(128,128,128)">
<summary>Note: Others</summary>

* Start training with N GPUs on a single machine (DeepSpeed):
  ```bash
@@ -373,8 +396,8 @@ vocabulary, with the model decoding the final output into natural language. Since

strictly controlled, the vocabulary length should be kept short to avoid the embedding layer dominating the model's
overall parameters. Thus, a smaller vocabulary size is beneficial.

<details style="color:rgb(128,128,128)">
<summary>Tokenizer Details</summary>

Here are the vocabulary sizes of several popular open-source models:
@@ -518,8 +541,8 @@ Place the downloaded dataset files in the `./dataset/` directory (✨ required files)

└── tokenizer_train.jsonl (1GB)
```

<details style="color:rgb(128,128,128)">
<summary>Dataset Descriptions</summary>

* `dpo.jsonl` -- RLHF dataset
* `lora_identity.jsonl` -- Self-identity dataset (e.g., "Who are you? I'm MiniMind..."), recommended for LoRA training
  (also usable for full-parameter SFT; don't be limited by the name)
@@ -538,21 +561,24 @@ Place the downloaded dataset files in the `./dataset/` directory (✨ required files)

![dataset](./images/dataset.jpg)

<details style="color:rgb(128,128,128)">
<summary>Explanation & Recommended Training Plans</summary>

* The MiniMind2 Series has been trained on approximately 20GB of corpus, or about 4B tokens, corresponding to the data
  combination results above (Cost: 💰💰💰💰💰💰💰💰, Effect: 😊😊😊😊😊😊).
* For the fastest Zero-model implementation from scratch, it is recommended to use the data combination
  of `pretrain_hq.jsonl` + `sft_mini_512.jsonl`. The specific costs and effects can be seen in the table below (Cost:
  💰, Effect: 😊😊).
* For those with sufficient computational resources or more focus on results, it is advisable to fully reproduce
  MiniMind2 with the first option; if you only have a single GPU or prefer a quick reproduction within a short time,
  the second option is strongly recommended.
* [Compromise Plan] You can also freely combine medium-sized data like `sft_mini_512.jsonl`, `sft_1024.jsonl` for
  training (Cost: 💰💰💰, Effect: 😊😊😊😊).

</details>
# 📌 Model Structure

@@ -599,7 +625,7 @@ Reference model parameter versions are shown in the table below:

## Training Cost

- **Time Unit**: Hours (h).
- **Cost Unit**: RMB (¥); 7¥ ≈ 1 USD.
- **3090 Rental Unit Price**: ≈ 1.3¥/h (subject to real-time market rates).
- **Reference Standard**: The table only shows the actual training time for the `pretrain` and `sft_mini_512` datasets.
  Other times are estimated based on dataset size (there may be some discrepancies).
@@ -613,50 +639,60 @@ Reference model parameter versions are shown in the table below:

---

<details style="color:rgb(128,128,128)">
<summary>Training Cost Summary & Prediction</summary>

> MiniMind2-Small Parameters
>> `pretrain_hq` + `sft_mini_512` dataset
<br/>Single 3090 GPU (1 epoch) + 2.1 hours + Cost: 2.73 RMB
<br/>You can train the MiniMind-Zero-0.025B model from scratch!!!

> MiniMind2-Small Parameters
>> `pretrain_hq` + `sft_512` + `sft_2048` + `dpo` dataset
<br/>Single 3090 GPU (2 epochs) + approximately 38.16 hours + Cost: 49.61 RMB
<br/>You can train the MiniMind2-Small-0.025B model from scratch!!!

> MiniMind2 Parameters
>> `pretrain_hq` + `sft_512` + `sft_2048` + `dpo` dataset
<br/>Single 3090 GPU (2 epochs) + approximately 122 hours + Cost: 158.6 RMB
<br/>You can train the MiniMind2-0.1B model from scratch!!!

</details>

<span style="color:rgb(0,147,64)">
With a single NVIDIA 3090 GPU, you can train `MiniMind-Zero` from scratch in just `2 hours` and for a cost of
only `3 RMB`, achieving ChatBot functionality!
</span>

✨PS: If training on 8× 4090 GPUs, the total time can be compressed to under 10 minutes! (Despite the shorter time,
the cost is still around 3 RMB, comparable to the single-GPU cost.)

✨This enables an ultra-low barrier to entry, making it possible for everyone to experiment with large models! This is
the original purpose behind the creation of the MiniMind series!

<span style="color:rgb(0,147,64)">
The `MiniMind-Zero` model, which costs only `3 RMB`, is not a gimmick! Chat test results:
</span>
```textmate
👶: Please introduce yourself.

🤖️: As an artificial intelligence, I don't have a real life or self-awareness, so I don't have my own life. I am designed to help users answer questions, provide information, engage in conversations, etc. My design and functionality are built using computer science, machine learning, artificial intelligence technologies, and algorithms, which I implement through programming and algorithms.

👶: Recommend some local specialties from Hangzhou.

🤖️: Hangzhou has many delicious local specialties, such as chicken skewers, eel heads, water-boiled fish, tofu pudding, etc. These foods are all very tasty, with unique flavors and textures, definitely worth trying.

👶: Please explain the concept of "large language models".

🤖️: A large language model is an AI technology that uses large amounts of text data to train the model, and then continuously optimizes the model through iterations. Its goal is to generate text based on the input data to meet specific needs, and the text can be about any topic you are interested in.

👶: Introduce the history of the United States, please.

🤖️: You mentioned "Introok's the believeations of theument." This name originates from the ancient Chinese "groty of of the change."
```

<span style="color:rgb(0,147,64)">
Fast and effective; it is still possible to compress the training process further with smaller, higher-quality datasets.
</span>

The Zero model weights are saved as `full_sft_512_zero.pth` (see the MiniMind model file link below). Feel free to download and test the model's performance.
## Ⅱ Main Training Steps

@@ -677,8 +713,10 @@ torchrun --nproc_per_node 1 train_pretrain.py # 1 represents single-card training

python train_pretrain.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as: `pretrain_*.pth` (the * represents the specific
model dimension, and each new save will overwrite the previous one).
</span>

### **2. Supervised Fine-Tuning (SFT)**:

@@ -703,8 +741,10 @@ torchrun --nproc_per_node 1 train_full_sft.py

python train_full_sft.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as: `full_sft_*.pth` (the * represents the specific
model dimension, and each new save will overwrite the previous one).
</span>
## Ⅲ Other Training Steps

@@ -732,8 +772,10 @@ torchrun --nproc_per_node 1 train_dpo.py

python train_dpo.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as: `rlhf_*.pth` (the * represents the specific model
dimension, and each new save will overwrite the previous one).
</span>

### **4. Knowledge Distillation (KD)**

@@ -768,8 +810,10 @@ torchrun --nproc_per_node 1 train_full_sft.py

python train_full_sft.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as: `full_sft_*.pth` (the * represents the specific
model dimension, and each new save will overwrite the previous one).
</span>

This section emphasizes MiniMind's white-box distillation code `train_distillation.py`. Since MiniMind doesn't have a
powerful teacher model within the same series, the white-box distillation code serves as a learning reference.
@@ -796,8 +840,10 @@ torchrun --nproc_per_node 1 train_lora.py

python train_lora.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as: `lora_xxx_*.pth` (the * represents the specific
model dimension, and each new save will overwrite the previous one).
</span>

Many people are puzzled: how can a model learn private domain knowledge? How should datasets be prepared? How to
transfer general models into specialized domain models?
@@ -918,8 +964,10 @@ torchrun --nproc_per_node 1 train_distill_reason.py

python train_distill_reason.py
```

<span style="color:rgb(128,128,128)">
> The trained model weights are saved every `100 steps` by default as: `reason_*.pth` (* being the specific dimension of
the model; each time a new file is saved, it will overwrite the old one).
</span>

Test it:

@@ -993,31 +1041,28 @@ For reference, the parameter settings for GPT-3 are shown in the table below:

| MiniMind2-Small | <img src="./images/pre_512_loss.png" width="100%"> | <img src="./images/sft_512_loss.png" width="100%"> |
| MiniMind2 | <img src="./images/pre_768_loss.png" width="100%"> | <img src="./images/sft_768_loss.png" width="100%"> |
### Training Completed - Model Collection

<span style="color:rgb(128,128,128)">
> Considering that many people have reported slow speeds with Baidu Netdisk, all MiniMind2 models and beyond are hosted on ModelScope/HuggingFace.
</span>

#### Native PyTorch Models

* [MiniMind2 Series (ModelScope)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)
* [MiniMind-V1 Series (Baidu Netdisk)](https://pan.baidu.com/s/1KUfSzEkSXYbCCBj0Pw-9fA?pwd=6666)
<details style="color:rgb(128,128,128)">
<summary>Torch File Naming Reference</summary>

| Model Name | params | pretrain_model | sft_model | rl_model | reason_model | lora_model |
|-----------------|--------|------------------------|------------------------|--------------------|------------------|--------------------|
| MiniMind2-small | 26M | `pretrain_512.pth` | `full_sft_512.pth` | `rlhf_512.pth` | `reason_512.pth` | `lora_xxx_512.pth` |
| MiniMind2-MoE | 145M | `pretrain_640_moe.pth` | `full_sft_640_moe.pth` | `rlhf_640_moe.pth` | - | - |
| MiniMind2 | 104M | `pretrain_768.pth` | `full_sft_768.pth` | `rlhf_768.pth` | `reason_768.pth` | `lora_xxx_768.pth` |

| Model Name | params | pretrain_model | Single-turn Chat sft | Multi-turn Chat sft | rl_model |
|-------------------|--------|------------------------|------------------------------------|-----------------------------------|--------------|
| minimind-v1-small | 26M | `pretrain_512.pth` | `single_chat/full_sft_512.pth` | `multi_chat/full_sft_512.pth` | `rl_512.pth` |
| minimind-v1-moe | 4×26M | `pretrain_512_moe.pth` | `single_chat/full_sft_512_moe.pth` | `multi_chat/full_sft_512_moe.pth` | - |

@@ -1025,19 +1070,25 @@ For reference, the parameter settings for GPT-3 are shown in the table below:

</details>

#### Transformers Models

* MiniMind
  Series ([ModelScope](https://www.modelscope.cn/profile/gongjy) | [HuggingFace](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5))

---
# 📌 Eval

## RLHF vs SFT Comparison

The test is based on comparing `MiniMind2 series` models trained with "full_sft" and "rlhf". The random seed is fixed.

<details style="color:rgb(128,128,128)">
<summary>Q&A Test</summary>

[A0] MiniMind2: full_sft_640

[A1] MiniMind2: rlhf_640

```text
[Q]: What is your name?
@@ -1090,9 +1141,10 @@ rather than using the PPO method where the reward model acts as a "coach" to correct

## Ⅱ Subjective Sample Evaluation

<span style="color:rgb(128,128,128)">
🏃The following tests were completed on February 9, 2025. New models released after this date will not be included in the tests unless there is a special need.
</span>

[A] [MiniMind2 (0.1B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)<br/>

@@ -1176,13 +1228,16 @@ rather than using the PPO method where the reward model acts as a "coach" to correct

[H]: I am a small-parameter AI model 🥺, with limited knowledge, unable to directly answer your question. Try another question 👋
```
---

<span style="color:rgb(128,128,128)">
🙋Give all the questions and the models' answers above directly to DeepSeek-R1, and let it comment, rank, and score them:
</span>

---

<details style="color:rgb(128,128,128)">
<summary>Specific comments</summary>

### Scoring Criteria:

@@ -1268,7 +1323,9 @@ rather than using the PPO method where the reward model acts as a "coach" to correct
### 👉 Subjective Effect Summary

<span style="color:rgb(229,111,8)">
My personal evaluation largely aligns with DeepSeek-R1's results; in particular:
</span>

* The ranking of the MiniMind series is very intuitive. The larger the parameters and the more training data, the higher
  the score, and hallucinations and errors are less noticeable than with smaller models.
@@ -1284,8 +1341,8 @@ Chinese models.

Instead, we'll focus on a selection of <1B micro-models for a comparative analysis.
The test sets chosen include C-Eval, CMMLU, A-CLUE, and TMMLU+, which are pure Chinese language leaderboards.

<details style="color:rgb(128,128,128)">
<summary>Evaluation Framework</summary>

The evaluation framework chosen is [lm-evaluation](https://github.com/EleutherAI/lm-evaluation-harness),
which is very easy to set up and run after installation:

images/logo2.png (new binary file, 615 KiB; not shown)