diff --git a/README.md b/README.md
index 34c9d27..ba06419 100644
--- a/README.md
+++ b/README.md
@@ -50,6 +50,26 @@
 ---
 
+
+<div align="center">
+  [MiniMind Logo]
+
+  [Multi Icon] &nbsp;&nbsp;&nbsp;
+
+  [Hugging Face Logo]
+
+  [Multi Icon]
+
+  [ModelScope Logo]
+
+</div>
+
+---
+
 # 📌 Introduction
 
@@ -173,28 +193,6 @@
 # 📌 Quick Start
 
----
-
-<div align="center">
-  [MiniMind Logo] MiniMind Series
-
-  ×
-  &nbsp;
-
-  [Hugging Face Logo] Hugging Face
-
-  &
-
-  [ModelScope Logo]
-</div>
-
----
-
-
Sharing my hardware and software configuration (for reference only)

@@ -282,9 +280,7 @@ print(torch.cuda.is_available())
 python train_pretrain.py
 ```
-
 > Execute pretraining to obtain `pretrain_*.pth` as the pretraining output weights (where * is the model dimension, 512 by default).
-
 
 **3.2 Supervised Fine-Tuning (learning the dialogue format)**
 
@@ -293,9 +289,8 @@ python train_pretrain.py
 python train_full_sft.py
 ```
-
 > Execute supervised fine-tuning to obtain `full_sft_*.pth` as the instruction fine-tuning output weights (where `full` means full-parameter fine-tuning).
-
+
 
 ---
 
@@ -489,7 +484,7 @@ quality (admittedly still not high; improving data quality is a never-ending task).
 
 MiniMind training datasets ([ModelScope](https://www.modelscope.cn/datasets/gongjy/minimind-dataset/files) | [HuggingFace](https://huggingface.co/datasets/jingyaogong))
 
-> No need to clone everything; you can download just the files you need.
+> No need to clone everything; you can download just the files you need.
 
 Place the downloaded dataset files in the `./dataset/` directory (✨ marks the recommended required items).
 
@@ -614,17 +609,15 @@ the same overall structure as MiniMind, differing only in RoPE computation, the inference function, and the FFN layer
 
-✨
-Training `MiniMind-Zero` from scratch on a single NVIDIA 3090 takes only `2 hours` + `3 RMB`, yet achieves a working ChatBot!
-
+✨Training `MiniMind-Zero` from scratch on a single NVIDIA 3090 takes only `2 hours` + `3 RMB`, yet achieves a working ChatBot!
+
 ✨PS: With 8×4090 GPUs, the total training time can even be compressed to under 10 minutes! (Because the time is shorter, the cost stays around 3 RMB, comparable to the single-GPU cost.)
 
 ✨An extremely low barrier to entry brings large-model freedom to everyone; this is exactly why the MiniMind series was created!
 
-✨
-The `3 RMB` `MiniMind-Zero` is not a gimmick! Chat test:
-
+✨The `3 RMB` `MiniMind-Zero` is not a gimmick! Chat test:
+
 ```textmate
 👶: Please introduce yourself.
 🤖️: As an AI, I don't have a real life or self-awareness, so I don't have a life of my own. I exist in a computer program for the purpose of processing and answering questions; my knowledge and functions are learned and updated through that program.
 👶: Recommend some specialty foods from Hangzhou.
 🤖️: Hangzhou has many delicious specialties, such as chicken skewers, eel heads, boiled fish, and tofu pudding. These foods are all very tasty, with unique flavors and textures, and are definitely worth a try.
 👶: Please explain the concept of a "large language model".
 🤖️: A large language model is an artificial intelligence technology that trains a model on large amounts of text data and keeps optimizing it through continuous iteration. Its goal is to generate text that meets specific needs based on the input, covering any topic you are interested in.
 👶: Introduce the history of the United States, please.
 🤖️: You mentioned "Introok's the believeations of theument." This name originates from the ancient Chinese "groty of of the change."
 ```
-
 Extremely fast and already showing initial results; the training data could even be further compressed into a smaller, higher-quality set.
-
 The Zero model weights are saved as `full_sft_512_zero.pth` (see the MiniMind model file links below); feel free to download it and verify this model's performance.
 
@@ -663,10 +654,8 @@ torchrun --nproc_per_node 1 train_pretrain.py # 1 means single-GPU training; adjust to your hardware
 python train_pretrain.py
 ```
-
 > The trained model weights are saved by default every `100 steps` as: `pretrain_*.pth` (* is the model dimension; each new save overwrites the old file)
-
 
 ### **2. Supervised Fine-Tuning (SFT)**:
 
@@ -684,8 +673,8 @@ torchrun --nproc_per_node 1 train_full_sft.py
 python train_full_sft.py
 ```
-> The trained model weights are saved by default every `100 steps` as: `full_sft_*.pth` (*
-is the model dimension; each new save overwrites the old file)
+> The trained model weights are saved by default every `100 steps` as: `full_sft_*.pth` (*
+is the model dimension; each new save overwrites the old file)
 
 ## Ⅲ Other Training Steps
 
@@ -699,9 +688,9 @@ python train_full_sft.py
 DPO derives a closed-form solution of the PPO reward model, replacing the online reward model with offline data; the Ref model's outputs can be saved in advance.
 DPO performance stays nearly the same, and only the actor_model and ref_model need to run, which greatly reduces GPU memory overhead and improves training stability.
-
-Note: the RLHF training step is **not required**. It rarely improves the model's "intelligence" and is usually used only to improve its "politeness", with benefits (preference alignment, less harmful content) and drawbacks (expensive sample collection, feedback bias, loss of diversity).
-
+
+> Note: the RLHF training step is **not required**. It rarely improves the model's "intelligence" and is usually used only to improve its "politeness", with benefits (preference alignment, less harmful content) and drawbacks (expensive sample collection, feedback bias, loss of diversity).
+
 
 ```bash
 torchrun --nproc_per_node 1 train_dpo.py
 # or
 python train_dpo.py
 ```
 
-> The trained model weights are saved by default every `100 steps` as: `rlhf_*.pth` (*
-is the model dimension; each new save overwrites the old file)
+> The trained model weights are saved by default every `100 steps` as: `rlhf_*.pth` (*
+is the model dimension; each new save overwrites the old file)
 
 ### **4. Knowledge Distillation (KD)**
 
@@ -737,8 +726,7 @@ torchrun --nproc_per_node 1 train_full_sft.py
 python train_full_sft.py
 ```
-> The trained model weights are likewise saved by default every `100 steps` as: `full_sft_*.pth` (*
-is the model dimension; each new save overwrites the old file)
+> The trained model weights are likewise saved by default every `100 steps` as: `full_sft_*.pth` (* is the model dimension; each new save overwrites the old file)
 
 MiniMind's white-box distillation code `train_distillation.py` deserves special mention here; since the MiniMind series itself has no stronger teacher model, the white-box distillation code serves only as a learning reference.
 
@@ -761,10 +749,10 @@ torchrun --nproc_per_node 1 train_lora.py
 python train_lora.py
 ```
-
+
 > The trained model weights are saved by default every `100 steps` as: `lora_xxx_*.pth` (* is the model dimension; each new save overwrites the old file)
-
+
 A great many people wonder: how do you get a model to learn knowledge of your own private domain? How should the dataset be prepared? How do you turn a general-domain model into a vertical-domain model? (See the minimal LoRA sketch below.)
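To make the LoRA step above concrete, here is a minimal, self-contained PyTorch sketch of the core mechanism: a frozen pretrained linear layer plus a trainable low-rank update. It is an illustrative sketch only, not MiniMind's actual `train_lora.py` implementation; the class name `LoRALinear`, the rank `r`, and the scaling `alpha` are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Hypothetical example: wraps a frozen nn.Linear with a trainable
    low-rank update, y = Wx + scale * (B A) x, as in standard LoRA."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weight stays frozen
        # A: small random init; B: zero init, so the update starts as a no-op
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + low-rank path
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

# Usage: wrap a projection layer, then train only the LoRA parameters on domain data.
layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))
trainable = [p for p in layer.parameters() if p.requires_grad]  # only lora_a / lora_b
```

Only `lora_a` and `lora_b` receive gradients, which is why LoRA checkpoints such as `lora_xxx_*.pth` stay tiny compared to the full model weights.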
Here are a few examples. A general-purpose model may be short on medical-domain knowledge, so you can try adding domain knowledge on top of the original model to obtain better performance in that domain.

@@ -864,8 +852,7 @@ torchrun --nproc_per_node 1 train_distill_reason.py
 python train_distill_reason.py
 ```
-> The trained model weights are saved by default every `100 steps` as: `reason_*.pth` (*
-is the model dimension; each new save overwrites the old file)
+> The trained model weights are saved by default every `100 steps` as: `reason_*.pth` (* is the model dimension; each new save overwrites the old file)
 
 Test it:
 
@@ -922,9 +909,9 @@ MobileLLM argues that architectural depth matters more than width: a "deep and narrow" slim
 ## Ⅴ Training Results
 
-
+
 MiniMind2 training-loss curves (the dataset was updated and re-cleaned several times after training, so the loss values are for reference only)
-
+
 | models | pretrain (length-512) | sft (length-512) |
 |-----------------|----------------------------------------------------|----------------------------------------------------|
 
 ### Training Completed - Model Collection
 
-
+
 > Since many people reported that Baidu Netdisk is slow, MiniMind2 and all later models are hosted on ModelScope/HuggingFace.
-
+
 
 #### Native PyTorch Models
 
@@ -1010,9 +997,7 @@ The difference between DPO and online PPO is that both the rejected and chosen samples are prepared offline, which is consistent with minimind
 ## Ⅱ Subjective Sample Evaluation
 
-
 🏃The following tests were completed on 2025-02-09. New models released after this date will not be added to the tests unless specially needed.
-
 
 [A] [MiniMind2 (0.1B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)<br/>
@@ -1104,11 +1089,7 @@
 ---
 
-
 🙋‍Feed all of the questions above, together with each model's answers, to DeepSeek-R1 and let it review, rank, and score them:
-
-
----
@@ -1190,9 +1171,7 @@
 ### 👉Subjective Effect Summary
 
-
 My personal subjective evaluation largely agrees with DeepSeek-R1's, specifically:
-
 
 * The ranking of the MiniMind series is very intuitive: larger parameters plus more thorough training data yield higher scores, and hallucinations and errors are visibly better than in the smaller models.

diff --git a/README_en.md b/README_en.md
index 7e0654d..b6d474e 100644
--- a/README_en.md
+++ b/README_en.md
@@ -54,6 +54,27 @@
 ---
 
+
+<div align="center">
+  [MiniMind Logo]
+
+  [Multi Icon] &nbsp;&nbsp;&nbsp;
+
+  [Hugging Face Logo]
+
+  [Multi Icon]
+
+  [ModelScope Logo]
+
+</div>
+
+---
+
+
 # 📌 Introduction
 
@@ -184,26 +205,6 @@ We hope this open-source project can help LLM beginners quickly get started!
 # 📌 Quick Start
 
----
-
-<div align="center">
-  [MiniMind Logo] MiniMind Series
-
-  ×
-  &nbsp;
-
-  [Hugging Face Logo] Hugging Face
-
-  &
-
-  [ModelScope Logo]
-</div>
-
----
-
Sharing My Hardware and Software Configuration (For Reference Only)

@@ -297,9 +298,8 @@ needs and GPU resources.
 python train_pretrain.py
 ```
-
 > Execute pretraining to get `pretrain_*.pth` as the output weights for pretraining (where * represents the model dimension, default is 512).
-
+
 
 **3.2 Supervised Fine-Tuning (Learning Dialogue Style)**
 
@@ -307,9 +307,8 @@ python train_pretrain.py
 python train_full_sft.py
 ```
-
 > Execute supervised fine-tuning to get `full_sft_*.pth` as the output weights for instruction fine-tuning (where `full` represents full parameter fine-tuning).
-
+
 
 ---
 
@@ -660,10 +659,8 @@ Reference model parameter versions are shown in the table below:
-✨
-With a single NVIDIA 3090 GPU, you can train `MiniMind-Zero` from scratch in just `2 hours` and for a cost of
-only `3 RMB`, achieving ChatBot functionality!
-
+✨With a single NVIDIA 3090 GPU, you can train `MiniMind-Zero` from scratch in just `2 hours` and for a cost of
+only `3 RMB`, achieving ChatBot functionality!
+
 ✨PS: If training on 8 GPUs with 4090s, the total time can be compressed to under 10 minutes! (Despite the shorter time,
 the cost is still around 3 RMB, which is comparable to the single GPU cost.)
 
@@ -671,9 +668,7 @@ the cost is still around 3 RMB, which is comparable to the single GPU cost.)
 ✨This enables ultra-low barriers to entry, making it possible for everyone to experiment with large models! This is the
 original purpose behind the creation of the MiniMind series!
 
-✨
-The `MiniMind-Zero` model, which costs only `3 RMB`, is not a gimmick! Chat test results:
-
+✨The `MiniMind-Zero` model, which costs only `3 RMB`, is not a gimmick! Chat test results:
 
 ```textmate
 👶: Please introduce yourself.
 🤖️: As an AI, I don't have an actual life or self-awareness, so I don't have my own life. I exist in a computer program, with the purpose of processing and answering questions. My knowledge and functions are learned and updated through computer programs.
 👶: Recommend some special foods from Hangzhou.
 🤖️: Hangzhou has many delicious special foods, such as chicken skewers, eel heads, boiled fish, and tofu pudding. These foods are all very tasty, with unique flavors and textures. They are definitely worth trying.
 👶: Please explain the concept of "large language models".
 🤖️: Large language models are a type of artificial intelligence technology that uses large amounts of text data to train models, then continuously optimizes them through iteration. Their goal is to generate text that meets specific needs based on the input data, and this text can cover any topic you are interested in.
 👶: Introduce the history of the United States, please.
 🤖️: You mentioned "Introok's the believeations of theument." This name originates from the ancient Chinese "groty of of the change."
 ```
-
 Fast and effective, it is still possible to further compress the training process by obtaining smaller and higher-quality datasets.
-
 The Zero model weights are saved as `full_sft_512_zero.pth` (see the MiniMind model file link below). Feel free to download and test the model's performance.
 
 ## Ⅱ Main Training Steps
 
@@ -713,10 +706,9 @@ torchrun --nproc_per_node 1 train_pretrain.py # 1 represents single-card training
 python train_pretrain.py
 ```
-
 > The trained model weights are saved every `100 steps` by default as: `pretrain_*.pth` (the * represents the specific model dimension, and each new save will overwrite the previous one).
-
+
 
 ### **2. Supervised Fine-Tuning (SFT)**:
 
@@ -741,10 +733,8 @@ torchrun --nproc_per_node 1 train_full_sft.py
 python train_full_sft.py
 ```
-
 > The trained model weights are saved every `100 steps` by default as: `full_sft_*.pth` (the * represents the specific model dimension, and each new save will overwrite the previous one).
-
 
 ## Ⅲ Other Training Steps
 
@@ -772,10 +762,8 @@ torchrun --nproc_per_node 1 train_dpo.py
 python train_dpo.py
 ```
-
 > The trained model weights are saved every `100 steps` by default as: `rlhf_*.pth` (the * represents the specific model dimension, and each new save will overwrite the previous one). A minimal sketch of the DPO objective appears at the end of this section.
-
 
 ### **4. Knowledge Distillation (KD)**
 
@@ -810,10 +798,8 @@ torchrun --nproc_per_node 1 train_full_sft.py
 python train_full_sft.py
 ```
-
 > The trained model weights are saved every `100 steps` by default as: `full_sft_*.pth` (the * represents the specific model dimension, and each new save will overwrite the previous one).
-
 
 This section emphasizes MiniMind's white-box distillation code `train_distillation.py`. Since MiniMind doesn't have a powerful teacher model within the same series, the white-box distillation code serves as a learning reference.
 
@@ -840,10 +826,8 @@ torchrun --nproc_per_node 1 train_lora.py
 python train_lora.py
 ```
-
 > The trained model weights are saved every `100 steps` by default as: `lora_xxx_*.pth` (the * represents the specific model dimension, and each new save will overwrite the previous one).
-
 
 Many people are puzzled: how can a model learn private-domain knowledge? How should datasets be prepared? How can a general model be turned into a specialized domain model?
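As referenced in step 3 above, here is a minimal sketch of the DPO objective that a `train_dpo.py`-style script optimizes, assuming per-sequence log-probabilities have already been computed for the actor and the frozen reference model. The function name, argument names, and the `beta` value are assumptions made for illustration, not MiniMind's actual code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch of the standard DPO loss from per-sequence
    log-probabilities (one scalar per sample in the batch)."""
    # Implicit reward of each answer: log pi(y|x) - log pi_ref(y|x)
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Maximize the margin between the chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy check: the loss shrinks as the policy prefers the chosen answer
# more strongly than the reference model does.
loss = dpo_loss(torch.tensor([-4.0]), torch.tensor([-7.0]),
                torch.tensor([-5.0]), torch.tensor([-5.0]))
```

Because the reference log-probabilities depend only on the frozen ref_model, they can be computed once offline and cached, which is exactly why DPO needs no online reward model and only the actor_model and ref_model at training time.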
@@ -964,10 +948,8 @@ torchrun --nproc_per_node 1 train_distill_reason.py
 python train_distill_reason.py
 ```
-
 > The trained model weights are saved every `100 steps` by default as: `reason_*.pth` (* being the specific dimension of the model; each new save overwrites the old file).
-
 
 Test it:
 
@@ -1043,9 +1025,7 @@ For reference, the parameter settings for GPT-3 are shown in the table below:
 ### Training Completed - Model Collection
 
-
 > Considering that many people have reported slow speeds with Baidu Cloud, all MiniMind2 models and beyond will be hosted on ModelScope/HuggingFace.
-
 
 #### Native PyTorch Models
 
@@ -1141,11 +1121,7 @@ rather than using the PPO method where the reward model acts as a "coach" to correct
 ## Ⅱ Subjective Sample Evaluation
 
-
 🏃The following tests were completed on February 9, 2025. New models released after this date will not be included in the tests unless there is a special need.
-
-
-
 [A] [MiniMind2 (0.1B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)<br/>
[B] [MiniMind2-MoE (0.15B)](https://www.modelscope.cn/models/gongjy/MiniMind2-PyTorch)<br/>
@@ -1230,11 +1206,8 @@
 ---
 
-
 🙋‍Directly hand all of the questions above, together with each model's answers, to DeepSeek-R1 and let it comment, rank, and score them:
-
----
Specific comments

@@ -1323,9 +1296,7 @@
 ### 👉 Subjective Effect Summary
 
-
-> My personal evaluation aligns with DeepSeek-R1's results, and:
-
+My personal evaluation aligns with DeepSeek-R1's results, and:
 
 * The ranking of the MiniMind series is very intuitive. The larger the parameters and the more training data, the higher the score, and hallucinations and errors are less noticeable than with smaller models.

diff --git a/images/multi.png b/images/multi.png
new file mode 100644
index 0000000..0334c93
Binary files /dev/null and b/images/multi.png differ