update readme
This commit is contained in: parent 1b864453fa, commit eb875da306
85 README.md
@@ -28,7 +28,10 @@
* This open-source project aims to train a tiny 26M language model, **MiniMind**, completely from scratch, in as little as 3 hours!
* **MiniMind** is extremely lightweight, roughly $\frac{1}{7000}$ the size of GPT-3, aiming to make fast inference and even training possible on the most ordinary personal GPU.
* **MiniMind** improves on the DeepSeek-V2 and Llama3 architectures. The project covers all stages of data processing, pretrain, sft, and dpo, and includes a Mixture of Experts (MoE) model.
* This is at once an open-source project, an introductory LLM tutorial, and a nascent open-source model, offered in the hope of inspiring further work.
* This is not only the implementation of an open-source model, but also a tutorial for getting started with large language models (LLMs).
* We hope this project gives researchers an introductory example that helps them get started quickly and sparks more exploration and innovation in the LLM field.

> To avoid misreading, "as fast as 3 hours" means you need a machine with hardware specs exceeding my own; detailed specifications are provided below.

---

@@ -53,7 +56,7 @@ https://github.com/user-attachments/assets/88b98128-636e-43bc-a419-b1b1403c2055

Train an extremely lightweight language model directly from scratch.

> [!TIP]
> (As of 2024-09-17) minimind has trained 3 model variants; the smallest, at just 26M (0.02B), is capable of fluent conversation!
> (As of 2024-09-17) The MiniMind series has completed pretraining of 3 model variants; the smallest, at just 26M (0.02B), is capable of fluent conversation!

| Model (size)            | Tokenizer length | Inference memory | Release    | Subjective score (/100) |
|-------------------------|------------------|------------------|------------|-------------------------|
@@ -61,7 +64,7 @@ https://github.com/user-attachments/assets/88b98128-636e-43bc-a419-b1b1403c2055
| minimind-v1-moe (4×26M) | 6400 | 1.0 GB | 2024.09.17 | 55' |
| minimind-v1 (108M)      | 6400 | 1.0 GB | 2024.09.01 | 60' |

> This analysis was run on a single RTX 3090 GPU with Torch 2.1.2, CUDA 12.2, and Flash Attention 2.
> This analysis was run on 2×RTX 3090 GPUs with Torch 2.1.2, CUDA 12.2, and Flash Attention 2.

@@ -77,10 +80,19 @@ https://github.com/user-attachments/assets/88b98128-636e-43bc-a419-b1b1403c2055

### 👉**Recent Updates**

<details close>
<summary> <b>2024-10-05 (newest 🎉)</b> </summary>

- Extended MiniMind with a multimodal capability: vision

- See the twin project [minimind-v](https://github.com/jingyaogong/minimind-v) for details!

</details>

<details close>
<summary> <b>2024-09-27</b> </summary>

- 👉 09-27: Updated the preprocessing of the pretrain dataset; to preserve text integrity, the pre-tokenized .bin training format was dropped (at a slight cost in training speed).
- 09-27: Updated the preprocessing of the pretrain dataset; to preserve text integrity, the pre-tokenized .bin training format was dropped (at a slight cost in training speed).

- The preprocessed pretrain file is now named: pretrain_data.csv.
@@ -119,6 +131,13 @@ https://github.com/user-attachments/assets/88b98128-636e-43bc-a419-b1b1403c2055

This is just my personal software/hardware environment; adjust as appropriate:

```bash
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
Memory: 128 GB
GPU: NVIDIA GeForce RTX 3090 (24GB) * 2
Environment: python 3.9 + Torch 2.1.2 + DDP single-machine multi-GPU training
```

* Ubuntu == 20.04
* Python == 3.9
* Pytorch == 2.1.2
@@ -182,17 +201,18 @@ streamlit run fast_inference.py
* 2.1 Download the dataset from [数据集下载地址](#数据集下载地址) and place it in the `./dataset` directory

* 2.2 Run `python data_process.py` to process the datasets, e.g. token-encoding the pretrain data in advance and extracting the QA pairs of the sft dataset into a CSV file (see the sketches after these steps)

* 2.3 Adjust the model parameter configuration in `./model/LMConfig.py`
* 2.4 Run `python 1-pretrain.py` to pretrain
* 2.5 Run `python 3-full_sft.py` to run instruction fine-tuning
> Only the dim, n_layers, and use_moe parameters need adjusting here: `(512+8)` or `(768+16)`, corresponding to `minimind-v1-small` and `minimind-v1` respectively
* 2.4 Run `python 1-pretrain.py` to pretrain, producing `pretrain_*.pth` as the pretraining output weights
* 2.5 Run `python 3-full_sft.py` to run instruction fine-tuning, producing `full_sft_*.pth` as the fine-tuning output weights
* 2.6 Run `python 4-lora_sft.py` for LoRA fine-tuning (optional)
* 2.7 Run `python 5-dpo_train.py` for DPO human-preference alignment (optional)
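
To make steps 2.2 and 2.3 concrete, two minimal sketches follow. First, a hypothetical illustration of extracting SFT QA pairs into a CSV file; the file paths and JSON field names are assumptions, not the actual `data_process.py`:

```python
# Hypothetical sketch of step 2.2 (NOT the actual data_process.py):
# pull question/answer pairs out of an SFT jsonl file into a CSV.
import csv
import json

with open("dataset/sft_data.jsonl", "r", encoding="utf-8") as fin, \
     open("dataset/sft_data.csv", "w", encoding="utf-8", newline="") as fout:
    writer = csv.writer(fout)
    writer.writerow(["q", "a"])
    for line in fin:
        sample = json.loads(line)
        writer.writerow([sample["q"], sample["a"]])  # assumed field names
```

Second, a sketch of the configuration tweak in step 2.3; only `dim`, `n_layers`, and `use_moe` are named in this README, and the exact `LMConfig` constructor signature is an assumption:

```python
# Hypothetical sketch of step 2.3 -- adjust ./model/LMConfig.py.
# Only dim/n_layers/use_moe are documented here; other fields may exist.
from model.LMConfig import LMConfig

small_config = LMConfig(dim=512, n_layers=8, use_moe=False)   # minimind-v1-small (512+8)
base_config = LMConfig(dim=768, n_layers=16, use_moe=False)   # minimind-v1 (768+16)
```
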
* 3. Test the model's inference performance
* Make sure the trained weights you want to use are in the `./out/` directory
* You can also download and use the weights I trained from [训练完成的模型权重](#训练完成的模型权重)
* Make sure the trained `*.pth` weight files you want to use are in the `./out/` directory
* You can also download and use my trained `*.pth` weight files from [训练完成的模型权重](#训练完成的模型权重)
```text
out
minimind/out
├── multi_chat
│   ├── full_sft_512.pth
│   ├── full_sft_512_moe.pth
@@ -211,26 +231,26 @@ streamlit run fast_inference.py

🍭 [Tip] Both pretraining (pretrain) and full-parameter fine-tuning (full_sft) support multi-GPU acceleration (a generic DDP skeleton follows the launch commands below)

* Launch single-machine N-GPU training (DDP)
    ```bash
    torchrun --nproc_per_node N 1-pretrain.py
    # and
    torchrun --nproc_per_node N 3-full_sft.py
    ```
* Launch single-machine N-GPU training (DeepSpeed)
    ```bash
    deepspeed --master_port 29500 --num_gpus=N 1-pretrain.py
    # and
    deepspeed --master_port 29500 --num_gpus=N 3-full_sft.py
    ```
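
For context on what these launchers set up, here is a minimal, generic single-machine DDP skeleton; it is an assumption about what `1-pretrain.py`/`3-full_sft.py` do internally, not the project's actual code, and the model is a placeholder:

```python
# Generic DDP skeleton (assumption: the training scripts do roughly this).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 512).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    # ... build a DistributedSampler-backed dataloader and train as usual ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
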
* Record the training process
    ```bash
    torchrun --nproc_per_node N 1-pretrain.py --use_wandb
    # and
    python 1-pretrain.py --use_wandb
    ```
    Adding the `--use_wandb` flag records the training run; once training finishes, you can review it on the wandb site. Modify the `wandb_project` and `wandb_run_name` parameters to set the project and run names (a minimal wandb sketch follows).
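
For reference, `--use_wandb` presumably wraps the standard wandb API roughly as below; the exact wiring inside the training scripts is an assumption, and the project/run names are placeholders for `wandb_project`/`wandb_run_name`:

```python
# Minimal wandb logging sketch (assumption: the scripts use the standard API like this).
import wandb

wandb.init(project="MiniMind", name="pretrain-run-1")  # wandb_project / wandb_run_name
for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder metric
    wandb.log({"loss": loss, "step": step})
wandb.finish()
```
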
# 📌 Data sources

@@ -345,13 +365,6 @@ The minimind model versions trained so far are listed in the table below:

# 📌 Experiment

```bash
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
Memory: 128 GB
GPU: NVIDIA GeForce RTX 3090 (24GB) * 2
Environment: python 3.9 + Torch 2.1.2 + DDP multi-GPU training
```

| Model Name        | params | len_vocab | batch_size | pretrain_time     | sft_single_time   | sft_multi_time      |
|-------------------|--------|-----------|------------|-------------------|-------------------|---------------------|
| minimind-v1-small | 26M    | 6400      | 64         | ≈2 hour (1 epoch) | ≈2 hour (1 epoch) | ≈0.5 hour (1 epoch) |

29 README_en.md

@@ -31,8 +31,10 @@
  inference and even training on CPUs.
* **MiniMind** is an improvement on the DeepSeek-V2 and Llama3 architectures. The project includes all stages of data
  processing, pretraining, SFT, and DPO, and features a Mixture of Experts (MoE) model.
* This project is not only an open-source initiative but also a beginner's tutorial for LLMs, and serves as a nascent
  open-source model with the hope of inspiring further development.
* This is not only the implementation of an open-source model, but also a tutorial for getting started with large language models (LLMs).
* We hope that this project serves as a stepping stone for researchers and developers, providing an introductory example to help them quickly get started and foster more exploration and innovation in the LLM field.

> To avoid any misunderstanding, "as fast as 3 hours" means you need a machine with higher hardware specifications than the author's setup; detailed specifications are provided below.

---

@@ -84,6 +86,15 @@ We hope this open-source project helps LLM beginners get started quickly!

### 👉**Recent Updates**

<details close>
<summary> <b>2024-10-05 (newest 🎉)</b> </summary>

- Added visual capabilities to MiniMind-V(ision)

- Check out the twin project [minimind-v](https://github.com/jingyaogong/minimind-v) for more details!

</details>

<details close>
<summary> <b>2024-09-27</b> </summary>

@@ -127,6 +138,14 @@ We hope this open-source project helps LLM beginners get started quickly!

These are my personal software and hardware environment configurations. Please adjust according to your own setup:

```bash
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
Memory: 128 GB
GPU: NVIDIA GeForce RTX 3090 (24GB) * 2
Environment: python 3.9 + Torch 2.1.2 + DDP multi-GPU training
```

* Ubuntu == 20.04
* Python == 3.9
* Pytorch == 2.1.2
@@ -380,12 +399,6 @@ shown in the table below:

# 📌 Experiment

```bash
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
Memory: 128 GB
GPU: NVIDIA GeForce RTX 3090 (24GB) * 2
Environment: python 3.9 + Torch 2.1.2 + DDP multi-GPU training
```

| Model Name | params | len_vocab | batch_size | pretrain_time | sft_single_time | sft_multi_time |
|-------------------|--------|-----------|------------|-------------------|-------------------|---------------------|