update readme

gongjy 2024-10-05 22:59:00 +08:00
parent eb875da306
commit e4b8789d8c
2 changed files with 84 additions and 65 deletions


@@ -188,14 +188,16 @@ streamlit run fast_inference.py

# 📌 Quick Start

* 0. Clone the project code
```bash
git clone https://github.com/jingyaogong/minimind.git && cd minimind
```
* 1. Install the environment
```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```

* 2. If you need to train the model yourself
* 2.1 Download the [dataset download link](#数据集下载地址) and place it in the `./dataset` directory
@@ -231,26 +233,27 @@ streamlit run fast_inference.py

🍭 [Tip] Pretraining (pretrain) and full-parameter fine-tuning (full_sft) both support multi-GPU acceleration

* Launch training on a single machine with N GPUs (DDP)
```bash
torchrun --nproc_per_node N 1-pretrain.py
# and
torchrun --nproc_per_node N 3-full_sft.py
```
* Launch training on a single machine with N GPUs (DeepSpeed)
```bash
deepspeed --master_port 29500 --num_gpus=N 1-pretrain.py
# and
deepspeed --master_port 29500 --num_gpus=N 3-full_sft.py
```
* Logging the training process
```bash
torchrun --nproc_per_node N 1-pretrain.py --use_wandb
# and
python 1-pretrain.py --use_wandb
```
Adding the `--use_wandb` parameter records the training run; after training completes, you can review it on the wandb website. You can specify the project name and run name by modifying the `wandb_project` and `wandb_run_name` parameters.
# 📌 Data sources


@@ -31,10 +31,13 @@

inference and even training on CPUs.
* **MiniMind** is an improvement on the DeepSeek-V2 and Llama3 architectures. The project includes all stages of data processing, pretraining, SFT, and DPO, and features a Mixture of Experts (MoE) model.
* This is not only the implementation of an open-source model, but also a tutorial for getting started with large language models (LLMs).
* We hope that this project serves as a stepping stone for researchers and developers, providing an introductory example to help them quickly get started and foster more exploration and innovation in the LLM field.

> To avoid any misunderstanding, "fastest 3 hours" refers to the requirement of using hardware with higher specifications than the author's setup. Detailed specifications will be provided below.

---
@@ -77,7 +80,8 @@ The project includes:

- Public MiniMind model code (including Dense and MoE models), code for Pretrain, SFT instruction fine-tuning, LoRA fine-tuning, and DPO preference optimization, along with datasets and sources.
- Compatibility with popular frameworks such as `transformers`, `accelerate`, `trl`, and `peft`.
- Training support for single-GPU and multi-GPU setups (DDP, DeepSpeed), with wandb used to visualize the training process. Training can be stopped and resumed at any point.
- Code for testing the model on the Ceval dataset.
- Implementation of a basic chat interface compatible with OpenAI's API, facilitating integration into third-party Chat UIs (such as FastGPT, Open-WebUI, etc.).
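As a rough illustration of what that OpenAI-API compatibility means, a client can send a standard chat-completions request to a locally served MiniMind model. The sketch below assumes an OpenAI-compatible server is already running at `http://localhost:8000/v1` and exposes a model named `minimind`; both the port and the model name are placeholders, not values defined by this project.

```bash
# Minimal sketch of an OpenAI-style chat request against a locally served model.
# Assumptions: the server listens on http://localhost:8000/v1 and exposes a
# model called "minimind"; adjust both to match your actual deployment.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "minimind",
        "messages": [{"role": "user", "content": "Hello, who are you?"}]
      }'
```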
@@ -98,7 +102,8 @@ We hope this open-source project helps LLM beginners get started quickly!

<details close>
<summary> <b>2024-09-27</b> </summary>

- 👉 Updated the preprocessing method for the pretrain dataset on 09-27 to ensure text integrity, opting to abandon preprocessing into the .bin training format (slightly sacrificing training speed).
- The current filename for the pretrain data after preprocessing is: pretrain_data.csv.
@@ -138,7 +143,6 @@ We hope this open-source project helps LLM beginners get started quickly!

These are my personal software and hardware environment configurations. Please adjust according to your own setup:

```bash
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
Memory: 128 GB
@@ -197,22 +201,19 @@ The project has been deployed to ModelScope makerspace, where you can experience

# 📌 Quick Start

* 0. Clone the project code
```bash
git clone https://github.com/jingyaogong/minimind.git && cd minimind
```
* 1. Install the required dependencies
```bash
pip install -r requirements.txt
```
* 2. If you need to train the model yourself
* 2.1 Download the [dataset download link](#dataset-download-links) and place it in the `./dataset` directory.
@@ -225,8 +226,7 @@ git clone https://github.com/jingyaogong/minimind.git

* 2.6 Perform LoRA fine-tuning (optional) with `python 4-lora_sft.py`.
* 2.7 Execute DPO human preference reinforcement learning alignment (optional) with `python 5-dpo_train.py`.
* 3. Test model inference performance
* Ensure that the required trained parameter weights are located in the `./out/` directory.
* You can also directly download and use the trained model weights
@@ -270,7 +270,9 @@ git clone https://github.com/jingyaogong/minimind.git

# and
python 1-pretrain.py --use_wandb
```

By adding the `--use_wandb` parameter, you can record the training process. After training is complete, you can view the training process on the wandb website. You can specify the project name and run name by modifying the `wandb_project` and `wandb_run_name` parameters.
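For example, a launch command might look like the sketch below. The `--wandb_project` and `--wandb_run_name` flag names are assumptions inferred from the parameter names mentioned above; check the script's own argument parser (e.g. its `--help` output) before relying on them.

```bash
# Minimal sketch of a logged multi-GPU pretraining run.
# Assumed flag names: --wandb_project / --wandb_run_name (derived from the
# parameter names above -- verify against the script's --help before use).
torchrun --nproc_per_node 2 1-pretrain.py \
    --use_wandb \
    --wandb_project "MiniMind-Pretrain" \
    --wandb_run_name "pretrain-run-1"
```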
# 📌 Data sources
@@ -399,7 +401,6 @@ shown in the table below:

# 📌 Experiment

| Model Name        | params | len_vocab | batch_size | pretrain_time     | sft_single_time   | sft_multi_time      |
|-------------------|--------|-----------|------------|-------------------|-------------------|---------------------|
| minimind-v1-small | 26M    | 6400      | 64         | ≈2 hour (1 epoch) | ≈2 hour (1 epoch) | ≈0.5 hour (1 epoch) |
@@ -505,7 +506,7 @@ better with the scaling law for small models.

[baidu](https://pan.baidu.com/s/1KUfSzEkSXYbCCBj0Pw-9fA?pwd=6666)

| Model Name        | params | Config                      | pretrain_model                                                  | single_sft_model                                                | multi_sft_model                                                 |
|-------------------|--------|-----------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|
| minimind-v1-small | 26M    | d_model=512<br/>n_layers=8  | [URL](https://pan.baidu.com/s/1wP_cAIc8cgaJ6CxUmR9ECQ?pwd=6666) | [URL](https://pan.baidu.com/s/1_COe0FQRDmeapSsvArahCA?pwd=6666) | [URL](https://pan.baidu.com/s/1GsGsWSL0Dckl0YPRXiBIFQ?pwd=6666) |
| minimind-v1-moe   | 4×26M  | d_model=512<br/>n_layers=8  | [URL](https://pan.baidu.com/s/1IZdkzPRhbZ_bSsRL8vInjg?pwd=6666) | [URL](https://pan.baidu.com/s/1tqB-GMvuiGQBvEl-yZ-oBw?pwd=6666) | [URL](https://pan.baidu.com/s/1GHJ2T4904EcT1u8l1rVqtg?pwd=6666) |
| minimind-v1       | 108M   | d_model=768<br/>n_layers=16 | [URL](https://pan.baidu.com/s/1B60jYo4T8OmJI0ooqsixaA?pwd=6666) | [URL](https://pan.baidu.com/s/1p713loS7EfwHQf3G9eYI3Q?pwd=6666) | [URL](https://pan.baidu.com/s/12iHGpAs6R0kqsOnGtgK6vQ?pwd=6666) |
@@ -618,14 +619,26 @@ better with the scaling law for small models.

## 👉 Summary of Effects

* The ranking of the minimind series (ABC) aligns with intuition: minimind-v1(0.1B) scores the highest, and its responses to common-sense questions are mostly error-free and free of hallucinations.
* Surprisingly, minimind-v1-small(0.02B), with only 26M parameters, can perform nearly as well as minimind-v1(0.1B).
* minimind-v1(0.1B) underwent less than 2 epochs of SFT (Supervised Fine-Tuning) because it was killed prematurely to free up resources for smaller models. Despite not being fully trained, it still achieved the best performance, demonstrating that larger models generally outperform smaller ones.
* minimind-v1-moe(0.1B) performed only slightly better than minimind-v1-small(0.02B), also due to early termination to free up resources for other training. However, the MoE (Mixture of Experts) model, with its sparse multi-expert design, requires more training epochs to fully activate and train the experts in every FFN (Feed-Forward Network) layer; with the current 3 epochs, training is not yet sufficient.
  Early experiments with minimind on the Yi-Tokenizer showed that a fully trained MoE version could visibly outperform the dense small models. This will likely have to wait for future training and the v2/v3 updates, when more server resources are available.
* The responses from Model E look quite good to the naked eye, although occasional hallucinations and fabrications occur. However, both GPT-4o's and Deepseek's evaluations consistently noted that it "provides overly verbose and repetitive information, and contains hallucinations."
  This evaluation seems somewhat strict, as even a small number of hallucinated words in a 100-word response can easily result in a low score. Given that Model E was pre-trained on longer texts and a larger dataset, its responses appear more comprehensive. At similar model sizes, both the quantity and the quality of the data are crucial.

> 🙋‍♂️ Personal Subjective Evaluation: E>C>B≈A>D
@@ -759,16 +772,22 @@ your model with third-party UIs, such as fastgpt, OpenWebUI, etc.

> [!TIP]
> If you find `MiniMind` helpful, please give us a ⭐ on GitHub.<br/>
> Given the length and the limitations of our expertise, there may be errors. We welcome discussions and corrections in the Issues section.<br/>
> Your support is the driving force behind our continuous improvement of the project!

> [!NOTE]
> An individual's resources, energy, and time are limited, so we encourage everyone to participate and contribute collectively. If you have trained model weights, you are welcome to share them in the Discussions or Issues sections.<br/>
> These models can be new versions of MiniMind tailored for specific downstream tasks or vertical domains (such as sentiment recognition, healthcare, psychology, finance, legal Q&A, etc.).<br/>
> They can also be new versions of MiniMind models that have undergone extended training, exploring longer text sequences, larger volumes (such as 0.1B+), or more extensive datasets.<br/>
> Each contribution is unique, and all attempts are valuable and encouraged.<br/>
> Any shared contributions will be promptly recognized and compiled in the acknowledgments list. Thank you once again for everyone's support!
## 🤝[Contributors](https://github.com/jingyaogong/minimind/graphs/contributors)
@@ -817,7 +836,6 @@ your model with third-party UIs, such as fastgpt, OpenWebUI, etc.

</details>

## 🫶Supporter

<a href="https://github.com/jingyaogong/minimind/stargazers">
@@ -842,8 +860,6 @@ your model with third-party UIs, such as fastgpt, OpenWebUI, etc.

<img alt="Star History Chart" src="https://api.star-history.com/svg?repos=jingyaogong/minimind&type=Date"/>
</picture>

# License

This repository is licensed under the [Apache-2.0 License](LICENSE).