update rlhf

parent 02adb7bc0d
commit 11d5cadb9c

README.md (77 changes)
@@ -27,7 +27,9 @@

* This open-source project aims to train a tiny 26.88M-parameter language model, **MiniMind**, completely from scratch, in as little as 3 hours.

* **MiniMind** is extremely lightweight; the smallest version is roughly $\frac{1}{7000}$ the size of GPT-3, so even an ordinary personal GPU can run inference, and even training, quickly.

* **MiniMind** releases the full-stage code for a minimal LLM architecture: dataset cleaning and preprocessing, supervised pretraining (Pretrain), supervised instruction fine-tuning (SFT), low-rank adaptation (LoRA) fine-tuning, and reward-free direct preference optimization (DPO). It also includes a sparse shared Mixture-of-Experts (MoE) extension and a vision multimodal VLM extension: [MiniMind-V](https://github.com/jingyaogong/minimind-v).

* This is not only an implementation of an open-source model, but also a tutorial for getting started with large language models (LLMs).

* We hope this project can serve as an introductory example for researchers, helping everyone get started quickly and inspiring more exploration and innovation in the LLM field.
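As a rough sanity check of the $\frac{1}{7000}$ figure (assuming the commonly cited 175B-parameter GPT-3 as the reference point, which is not stated explicitly in this README):

$$
\frac{175\times 10^{9}\ \text{(GPT-3)}}{26.88\times 10^{6}\ \text{(MiniMind-small)}} \approx 6.5\times 10^{3},
$$

i.e. on the order of the quoted 1/7000.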
@@ -42,7 +44,7 @@


[ModelScope Online Testing](https://www.modelscope.cn/studios/gongjy/minimind) | [Bilibili Video Link](https://www.bilibili.com/video/BV12dHPeqE72/?share_source=copy_web&vd_source=670c2504f88726f8cf4a21ef6147c0e8)

---
@@ -180,7 +182,6 @@ python 2-eval.py
streamlit run fast_inference.py
```


# 📌 Quick Start Train

* 0. Clone the project code

@@ -193,7 +194,7 @@ streamlit run fast_inference.py

```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```


```text
# Check whether torch can use CUDA
import torch
@@ -407,15 +408,20 @@ The minimind model versions trained so far are shown in the table below:

- We only need the dataset's history_chat field (the conversation history) and the history_chat_response field (the corresponding replies).
- Build a new chat template of the form [question -> answer, question -> answer, question ->] and fine-tune on this dataset (a minimal sketch of this template construction follows this step).
- The resulting model can then not only answer the current question, but also hold a coherent conversation based on the dialogue history.
- This step is **not mandatory**: small models have weak long-context conversational ability, and forcing alignment to a multi-turn Q&A template costs some single-turn SFT quality.
> The learning rate uses a dynamic schedule from 1e-5 down to 1e-6, and fine-tuning runs for 5 epochs.
```bash
# In 3-full_sft.py, set the dataset to sft_data.csv
torchrun --nproc_per_node 2 3-full_sft.py
```
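A minimal sketch of how such a [question -> answer, question -> answer, question ->] sample could be assembled from the history fields. The `<s>user ... </s>`-style tags and the helper name are illustrative assumptions, not MiniMind's actual chat template.

```python
# Sketch: build a multi-turn SFT sample from chat history.
# Assumes history_chat / history_chat_response are parallel lists of strings;
# the tag strings below are placeholders, not the project's real template.
from typing import List

def build_multiturn_sample(history_chat: List[str],
                           history_chat_response: List[str]) -> str:
    """Concatenate past (question -> answer) pairs and end with the next question,
    so the model learns to produce the final answer given the whole history."""
    parts = []
    # all turns except the last form the conversational context
    for q, a in zip(history_chat[:-1], history_chat_response[:-1]):
        parts.append(f"<s>user\n{q}</s> <s>assistant\n{a}</s>")
    # the last question is left open; its response becomes the training target
    parts.append(f"<s>user\n{history_chat[-1]}</s> <s>assistant\n")
    prompt = " ".join(parts)
    target = history_chat_response[-1] + "</s>"
    return prompt + target

# Example
sample = build_multiturn_sample(
    ["Hello", "What can you do?"],
    ["Hi, how can I help?", "I can chat and answer questions."],
)
print(sample)
```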
4. **Reinforcement Learning from Human Feedback (RLHF) - Direct Preference Optimization (DPO)**:
- In the previous training stages, GPT has acquired basic conversational ability, but that ability is based entirely on next-word prediction and lacks the incentive of positive and negative examples.
- GPT does not yet know which answers are good and which are bad. We want it to better match human preferences and give more satisfying answers.
- The process is like sending GPT to on-the-job training: it takes outstanding employees as positive examples and underperforming employees as negative examples, and learns how to serve customers better.
- Within the RLHF family, DPO differs from RL algorithms such as PPO (Proximal Policy Optimization), which need a reward model and a value model;
- DPO derives an explicit solution of the PPO reward model, replacing the online reward model with offline data, so the reference model's outputs can be saved in advance.
- DPO performance is almost unchanged while only two models (actor and ref) need to run, which greatly reduces GPU memory overhead and improves training stability.
- Likewise, the RL step for an LLM is **not mandatory**; it has both advantages and drawbacks (a minimal DPO-loss sketch follows this step).
> Huozi preference triples (q, chose, reject) dataset; learning rate 1e-5, half precision fp16, 1 epoch, about 1 hour.
```bash
python 5-dpo_train.py
```
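A minimal sketch of the DPO objective described above, assuming the per-sequence log-probabilities of the chosen and rejected answers under the actor and the frozen reference model are already available; this illustrates the loss, it is not the code in 5-dpo_train.py.

```python
# Sketch of the DPO loss: only an actor and a frozen reference model are involved,
# no reward or value model. Inputs are summed log-probs of whole answer sequences.
import torch
import torch.nn.functional as F

def dpo_loss(actor_logp_chosen: torch.Tensor,    # log pi_theta(y_chosen | q)
             actor_logp_rejected: torch.Tensor,  # log pi_theta(y_rejected | q)
             ref_logp_chosen: torch.Tensor,      # log pi_ref(y_chosen | q), precomputed offline
             ref_logp_rejected: torch.Tensor,    # log pi_ref(y_rejected | q), precomputed offline
             beta: float = 0.1) -> torch.Tensor:
    # implicit "rewards" are the log-ratios of the actor against the reference policy
    chosen_reward = beta * (actor_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (actor_logp_rejected - ref_logp_rejected)
    # logistic loss on the margin: push the chosen answer above the rejected one
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Example with a dummy batch of 4 preference pairs (sequence-level log-probs)
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```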
@@ -454,18 +460,61 @@ MobileLLM argues that depth matters more than width: a "deep and narrow", "slim

[Baidu Netdisk](https://pan.baidu.com/s/1KUfSzEkSXYbCCBj0Pw-9fA?pwd=6666)

| Model Name        | params | Config                      | pretrain_model                                                   | single_sft_model                                                 | multi_sft_model                                                  | rl_model                                                         |
|-------------------|--------|-----------------------------|------------------------------------------------------------------|------------------------------------------------------------------|------------------------------------------------------------------|------------------------------------------------------------------|
| minimind-v1-small | 26M    | d_model=512<br/>n_layers=8  | [Link](https://pan.baidu.com/s/1wP_cAIc8cgaJ6CxUmR9ECQ?pwd=6666) | [Link](https://pan.baidu.com/s/1_COe0FQRDmeapSsvArahCA?pwd=6666) | [Link](https://pan.baidu.com/s/1GsGsWSL0Dckl0YPRXiBIFQ?pwd=6666) | [Link](https://pan.baidu.com/s/1C_dOCzNxr_XF3Qk3pkdrwg?pwd=6666) |
| minimind-v1-moe   | 4×26M  | d_model=512<br/>n_layers=8  | [Link](https://pan.baidu.com/s/1IZdkzPRhbZ_bSsRL8vInjg?pwd=6666) | [Link](https://pan.baidu.com/s/1tqB-GMvuiGQBvEl-yZ-oBw?pwd=6666) | [Link](https://pan.baidu.com/s/1GHJ2T4904EcT1u8l1rVqtg?pwd=6666) | -                                                                |
| minimind-v1       | 108M   | d_model=768<br/>n_layers=16 | [Link](https://pan.baidu.com/s/1B60jYo4T8OmJI0ooqsixaA?pwd=6666) | [Link](https://pan.baidu.com/s/1p713loS7EfwHQf3G9eYI3Q?pwd=6666) | [Link](https://pan.baidu.com/s/12iHGpAs6R0kqsOnGtgK6vQ?pwd=6666) | [Link](https://pan.baidu.com/s/1vmUrir-UuucqBftqNPI4ng?pwd=6666) |
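As a rough cross-check of the params column against the Config column, a back-of-the-envelope estimate for a plain dense decoder-only transformer (ignoring embeddings, the exact FFN ratio, and the MoE replication, so only an approximation of MiniMind's actual architecture):

```python
# Rough transformer parameter estimate: ~12 * n_layers * d_model^2
# (attention ~4*d^2 + a 4x-hidden FFN ~8*d^2 per layer), embeddings ignored.
def approx_params(d_model: int, n_layers: int) -> float:
    return 12 * n_layers * d_model ** 2

print(f"{approx_params(512, 8) / 1e6:.1f}M")   # ~25.2M, close to the listed 26M
print(f"{approx_params(768, 16) / 1e6:.1f}M")  # ~113.2M, close to the listed 108M
```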

---

# 📌 Eval

## ① RLHF Section

> [!TIP]
> The test compares the minimind model after "single-turn dialogue full_sft" with the model after "DPO reinforcement learning alignment".

Model files: [Baidu Netdisk](https://pan.baidu.com/s/1KUfSzEkSXYbCCBj0Pw-9fA?pwd=6666), where `rl_<dim>.pth` is the minimind weight file after "DPO reinforcement learning alignment".

```text
[Q]: What is your name?
[ft_768]: I am an artificial intelligence language model, capable of answering your questions and providing assistance. Do you need any help?
[rl_768]: I am a natural language processing system based on artificial intelligence technology, primarily used for processing and generating human language tasks. I can assist users in text processing, speech recognition, text generation, and various other tasks. I can also help with intelligent chat, translation, and many other applications. If you need more information about this project, please let me know.
[ft_512]: I am glad to help you. Do you have any other questions you need me to answer?
[rl_512]: I am a chatbot based on natural language processing technology, capable of understanding and responding to user questions and instructions. I am trained to interact with humans and answer various questions and tasks. Is there anything you need?

[Q]: Who are you?
[ft_768]: I am an artificial intelligence language model, without personal identity or status. I only generate responses based on the input instructions and text information.
[rl_768]: I am a computer program, without personal identity or status. I am designed to answer various questions and provide assistance, including answering questions, offering advice, conducting experiments, providing information, and more. I am widely used in various fields such as natural language processing, computer vision, and artificial intelligence. If you need more help or information, please feel free to contact me. I will do my best to assist you in your work and life.
[ft_512]: I am a large language model, capable of answering a wide variety of questions, including those related to science, history, culture, and technology. Do you have any questions you need me to answer?
[rl_512]: I am a computer program, without a specific background or identity. If you can tell me the specific information you want to know, I will do my best to answer your questions.

[Q]: What are some good universities in China?
[ft_768]: China has many renowned universities, including Peking University, Shanghai Jiao Tong University, Fuxing Chinese Medicine College, Zhejiang University, and more. These universities have their own iconic buildings and research directions in different fields, and they enjoy a high reputation among China's top educational resources.
[rl_768]: There are many universities in China, and the choice depends on your needs. If you need more detailed information, it is recommended to consult the university rankings and policies in your city to choose a suitable university. You can also search for relevant information online or consult some university-related institutions such as Renmin University of China, Renmin University of China, etc., to obtain more detailed and professional guidance and advice.
[ft_512]: China has many excellent universities, such as MIT, Peking University, Shanghai University, and more. These universities have made significant contributions to the development and achievements of China's education.
[rl_512]: The Chinese Ministry of Education, Chinese universities, mainland Chinese universities, and mainland Chinese universities all have excellent educational systems, but the educational levels and professional requirements may vary. The Chinese Ministry of Education has different academic requirements, teaching methods, and faculty strengths, and the academic requirements of universities also vary. The academic requirements for mainland Chinese students also differ, and you need to choose based on your actual situation. If you have more specific needs for the professional requirements of mainland Chinese students, it is recommended to compare and communicate through the internet and various domestic and international institutions. You can also seek help and advice from people around you, such as friends, teachers, and parents. I hope the above answer is helpful to you.

[Q]: Do you know what the speed of light is?
[ft_768]: The speed of light is a physical constant, representing the speed at which light travels in a vacuum, approximately 299,792,458 meters per second. The speed of light is one of the most fundamental physical constants in physics and is also an important basis for relativity.
[rl_768]: Of course, I know. The speed of light is the speed at which light travels in a vacuum, usually denoted by c. It is a very important theory in cosmology. In physics, the speed of light is considered one of the most important speeds in the universe. It represents the speed at which light travels in a vacuum, which is the basis for many important formulas and equations.
[ft_512]: The speed of light is the speed at which light travels in a vacuum, approximately 299,792,458 meters per second. It is one of the most mysterious and important physical concepts in the universe.
[rl_512]: The speed of light is the speed at which light travels in a vacuum, an irreversible speed in nature. The source of light consists of atoms, molecules, solids, and light, with a speed of approximately one light year per second. This number is calculated based on relativity, quantum mechanics, and gravitational theory. The speed of light is a relative concept, representing our understanding and interpretation of any event in nature.
```

### 👉Effect Summary

* Approximately 100,000 RLHF data samples were used; the full_sft model performs better in terms of conciseness and information accuracy; the rl model provides more background information in its responses, but the accuracy of the information needs improvement.
* Overall, the model after RLHF tends to learn to say more polite but useless "fluff" to please the "conversation" itself, while slightly sacrificing information accuracy.
* There is no such thing as a free lunch; the quality of the RLHF dataset still needs to be improved, and some unavoidable loss of model capability (of varying severity) has to be accepted.
* The difference between DPO and online PPO is that reject and chosen are prepared offline, which inevitably creates a large distribution difference with the output of the minimind model itself.
* This is similar to the DPO algorithm making the model watch a "replay" of the table tennis world champion's gameplay for reinforcement learning, rather than having a reward model act as a "coach" to correct its gameplay in real time, like PPO.

## ② Instruct Fine-Tuning Section

> [!TIP]
> The following tests were completed on 2024-09-17; new models released after this date will not be added to the tests unless specifically needed.
> The tests use the single-turn full_sft minimind models (without multi-turn fine-tuning or RL fine-tuning).

[A] [minimind-v1-small(0.02B)](https://pan.baidu.com/s/1_COe0FQRDmeapSsvArahCA?pwd=6666)<br/>
[B] [minimind-v1-moe(0.1B)](https://pan.baidu.com/s/1tqB-GMvuiGQBvEl-yZ-oBw?pwd=6666)<br/>
@@ -565,7 +614,7 @@ MobileLLM argues that depth matters more than width: a "deep and narrow", "slim

---

### 👉Effect Summary

* The ranking of the minimind series (A, B, C) matches intuition: minimind-v1 (0.1B) scores highest, and its answers to common-sense questions are mostly free of errors and hallucinations.
* Surprisingly, minimind-v1-small (0.02B), with only 26M parameters, comes close to the performance of minimind-v1 (0.1B).

README_en.md (103 changes)
@@ -26,17 +26,26 @@

</div>

* This open-source project aims to train a tiny language model called **MiniMind** from scratch in just 3 hours, with a model size of only 26.88M.

* **MiniMind** is extremely lightweight, with the smallest version being approximately $\frac{1}{7000}$ the size of GPT3, making it possible for even an ordinary personal GPU to perform quick inference and even training.

* **MiniMind** provides the full-stage code for a simplified large model structure, dataset cleaning and preprocessing, supervised pretraining, supervised instruction fine-tuning (SFT), low-rank adaptation (LoRA) fine-tuning, and direct preference alignment with reinforcement learning without rewards (DPO). It also includes code for expanding to sparse models with mixed experts (MoE) and multi-modal vision language models (VLM): [MiniMind-V](https://github.com/jingyaogong/minimind-v).

* This is not just an implementation of an open-source model but also a tutorial for getting started with large language models (LLM).

* We hope this project will serve as an introductory example for researchers, helping them quickly get started and inspiring more exploration and innovation in the LLM field.

> To avoid misinterpretation, "fastest 3 hours" means you need a machine with hardware configuration superior to mine. Detailed specifications will be provided below.

---
@@ -44,13 +53,12 @@


[ModelScope Online Testing](https://www.modelscope.cn/studios/gongjy/minimind) | [Bilibili Video Link](https://www.bilibili.com/video/BV12dHPeqE72/?share_source=copy_web&vd_source=670c2504f88726f8cf4a21ef6147c0e8)

---

</div>


# 📌 Introduction

In the field of large language models (LLMs) such as GPT, LLaMA, GLM, etc., while their performance is impressive, the
@@ -197,13 +205,13 @@ streamlit run fast_inference.py
git clone https://github.com/jingyaogong/minimind.git
cd minimind
```


* 1. Install the required dependencies

```bash
pip install -r requirements.txt
```


```text
# Test if torch can use CUDA
import torch

@@ -211,8 +219,9 @@ streamlit run fast_inference.py
```

> If it is not available, please go to [torch_stable](https://download.pytorch.org/whl/torch_stable.html)
> to download the whl file for installation. Refer to [this link](https://blog.csdn.net/weixin_45456738/article/details/141029610?ops_request_misc=&request_id=&biz_id=102&utm_term=安装torch&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduweb~default-2-141029610.nonecase&spm=1018.2226.3001.4187)

* 2. If you need to train the model yourself

* 2.1 Download the [dataset download link](#dataset-download-links) and place it in the ./dataset directory.
@@ -458,10 +467,18 @@ shown in the table below:
```

4. **Direct Preference Optimization (DPO)**:
- In previous training sessions, GPT has already acquired basic conversational abilities, but these abilities are entirely based on word-by-word concatenation, lacking the motivation of positive and negative examples.
- GPT is still unaware of what constitutes a good response and what constitutes a poor one. We hope it can align more with human preferences and provide more satisfying responses.
- This process is akin to training GPT in a workplace setting, learning from the examples of outstanding employees and the mistakes of underperforming ones, to better serve customers.
- In the RLHF series, unlike PPO (Proximal Policy Optimization), which requires reward models and value models,
- DPO derives an explicit solution for the PPO reward model, replacing the online reward model with offline data, where ref outputs can be saved in advance (see the sketch after this list).
- DPO maintains nearly the same performance, requiring only the actor and ref models to run, significantly reducing memory overhead and increasing training stability.
- Similarly, the RL steps for LLM are **not mandatory**, with both advantages and disadvantages.
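A minimal sketch of the "ref outputs can be saved in advance" point: because the preference data is offline, the frozen reference model's per-sequence log-probabilities can be computed once and cached, so only the actor needs forward and backward passes during training. The names below (ref_model, batches, the HF-style `.logits` output) are illustrative assumptions, not the actual 5-dpo_train.py code, and prompt/padding masking is omitted for brevity.

```python
# Sketch: precompute and cache reference log-probs for offline (q, chose, reject) data.
import torch

@torch.no_grad()
def sequence_logprob(model, input_ids: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sum of token log-probs of `labels` under `model`, one value per sequence.
    Assumes an HF-style output object exposing `.logits`."""
    logits = model(input_ids).logits[:, :-1, :]           # position t predicts token t+1
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logp.sum(dim=-1)

def cache_ref_logprobs(ref_model, batches):
    """Run the frozen reference model once over the offline preference pairs."""
    ref_model.eval()
    cache = []
    for chosen_ids, rejected_ids in batches:
        cache.append((
            sequence_logprob(ref_model, chosen_ids, chosen_ids),
            sequence_logprob(ref_model, rejected_ids, rejected_ids),
        ))
    # reuse these values in every DPO epoch; the reference model can then be unloaded
    return cache
```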
> For the Huozi trio (q, chose, reject) dataset, the learning rate is set to 1e-5, with half-precision fp16, 1 epoch, and it takes about 1 hour.
```bash

@@ -505,15 +522,57 @@ better with the scaling law for small models.

[baidu](https://pan.baidu.com/s/1KUfSzEkSXYbCCBj0Pw-9fA?pwd=6666)

| Model Name        | params | Config                      | pretrain_model                                                  | single_sft_model                                                | multi_sft_model                                                 | rl_model                                                        |
|-------------------|--------|-----------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|
| minimind-v1-small | 26M    | d_model=512<br/>n_layers=8  | [URL](https://pan.baidu.com/s/1wP_cAIc8cgaJ6CxUmR9ECQ?pwd=6666) | [URL](https://pan.baidu.com/s/1_COe0FQRDmeapSsvArahCA?pwd=6666) | [URL](https://pan.baidu.com/s/1GsGsWSL0Dckl0YPRXiBIFQ?pwd=6666) | [URL](https://pan.baidu.com/s/1C_dOCzNxr_XF3Qk3pkdrwg?pwd=6666) |
| minimind-v1-moe   | 4×26M  | d_model=512<br/>n_layers=8  | [URL](https://pan.baidu.com/s/1IZdkzPRhbZ_bSsRL8vInjg?pwd=6666) | [URL](https://pan.baidu.com/s/1tqB-GMvuiGQBvEl-yZ-oBw?pwd=6666) | [URL](https://pan.baidu.com/s/1GHJ2T4904EcT1u8l1rVqtg?pwd=6666) | -                                                               |
| minimind-v1       | 108M   | d_model=768<br/>n_layers=16 | [URL](https://pan.baidu.com/s/1B60jYo4T8OmJI0ooqsixaA?pwd=6666) | [URL](https://pan.baidu.com/s/1p713loS7EfwHQf3G9eYI3Q?pwd=6666) | [URL](https://pan.baidu.com/s/12iHGpAs6R0kqsOnGtgK6vQ?pwd=6666) | [URL](https://pan.baidu.com/s/1vmUrir-UuucqBftqNPI4ng?pwd=6666) |

---

# 📌 Evaluation

### ① RLHF Section

> [!TIP]
> Testing based on the comparison of the "single-turn dialogue full_sft" and "DPO reinforcement learning alignment" versions of the minimind model.

Model files are available on [Baidu Netdisk](https://pan.baidu.com/s/1KUfSzEkSXYbCCBj0Pw-9fA?pwd=6666), where `rl_<dim>.pth` represents the weights of the minimind model after "DPO reinforcement learning alignment."

```text
[Q]: What is your name?
[ft_768]: I am an artificial intelligence language model, capable of answering your questions and providing assistance. Do you need any help?
[rl_768]: I am a natural language processing system based on artificial intelligence technology, primarily used for processing and generating human language tasks. I can assist users in text processing, speech recognition, text generation, and various other tasks. I can also help with intelligent chat, translation, and many other applications. If you need more information about this project, please let me know.
[ft_512]: I am glad to help you. Do you have any other questions you need me to answer?
[rl_512]: I am a chatbot based on natural language processing technology, capable of understanding and responding to user questions and instructions. I am trained to interact with humans and answer various questions and tasks. Is there anything you need?

[Q]: Who are you?
[ft_768]: I am an artificial intelligence language model, without personal identity or status. I only generate responses based on the input instructions and text information.
[rl_768]: I am a computer program, without personal identity or status. I am designed to answer various questions and provide assistance, including answering questions, offering advice, conducting experiments, providing information, and more. I am widely used in various fields such as natural language processing, computer vision, and artificial intelligence. If you need more help or information, please feel free to contact me. I will do my best to assist you in your work and life.
[ft_512]: I am a large language model, capable of answering a wide variety of questions, including those related to science, history, culture, and technology. Do you have any questions you need me to answer?
[rl_512]: I am a computer program, without a specific background or identity. If you can tell me the specific information you want to know, I will do my best to answer your questions.

[Q]: What are some good universities in China?
[ft_768]: China has many renowned universities, including Peking University, Shanghai Jiao Tong University, Fuxing Chinese Medicine College, Zhejiang University, and more. These universities have their own iconic buildings and research directions in different fields, and they enjoy a high reputation in China's top educational resources.
[rl_768]: There are many universities in China, and the choice depends on your needs. If you need more detailed information, it is recommended to consult the university rankings and policies in your city to choose a suitable university. You can also search for relevant information online or consult some university-related institutions such as Renmin University of China, Renmin University of China, etc., to obtain more detailed and professional guidance and advice.
[ft_512]: China has many excellent universities, such as MIT, Peking University, Shanghai University, and more. These universities have made significant contributions to the development and achievements of China's education.
[rl_512]: The Chinese Ministry of Education, Chinese universities, mainland Chinese universities, and mainland Chinese universities all have excellent educational systems, but the educational levels and professional requirements may vary. The Chinese Ministry of Education has different academic requirements, teaching methods, and faculty strengths, and the academic requirements of universities also vary. The academic requirements for mainland Chinese students also differ, and you need to choose based on your actual situation. If you have more specific needs for the professional requirements of mainland Chinese students, it is recommended to compare and communicate through the internet, various domestic and international institutions, and organizations. You can also seek help and advice from people around you, such as friends, teachers, and parents. I hope the above answer is helpful to you.

[Q]: Do you know what the speed of light is?
[ft_768]: The speed of light is a physical constant, representing the speed at which light travels in a vacuum, approximately 299,792,458 meters per second. The speed of light is one of the most fundamental physical constants in physics and is also an important basis for relativity.
[rl_768]: Of course, I know. The speed of light is the speed at which light travels in a vacuum, usually denoted by c. It is a very important theory in cosmology. In physics, the speed of light is considered one of the most important speeds in the universe. It represents the speed at which light travels in a vacuum, which is the basis for many important formulas and equations.
[ft_512]: The speed of light is the speed at which light travels in a vacuum, approximately 299,792,458 meters per second. It is one of the most mysterious and important physical concepts in the universe.
[rl_512]: The speed of light is the speed at which light travels in a vacuum, an irreversible speed in nature. The source of light consists of atoms, molecules, solids, and light, with a speed of approximately one light year per second. This number is calculated based on relativity, quantum mechanics, and gravitational theory. The speed of light is a relative concept, representing our understanding and interpretation of any event in nature.
```

### 👉Effect Summary

* Approximately 100,000 RLHF data samples were used; the full_sft model performs better in terms of conciseness and information accuracy; the rl model provides more background information in its responses, but the accuracy of the information needs improvement.
* Overall, the model after RLHF tends to learn: to say more polite but useless "fluff" to please the "conversation" itself, while slightly sacrificing information accuracy.
* There is no such thing as a free lunch; we need to continue to improve the quality of the RLHF dataset, and we must also accept the inevitable loss of model capabilities (with varying degrees of severity).
* The difference between DPO and online PPO is that reject and chosen are prepared offline, which inevitably creates a large distribution difference with the output of the minimind model itself.
* This is similar to the DPO algorithm making the model watch the "replay" of the table tennis world champion's gameplay for reinforcement learning, rather than having the reward model act as a "coach" to correct its gameplay in real-time, like PPO.

## ② Instruct Fine-Tuning Section

> [!TIP]
> The following tests were completed on September 17, 2024. New models released after this date will not be included in