update readme info

gongjy 2024-09-27 16:38:18 +08:00
parent a8ae342775
commit 1cc73836d4
2 changed files with 4 additions and 6 deletions

View File

@@ -80,7 +80,7 @@ https://github.com/user-attachments/assets/88b98128-636e-43bc-a419-b1b1403c2055
<details close>
<summary> <b>2024-09-27</b> </summary>
- 09-27 update: changed the pretrain dataset preprocessing; to preserve text integrity, preprocessing into the .bin training format was abandoned (slightly sacrificing training speed).
- 👉09-27 update: changed the pretrain dataset preprocessing; to preserve text integrity, preprocessing into the .bin training format was abandoned (slightly sacrificing training speed).
- The preprocessed pretrain file is currently named pretrain_data.csv.
@@ -252,8 +252,7 @@ streamlit run fast_inference.py
<tr><td>minimind tokenizer</td><td>6,400</td><td>Custom</td></tr>
</table>
> [!TIP]
> 2024-09-17 update: to avoid ambiguity with past versions and to control model size, all minimind models now use the minimind_tokenizer for tokenization; all mistral_tokenizer versions are deprecated.
> 👉2024-09-17 update: to avoid ambiguity with past versions and to control model size, all minimind models now use the minimind_tokenizer for tokenization; all mistral_tokenizer versions are deprecated.
> Although the minimind_tokenizer has a very small vocabulary and its encoding/decoding efficiency is weaker than Chinese-friendly tokenizers such as qwen2 and glm,
> the minimind models use the self-trained minimind_tokenizer to keep the overall parameters lightweight and to avoid a top-heavy imbalance between the embedding layer and the compute layers, since minimind's vocabulary size is only 6,400.

View File

@@ -87,7 +87,7 @@ We hope this open-source project helps LLM beginners get started quickly!
<details close>
<summary> <b>2024-09-27</b> </summary>
- Updated the preprocessing method for the pretrain dataset on 09-27 to ensure text integrity, opting to abandon preprocessing into the .bin training format (slightly sacrificing training speed).
- 👉Updated the preprocessing method for the pretrain dataset on 09-27 to ensure text integrity, opting to abandon preprocessing into the .bin training format (slightly sacrificing training speed).
- The current filename for the pretrain data after preprocessing is: pretrain_data.csv.
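As a side note on what the .csv format implies for training, the sketch below is a minimal, hypothetical loader (not the repository's actual code) that reads raw text rows from a file like pretrain_data.csv and tokenizes them on the fly; the "text" column name, the tokenizer interface, and the max length are assumptions.

```python
# Hypothetical sketch of loading a raw-text pretrain CSV (not minimind's actual loader).
# Assumes a "text" column and a tokenizer with an .encode() method returning token ids.
import pandas as pd

def iter_pretrain_samples(csv_path, tokenizer, max_length=512):
    df = pd.read_csv(csv_path)               # each row keeps one document intact
    for text in df["text"].astype(str):
        # Tokenizing at load time is slower than reading fixed-size chunks from a
        # pre-tokenized .bin file, but the original text is never split mid-document.
        yield tokenizer.encode(text)[:max_length]
```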
@@ -282,8 +282,7 @@ git clone https://github.com/jingyaogong/minimind.git
<tr><td>minimind tokenizer</td><td>6,400</td><td>Custom</td></tr>
</table>
> [!IMPORTANT]
> Update on 2024-09-17: To avoid ambiguity from previous versions and control the model size, all Minimind models now
> 👉Update on 2024-09-17: To avoid ambiguity from previous versions and control the model size, all Minimind models now
> use the Minimind_tokenizer for tokenization, and all versions of the Mistral_tokenizer have been deprecated.
> Although the Minimind_tokenizer has a small vocabulary and its encoding/decoding efficiency is weaker compared to
> Chinese-friendly tokenizers such as Qwen2 and GLM, the Minimind models keep the self-trained Minimind_tokenizer to stay lightweight overall and avoid a top-heavy imbalance between the embedding layer and the compute layers, since the vocabulary size is only 6,400.
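To make the "avoid a top-heavy embedding layer" point concrete, here is a rough back-of-the-envelope calculation; the hidden size and the larger vocabulary sizes below are illustrative assumptions, not figures from this commit.

```python
# Back-of-the-envelope: embedding parameters grow as vocab_size * hidden_dim,
# so a small vocabulary keeps the embedding table tiny relative to the rest of
# a small model. All numbers below are illustrative assumptions.
hidden_dim = 512  # assumed hidden size for a small model
for vocab_size in (6_400, 64_000, 150_000):
    emb_params = vocab_size * hidden_dim
    print(f"vocab {vocab_size:>7,} -> embedding params ~ {emb_params / 1e6:.1f}M")
```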