update readme info

gongjy 2024-09-27 16:38:18 +08:00
parent a8ae342775
commit 1cc73836d4
2 changed files with 4 additions and 6 deletions

View File

@@ -80,7 +80,7 @@ https://github.com/user-attachments/assets/88b98128-636e-43bc-a419-b1b1403c2055
<details close>
<summary> <b>2024-09-27</b> </summary>
- 09-27 update: changed the pretrain dataset preprocessing; to preserve text integrity, preprocessing into the .bin training format was abandoned (slightly sacrificing training speed).
- 👉09-27 update: changed the pretrain dataset preprocessing; to preserve text integrity, preprocessing into the .bin training format was abandoned (slightly sacrificing training speed).
- The preprocessed pretrain file is currently named pretrain_data.csv.
@@ -252,8 +252,7 @@ streamlit run fast_inference.py
<tr><td>minimind tokenizer</td><td>6,400</td><td>Custom</td></tr>
</table>
> [!TIP]
> 2024-09-17 update: to avoid ambiguity with past versions and to control model size, all minimind models now use the minimind_tokenizer for tokenization; all mistral_tokenizer versions are deprecated.
> 👉2024-09-17 update: to avoid ambiguity with past versions and to control model size, all minimind models now use the minimind_tokenizer for tokenization; all mistral_tokenizer versions are deprecated.
> Although the minimind_tokenizer has a very small vocabulary and its encoding/decoding efficiency is weaker than Chinese-friendly tokenizers such as qwen2 and glm,
> the minimind models use the self-trained minimind_tokenizer to keep the overall parameters lightweight and to avoid a top-heavy imbalance between the embedding layer and the compute layers, since minimind's vocabulary size is only 6,400.

View File

@@ -87,7 +87,7 @@ We hope this open-source project helps LLM beginners get started quickly!
<details close>
<summary> <b>2024-09-27</b> </summary>
- Updated the preprocessing method for the pretrain dataset on 09-27 to ensure text integrity, opting to abandon preprocessing into the .bin training format (slightly sacrificing training speed).
- 👉Updated the preprocessing method for the pretrain dataset on 09-27 to ensure text integrity, opting to abandon preprocessing into the .bin training format (slightly sacrificing training speed).
- The current filename for the pretrain data after preprocessing is: pretrain_data.csv.
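As a side note on what the .csv format implies for training, the sketch below is a minimal, hypothetical loader (not the repository's actual code) that reads raw text rows from a file like pretrain_data.csv and tokenizes them on the fly; the "text" column name, the tokenizer interface, and the max length are assumptions.

```python
# Hypothetical sketch of loading a raw-text pretrain CSV (not minimind's actual loader).
# Assumes a "text" column and a tokenizer with an .encode() method returning token ids.
import pandas as pd

def iter_pretrain_samples(csv_path, tokenizer, max_length=512):
    df = pd.read_csv(csv_path)               # each row keeps one document intact
    for text in df["text"].astype(str):
        # Tokenizing at load time is slower than reading fixed-size chunks from a
        # pre-tokenized .bin file, but the original text is never split mid-document.
        yield tokenizer.encode(text)[:max_length]
```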
@@ -282,8 +282,7 @@ git clone https://github.com/jingyaogong/minimind.git
<tr><td>minimind tokenizer</td><td>6,400</td><td>Custom</td></tr>
</table>
> [!IMPORTANT]
> Update on 2024-09-17: To avoid ambiguity from previous versions and control the model size, all Minimind models now
> 👉Update on 2024-09-17: To avoid ambiguity from previous versions and control the model size, all Minimind models now
> use the Minimind_tokenizer for tokenization, and all versions of the Mistral_tokenizer have been deprecated.
> Although the Minimind_tokenizer has a small vocabulary and its encoding/decoding efficiency is weaker compared to
> Chinese-friendly tokenizers such as Qwen2 and GLM, the Minimind models keep the self-trained Minimind_tokenizer to stay lightweight overall and avoid a top-heavy imbalance between the embedding layer and the compute layers, since the vocabulary size is only 6,400.
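To make the "avoid a top-heavy embedding layer" point concrete, here is a rough back-of-the-envelope calculation; the hidden size and the larger vocabulary sizes below are illustrative assumptions, not figures from this commit.

```python
# Back-of-the-envelope: embedding parameters grow as vocab_size * hidden_dim,
# so a small vocabulary keeps the embedding table tiny relative to the rest of
# a small model. All numbers below are illustrative assumptions.
hidden_dim = 512  # assumed hidden size for a small model
for vocab_size in (6_400, 64_000, 150_000):
    emb_params = vocab_size * hidden_dim
    print(f"vocab {vocab_size:>7,} -> embedding params ~ {emb_params / 1e6:.1f}M")
```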