update readme info

parent a8ae342775
commit 1cc73836d4
@@ -80,7 +80,7 @@ https://github.com/user-attachments/assets/88b98128-636e-43bc-a419-b1b1403c2055
<details close>
<summary> <b>2024-09-27</b> </summary>
- 09-27 update: changed the pretrain dataset preprocessing; to preserve text integrity, preprocessing into the .bin training format was abandoned (slightly sacrificing training speed).
- 👉09-27 update: changed the pretrain dataset preprocessing; to preserve text integrity, preprocessing into the .bin training format was abandoned (slightly sacrificing training speed).
- The file produced by pretrain preprocessing is currently named pretrain_data.csv.
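
For illustration only, here is a minimal sketch of what loading that CSV and tokenizing on the fly might look like (the `text` column name, the tokenizer interface, and the sequence length are assumptions, not the repository's actual code):

```python
import pandas as pd
from torch.utils.data import Dataset

class PretrainDataset(Dataset):
    """Reads raw rows from pretrain_data.csv and tokenizes on the fly,
    keeping each text intact instead of packing it into a .bin file."""

    def __init__(self, csv_path, tokenizer, max_length=512):
        # Assumed column name "text"; adjust to the actual CSV schema.
        self.texts = pd.read_csv(csv_path)["text"].astype(str).tolist()
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.texts[idx],
            max_length=self.max_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        ids = enc.input_ids.squeeze(0)
        return ids[:-1], ids[1:]  # inputs and next-token targets
```

This trades some throughput (tokenization now happens inside the data loader) for guaranteed text integrity, which matches the stated motivation for dropping the .bin format.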
@@ -252,8 +252,7 @@ streamlit run fast_inference.py
<tr><td>minimind tokenizer</td><td>6,400</td><td>自定义</td></tr>
</table>
> [!TIP]
> 2024-09-17 update: to avoid ambiguity with past versions and to control the model size, all minimind models use the minimind_tokenizer for tokenization, and all mistral_tokenizer versions are deprecated.
> 👉2024-09-17 update: to avoid ambiguity with past versions and to control the model size, all minimind models use the minimind_tokenizer for tokenization, and all mistral_tokenizer versions are deprecated.
> Although the minimind_tokenizer is small, its encoding/decoding efficiency is weaker than that of Chinese-friendly tokenizers such as qwen2 and glm.
> However, the minimind models use the self-trained minimind_tokenizer to keep the overall parameter count light and to avoid a top-heavy imbalance between the embedding layer and the computation layers, since minimind's vocabulary size is only 6,400.
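
To see why a small vocabulary matters at this scale, here is a back-of-envelope calculation (the hidden size is an illustrative assumption, not minimind's exact configuration):

```python
# Back-of-envelope: how vocabulary size drives embedding-layer cost.
# hidden_size is an illustrative assumption, not minimind's exact config.
hidden_size = 512

for vocab_size in (6_400, 64_000):
    embed_params = vocab_size * hidden_size  # token-embedding matrix entries
    print(f"vocab={vocab_size:>6} -> {embed_params / 1e6:.1f}M embedding params")

# vocab=  6400 -> 3.3M embedding params
# vocab= 64000 -> 32.8M embedding params
```

With a vocabulary ten times larger, the embedding matrix alone could outweigh the transformer layers of a model this small, which is exactly the top-heavy imbalance the 6,400-token vocabulary avoids.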
@@ -87,7 +87,7 @@ We hope this open-source project helps LLM beginners get started quickly!
<details close>
<summary> <b>2024-09-27</b> </summary>
- Updated the preprocessing method for the pretrain dataset on 09-27 to ensure text integrity, opting to abandon the preprocessing into .bin training format (slightly sacrificing training speed).
- 👉Updated the preprocessing method for the pretrain dataset on 09-27 to ensure text integrity, opting to abandon the preprocessing into .bin training format (slightly sacrificing training speed).
- The current filename for the pretrain data after preprocessing is: pretrain_data.csv.
@@ -282,8 +282,7 @@ git clone https://github.com/jingyaogong/minimind.git
<tr><td>minimind tokenizer</td><td>6,400</td><td>Custom</td></tr>
</table>
> [!IMPORTANT]
> Update on 2024-09-17: To avoid ambiguity from previous versions and control the model size, all Minimind models now
> 👉Update on 2024-09-17: To avoid ambiguity from previous versions and control the model size, all Minimind models now
> use the Minimind_tokenizer for tokenization, and all versions of the Mistral_tokenizer have been deprecated.
> Although the Minimind_tokenizer has a small length and its encoding/decoding efficiency is weaker compared to
> Chinese-friendly tokenizers such as qwen2 and glm, the Minimind models use the self-trained Minimind_tokenizer to keep the overall parameters lightweight and avoid a top-heavy imbalance between the embedding and computation layers, since Minimind's vocabulary size is only 6,400.