update readme

gongjy 2024-09-12 22:07:41 +08:00
parent edb26ced7b
commit 4ceaa04d05
2 changed files with 21 additions and 21 deletions

@@ -214,25 +214,25 @@ streamlit run fast_inference.py
Because the LLM is extremely small, the vocabulary size has to be kept fairly small to avoid a top-heavy model (where the embedding layer's parameters would account for too large a share of the whole LLM).
Strong open-source models such as 01.AI, Qwen, ChatGLM, Mistral, and Llama3 use the following tokenizer vocabulary sizes:
| Tokenizer Model     | Vocabulary Size | Source                |
|---------------------|-----------------|-----------------------|
| yi tokenizer        | 64,000          | 01-AI (China)         |
| qwen2 tokenizer     | 151,643         | Alibaba Cloud (China) |
| glm tokenizer       | 151,329         | Zhipu AI (China)      |
| mistral tokenizer   | 32,000          | Mistral AI (France)   |
| llama3 tokenizer    | 128,000         | Meta (USA)            |
| minimind tokenizer  | 6,400           | Custom                |
> Although Mistral's vocabulary contains relatively few Chinese words and its Chinese encoding/decoding efficiency is weaker than that of Chinese-friendly tokenizers such as qwen2 and glm,
> MiniMind still chose the mistral tokenizer to keep the overall parameter count light and avoid a top-heavy model, since mistral's vocabulary size is only 32,000.
> In practice, MiniMind has almost never failed to decode rare words during testing, and the results are good.
> To make comparisons easier, an additional version with a custom tokenizer, **MiniMind-small-T**, was trained; its vocabulary is compressed to 6,400 entries, bringing the total LLM parameters down to about 26M.
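
If you want to verify these vocabulary sizes yourself, the quick sketch below uses `transformers` to load each tokenizer and print its size. The Hub repo IDs and the local minimind path are illustrative assumptions, not names guaranteed by this project.

```python
# Rough sketch: compare tokenizer vocabulary sizes with `transformers`.
# NOTE: the repo IDs / local path below are illustrative assumptions;
# substitute whatever checkpoints you actually have access to.
from transformers import AutoTokenizer

candidates = {
    "qwen2": "Qwen/Qwen2-7B",                  # assumed Hub repo ID
    "mistral": "mistralai/Mistral-7B-v0.1",    # assumed Hub repo ID
    "minimind": "./model/minimind_tokenizer",  # assumed local tokenizer directory
}

for name, path in candidates.items():
    tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
    # len(tok) also counts added special tokens, so it may differ slightly
    # from the nominal vocabulary size quoted in the table above.
    print(f"{name}: vocab_size={tok.vocab_size}, with_special_tokens={len(tok)}")
```
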
---
- 📙 [Pretrain Data]:
  The [Seq-Monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md) / [Seq-Monkey Baidu Netdisk](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666)
  is compiled and cleaned from data of various public sources (such as web pages, encyclopedias, blogs, open-source code, books, etc.), organized into a unified JSONL format, and strictly filtered and deduplicated to ensure comprehensiveness, scale, reliability, and high quality.
  The total is roughly 10B tokens, suitable for pretraining Chinese large language models.
@@ -271,7 +271,7 @@ streamlit run fast_inference.py
| MiniMind Training Dataset    | Download Link                                                                                                                                             |
|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| **[Tokenizer Training Set]** | [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset/tree/main) / [Baidu Netdisk](https://pan.baidu.com/s/1yAw1LVTftuhQGAC1Y9RdYQ?pwd=6666) |
| **[Pretrain Data]**          | [Seq-Monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY) / [Baidu Netdisk](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666) |
| **[SFT Data]**               | [Jiangshu Large Model SFT Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data/resolve/master/sft_data_zh.jsonl)                          |
| **[DPO Data]**               | [Huozi Dataset 1](https://huggingface.co/datasets/Skepsun/huozi_rlhf_data_json)                                                                             |
| **[DPO Data]**               | [Huozi Dataset 2](https://huggingface.co/datasets/beyond/rlhf-reward-single-round-trans_chinese)                                                            |
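
For the HuggingFace-hosted rows in the table above, a minimal download sketch using `huggingface_hub` could look like the following; the local directory is an arbitrary choice, and the Baidu Netdisk / ModelScope links still have to be fetched with their own tools.

```python
# Minimal sketch: pull the HuggingFace-hosted dataset files locally.
# The repo ID comes from the table above; local_dir is an arbitrary choice.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="jingyaogong/minimind_dataset",
    repo_type="dataset",                     # dataset repo, not a model repo
    local_dir="./dataset/minimind_dataset",
)
print("files downloaded to:", local_path)
```
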

@@ -75,7 +75,8 @@ The project includes:
- Public MiniMind model code (including Dense and MoE models), code for Pretrain, SFT instruction fine-tuning, LoRA
  fine-tuning, and DPO preference optimization, along with datasets and sources.
- Compatibility with popular frameworks such as `transformers`, `accelerate`, `trl`, and `peft`.
- Training support for single-GPU and multi-GPU setups (DDP, DeepSpeed). The training process allows for stopping and
  resuming at any point.
- Code for testing the model on the Ceval dataset.
- Implementation of a basic chat interface compatible with OpenAI's API, facilitating integration into third-party Chat
@@ -223,7 +224,6 @@ git clone https://github.com/jingyaogong/minimind.git
deepspeed --master_port 29500 --num_gpus=N 3-full_sft.py
```
# 📌 Data sources
- 🤖 Tokenizer: In NLP, a Tokenizer is similar to a dictionary, mapping words from natural language to numbers like 0, 1,
@@ -245,7 +245,7 @@ git clone https://github.com/jingyaogong/minimind.git
sizes:
| Tokenizer Model      | Vocabulary Size  | Source                |
|----------------------|------------------|-----------------------|
| yi tokenizer         | 64,000           | 01-AI (China)         |
| qwen2 tokenizer      | 151,643          | Alibaba Cloud (China) |
| glm tokenizer        | 151,329          | Zhipu AI (China)      |
@@ -264,7 +264,7 @@ git clone https://github.com/jingyaogong/minimind.git
---
- 📙 **[Pretrain Data](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)**:
  The [Seq-Monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md) / [Baidu](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666)
  is a collection of data from various public sources such as websites, encyclopedias, blogs, open-source code, books,
  etc. It has been compiled, cleaned, and organized into a unified JSONL format, with rigorous filtering and
  deduplication to ensure data comprehensiveness, scale, reliability, and high quality. The total amount is
@@ -307,7 +307,7 @@ git clone https://github.com/jingyaogong/minimind.git
| MiniMind Training Dataset | Download Link                                                                                                                                               |
|---------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **[tokenizer Data]**      | [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset/tree/main) / [Baidu](https://pan.baidu.com/s/1yAw1LVTftuhQGAC1Y9RdYQ?pwd=6666)     |
| **[Pretrain Data]**       | [Seq-Monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY) / [Baidu](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666)         |
| **[SFT Data]**            | [Jiangshu Large Model SFT Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data/resolve/master/sft_data_zh.jsonl)                            |
| **[DPO Data]**            | [Huozi Dataset 1](https://huggingface.co/datasets/Skepsun/huozi_rlhf_data_json)                                                                               |
| **[DPO Data]**            | [Huozi Dataset 2](https://huggingface.co/datasets/beyond/rlhf-reward-single-round-trans_chinese)                                                              |