From 4ceaa04d05bdb2afd8deaff47a1d8d8c9de3a16f Mon Sep 17 00:00:00 2001
From: gongjy <2474590974@qq.com>
Date: Thu, 12 Sep 2024 22:07:41 +0800
Subject: [PATCH] update readme

---
 README.md    | 32 ++++++++++++++++----------------
 README_en.md | 10 +++++-----
 2 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/README.md b/README.md
index a716b30..ac78220 100644
--- a/README.md
+++ b/README.md
@@ -214,25 +214,25 @@ streamlit run fast_inference.py
 
 Because the LLM itself is very small, the vocabulary has to be kept short to avoid a top-heavy model (otherwise the embedding layer would account for too large a share of the total parameters).
 Powerful open-source models such as 01-AI, Qwen, ChatGLM, Mistral, and Llama3 have the following tokenizer vocabulary sizes:
 
-    | Tokenizer Model    | Vocabulary Size | Source                |
-    |--------------------|-----------------|-----------------------|
-    | yi tokenizer       | 64,000          | 01-AI (China)         |
-    | qwen2 tokenizer    | 151,643         | Alibaba Cloud (China) |
-    | glm tokenizer      | 151,329         | Zhipu AI (China)      |
-    | mistral tokenizer  | 32,000          | Mistral AI (France)   |
-    | llama3 tokenizer   | 128,000         | Meta (USA)            |
-    | minimind tokenizer | 6,400           | Custom                |
-
-    > Although the Mistral tokenizer covers few Chinese words and its encoding/decoding efficiency for Chinese is weaker than Chinese-friendly tokenizers such as qwen2 and glm,
-    > MiniMind uses the mistral tokenizer to keep the overall parameter count light and avoid a top-heavy model, since its vocabulary size is only 32,000.
-    > In practical tests MiniMind has almost never failed to decode rare words, and the results are good.
-
-    > For easier comparison, an additional version with a custom tokenizer, **MiniMind-small-T**, was trained; its vocabulary is compressed to 6,400 entries, further reducing the total LLM parameters to about 26M.
+    | Tokenizer Model    | Vocabulary Size | Source                |
+    |--------------------|-----------------|-----------------------|
+    | yi tokenizer       | 64,000          | 01-AI (China)         |
+    | qwen2 tokenizer    | 151,643         | Alibaba Cloud (China) |
+    | glm tokenizer      | 151,329         | Zhipu AI (China)      |
+    | mistral tokenizer  | 32,000          | Mistral AI (France)   |
+    | llama3 tokenizer   | 128,000         | Meta (USA)            |
+    | minimind tokenizer | 6,400           | Custom                |
+
+    > Although the Mistral tokenizer covers few Chinese words and its encoding/decoding efficiency for Chinese is weaker than Chinese-friendly tokenizers such as qwen2 and glm,
+    > MiniMind uses the mistral tokenizer to keep the overall parameter count light and avoid a top-heavy model, since its vocabulary size is only 32,000.
+    > In practical tests MiniMind has almost never failed to decode rare words, and the results are good.
+
+    > For easier comparison, an additional version with a custom tokenizer, **MiniMind-small-T**, was trained; its vocabulary is compressed to 6,400 entries, further reducing the total LLM parameters to about 26M.
 
 ---
 
 - 📙 [Pretrain Data]:
-  [seq-monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)
+  [Seq-Monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md) / [Seq-Monkey Baidu Netdisk](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666)
   It is compiled and cleaned from data of various public sources (web pages, encyclopedias, blogs, open-source code, books, etc.), organized into a unified JSONL format, and strictly filtered and deduplicated to ensure comprehensiveness, scale, reliability, and high quality. The total is roughly 10B tokens, suitable for pretraining Chinese large language models.
 
@@ -271,7 +271,7 @@ streamlit run fast_inference.py
 | MiniMind Training Dataset    | Download Link                                                                                                                                              |
 |------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | **[tokenizer training set]** | [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset/tree/main) / [Baidu Netdisk](https://pan.baidu.com/s/1yAw1LVTftuhQGAC1Y9RdYQ?pwd=6666) |
-| **[Pretrain Data]**          | [seq-monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY)                                                                          |
+| **[Pretrain Data]**          | [Seq-Monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY) / [Baidu Netdisk](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666) |
 | **[SFT Data]**               | [Jiangshu Large Model SFT Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data/resolve/master/sft_data_zh.jsonl)                         |
 | **[DPO Data]**               | [Huozi Dataset 1](https://huggingface.co/datasets/Skepsun/huozi_rlhf_data_json)                                                                            |
 | **[DPO Data]**               | [Huozi Dataset 2](https://huggingface.co/datasets/beyond/rlhf-reward-single-round-trans_chinese)                                                           |
diff --git a/README_en.md b/README_en.md
index 465fd12..023cebd 100644
--- a/README_en.md
+++ b/README_en.md
@@ -75,7 +75,8 @@ The project includes:
 - Public MiniMind model code (including Dense and MoE models), code for Pretrain, SFT instruction fine-tuning, LoRA fine-tuning, and DPO preference optimization, along with
   datasets and sources.
 - Compatibility with popular frameworks such as `transformers`, `accelerate`, `trl`, and `peft`.
-- Training support for single-GPU and multi-GPU setups(DDP、DeepSpeed). The training process allows for stopping and resuming at any
+- Training support for single-GPU and multi-GPU setups(DDP、DeepSpeed). The training process allows for stopping and
+  resuming at any
   point.
 - Code for testing the model on the Ceval dataset.
 - Implementation of a basic chat interface compatible with OpenAI's API, facilitating integration into third-party Chat
@@ -223,7 +224,6 @@ git clone https://github.com/jingyaogong/minimind.git
 deepspeed --master_port 29500 --num_gpus=N 3-full_sft.py
 ```
-
 
 # 📌 Data sources
 
 - 🤖 Tokenizer: In NLP, a Tokenizer is similar to a dictionary, mapping words from natural language to numbers like 0, 1,
@@ -245,7 +245,7 @@ git clone https://github.com/jingyaogong/minimind.git
   sizes:
 
   | Tokenizer Model      | Vocabulary Size  | Source                |
-  |----------------------|------------------|-----------------------|
+  |----------------------|------------------|-----------------------|
   | yi tokenizer         | 64,000           | 01-AI (China)         |
   | qwen2 tokenizer      | 151,643          | Alibaba Cloud (China) |
  | glm tokenizer        | 151,329          | Zhipu AI (China)      |
@@ -264,7 +264,7 @@ git clone https://github.com/jingyaogong/minimind.git
 ---
 
 - 📙 **[Pretrain Data](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)**:
-  The [Seq-Monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)
+  The [Seq-Monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md) / [Baidu](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666)
   is a collection of data from various public sources such as websites, encyclopedias, blogs, open-source code, books,
   etc. It has been compiled, cleaned, and organized into a unified JSONL format, with rigorous filtering and
   deduplication to ensure data comprehensiveness, scale, reliability, and high quality. The total amount is
@@ -307,7 +307,7 @@ git clone https://github.com/jingyaogong/minimind.git
 | MiniMind Training Dataset | Download Link |
 |---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
 | **[tokenizer Data]**      | [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset/tree/main) / [Baidu](https://pan.baidu.com/s/1yAw1LVTftuhQGAC1Y9RdYQ?pwd=6666) |
-| **[Pretrain Data]**       | [Seq-Monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY) |
+| **[Pretrain Data]**       | [Seq-Monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY) / [Baidu](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666) |
 | **[SFT Data]**            | [Jiangshu Large Model SFT Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data/resolve/master/sft_data_zh.jsonl) |
 | **[DPO Data]**            | [Huozi Dataset 1](https://huggingface.co/datasets/Skepsun/huozi_rlhf_data_json) |
 | **[DPO Data]**            | [Huozi Dataset 2](https://huggingface.co/datasets/beyond/rlhf-reward-single-round-trans_chinese) |
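Both README hunks above center on the trade-off between tokenizer vocabulary size and encoding efficiency for Chinese text. Below is a minimal sketch of how that comparison can be reproduced, assuming the `transformers` library is installed; the Hugging Face repo IDs used are illustrative assumptions and are not taken from the patch.

```python
# Minimal sketch: compare vocabulary size and Chinese tokenization efficiency.
# The repo IDs are illustrative assumptions, not part of the patch above;
# substitute any tokenizer available locally (e.g. the custom minimind tokenizer).
from transformers import AutoTokenizer

sample = "小模型也需要高效的中文分词。"  # arbitrary Chinese sample sentence

for repo_id in ["mistralai/Mistral-7B-v0.1", "Qwen/Qwen2-7B"]:
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    ids = tokenizer.encode(sample, add_special_tokens=False)
    # A larger, Chinese-friendly vocabulary generally yields fewer tokens for the
    # sample, while a smaller vocabulary keeps the embedding layer (and model) lighter.
    print(f"{repo_id}: vocab size = {len(tokenizer)}, tokens for sample = {len(ids)}")
```

Run against the tokenizers listed in the table, a script like this makes the size-versus-efficiency trade-off that motivates the 6,400-entry MiniMind vocabulary directly observable.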