update readme

gongjy 2024-09-12 22:07:41 +08:00
parent edb26ced7b
commit 4ceaa04d05
2 changed files with 21 additions and 21 deletions

View File

@@ -232,7 +232,7 @@ streamlit run fast_inference.py
 ---
 - 📙 [Pretrain Data]
-  [seq-monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)
+  [Seq-Monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md) / [Seq-Monkey Baidu Netdisk](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666)
   Compiled and cleaned from data drawn from a variety of public sources, such as web pages, encyclopedias, blogs, open-source code, and books. It has been organized into a unified JSONL format and put through strict filtering and deduplication to ensure comprehensiveness, scale, reliability, and high quality. The total is roughly 10B
   tokens, suitable for pretraining Chinese large language models.
@@ -271,7 +271,7 @@ streamlit run fast_inference.py
 | MiniMind Training Dataset | Download Link |
 |--------------------|----------------|
 | **[Tokenizer Training Set]** | [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset/tree/main) / [Baidu Netdisk](https://pan.baidu.com/s/1yAw1LVTftuhQGAC1Y9RdYQ?pwd=6666) |
-| **[Pretrain Data]** | [seq-monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY) |
+| **[Pretrain Data]** | [Seq-Monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY) / [Baidu Netdisk](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666) |
 | **[SFT Data]** | [Jiangshu Large Model SFT Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data/resolve/master/sft_data_zh.jsonl) |
 | **[DPO Data]** | [Huozi Dataset 1](https://huggingface.co/datasets/Skepsun/huozi_rlhf_data_json) |
 | **[DPO Data]** | [Huozi Dataset 2](https://huggingface.co/datasets/beyond/rlhf-reward-single-round-trans_chinese) |

View File

@@ -75,7 +75,8 @@ The project includes:
 - Public MiniMind model code (including Dense and MoE models), code for Pretrain, SFT instruction fine-tuning, LoRA
   fine-tuning, and DPO preference optimization, along with datasets and sources.
 - Compatibility with popular frameworks such as `transformers`, `accelerate`, `trl`, and `peft`.
-- Training support for single-GPU and multi-GPU setups (DDP, DeepSpeed). The training process allows for stopping and resuming at any
+- Training support for single-GPU and multi-GPU setups (DDP, DeepSpeed). The training process allows for stopping and
+  resuming at any
   point.
 - Code for testing the model on the Ceval dataset.
 - Implementation of a basic chat interface compatible with OpenAI's API, facilitating integration into third-party Chat
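The last item above mentions an OpenAI-API-compatible chat interface. A minimal client-side sketch using the standard `openai` Python client; the server address, API key, and model name below are illustrative assumptions, not taken from this commit:

```python
# Query an OpenAI-compatible chat endpoint (illustration only).
# base_url, api_key, and model name are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",  # hypothetical local server address
    api_key="none",                       # local servers typically ignore the key
)

resp = client.chat.completions.create(
    model="minimind",  # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```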
@@ -223,7 +224,6 @@ git clone https://github.com/jingyaogong/minimind.git
 deepspeed --master_port 29500 --num_gpus=N 3-full_sft.py
 ```
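Launchers like `deepspeed` above (or `torchrun` for plain DDP) start one worker process per GPU and pass rank information through environment variables. A script launched this way typically performs DDP setup along the following lines; this is a generic PyTorch pattern for illustration, not the actual `3-full_sft.py` code:

```python
# Generic PyTorch DDP initialization (illustration only, not minimind's code).
# The launcher sets RANK / LOCAL_RANK / WORLD_SIZE, which are read here.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> torch.nn.Module:
    dist.init_process_group(backend="nccl")     # NCCL backend for CUDA GPUs
    local_rank = int(os.environ["LOCAL_RANK"])  # set by deepspeed/torchrun
    torch.cuda.set_device(local_rank)           # pin this process to one GPU
    return DDP(model.to(local_rank), device_ids=[local_rank])
```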
 # 📌 Data sources
 - 🤖 Tokenizer: In NLP, a Tokenizer is similar to a dictionary, mapping words from natural language to numbers like 0, 1,
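To make that word-to-number mapping concrete, here is a minimal round-trip sketch using `transformers` (one of the frameworks the project supports); the local tokenizer path is a hypothetical placeholder, not taken from this commit:

```python
# Tokenizer round-trip: text -> token IDs -> text (illustration only).
# The tokenizer path is a hypothetical placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./model/minimind_tokenizer")
ids = tokenizer.encode("hello world")  # IDs depend on the trained vocabulary
print(ids)
print(tokenizer.decode(ids))           # maps the numbers back to text
```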
@@ -264,7 +264,7 @@ git clone https://github.com/jingyaogong/minimind.git
 ---
 - 📙 **[Pretrain Data](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)**:
-  The [Seq-Monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)
+  The [Seq-Monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md) / [Baidu](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666)
   is a collection of data from various public sources such as websites, encyclopedias, blogs, open-source code, books,
   etc. It has been compiled, cleaned, and organized into a unified JSONL format, with rigorous filtering and
   deduplication to ensure data comprehensiveness, scale, reliability, and high quality. The total amount is
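Since the corpus ships as unified JSONL, one JSON record per line, a minimal streaming reader might look like the sketch below; the file name and the assumption that each record carries a `text` field are illustrative, not confirmed by this commit:

```python
# Stream a JSONL corpus: one JSON object per line (illustration only).
# The file name and the "text" key are assumptions.
import json

with open("pretrain_data.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        text = record.get("text", "")
        # hand `text` to the tokenizer / pretraining pipeline here
```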
@@ -307,7 +307,7 @@ git clone https://github.com/jingyaogong/minimind.git
 | MiniMind Training Dataset | Download Link |
 |---------------------------|---------------|
 | **[tokenizer Data]** | [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset/tree/main) / [Baidu](https://pan.baidu.com/s/1yAw1LVTftuhQGAC1Y9RdYQ?pwd=6666) |
-| **[Pretrain Data]** | [Seq-Monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY) |
+| **[Pretrain Data]** | [Seq-Monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY) / [Baidu](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666) |
 | **[SFT Data]** | [Jiangshu Large Model SFT Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data/resolve/master/sft_data_zh.jsonl) |
 | **[DPO Data]** | [Huozi Dataset 1](https://huggingface.co/datasets/Skepsun/huozi_rlhf_data_json) |
 | **[DPO Data]** | [Huozi Dataset 2](https://huggingface.co/datasets/beyond/rlhf-reward-single-round-trans_chinese) |