update readme
parent edb26ced7b
commit 4ceaa04d05
@@ -232,7 +232,7 @@ streamlit run fast_inference.py
---
- 📙 **[Pretrain Data]**:
-  [seq-monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)
+  [Seq-Monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md) / [Seq-Monkey Baidu Netdisk](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666)
  It is compiled and cleaned from a variety of public sources (web pages, encyclopedias, blogs, open-source code, books, etc.), organized into a unified JSONL format, and rigorously filtered and deduplicated to ensure comprehensiveness, scale, reliability, and high quality. The total is roughly 10B tokens, suitable for pretraining Chinese large language models.
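Since the corpus is plain JSONL, it is easy to sanity-check before training. Below is a minimal sketch for peeking at the first few records; the `text` field name and the local file name are assumptions, not guaranteed by the dataset docs:

```python
import json

def peek_jsonl(path: str, n: int = 3) -> None:
    """Print the first n records of a JSONL corpus."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            record = json.loads(line)      # one JSON object per line
            text = record.get("text", "")  # "text" field is an assumption
            print(f"record {i}: {len(text)} chars -> {text[:80]!r}")

peek_jsonl("pretrain_data.jsonl")          # hypothetical local file name
```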
@@ -271,7 +271,7 @@ streamlit run fast_inference.py
| MiniMind Training Dataset | Download Link |
|--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| **[tokenizer Training Set]** | [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset/tree/main) / [Baidu Netdisk](https://pan.baidu.com/s/1yAw1LVTftuhQGAC1Y9RdYQ?pwd=6666) |
-| **[Pretrain Data]** | [seq-monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY) |
+| **[Pretrain Data]** | [Seq-Monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY) / [Baidu Netdisk](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666) |
| **[SFT Data]** | [Jiangshu Large Model SFT Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data/resolve/master/sft_data_zh.jsonl) |
| **[DPO Data]** | [Huozi Dataset 1](https://huggingface.co/datasets/Skepsun/huozi_rlhf_data_json) |
| **[DPO Data]** | [Huozi Dataset 2](https://huggingface.co/datasets/beyond/rlhf-reward-single-round-trans_chinese) |
@@ -75,7 +75,8 @@ The project includes:
- Public MiniMind model code (including Dense and MoE models), code for Pretrain, SFT instruction fine-tuning, LoRA
fine-tuning, and DPO preference optimization, along with datasets and sources.
- Compatibility with popular frameworks such as `transformers`, `accelerate`, `trl`, and `peft`.
-- Training support for single-GPU and multi-GPU setups (DDP, DeepSpeed). The training process allows for stopping and resuming at any
+- Training support for single-GPU and multi-GPU setups (DDP, DeepSpeed). The training process allows for stopping and
+  resuming at any
point.
- Code for testing the model on the Ceval dataset.
- Implementation of a basic chat interface compatible with OpenAI's API, facilitating integration into third-party Chat
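The stop-and-resume bullet above boils down to periodic checkpointing: persist model and optimizer state every so often, and reload it on startup. A minimal sketch of the idea, with a placeholder model, path, and interval rather than the project's actual training loop:

```python
import os
import torch

CKPT = "checkpoint.pt"            # placeholder path
model = torch.nn.Linear(8, 8)     # stand-in for the real model
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

start_step = 0
if os.path.exists(CKPT):          # resume if a checkpoint exists
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optim.load_state_dict(state["optim"])
    start_step = state["step"] + 1

for step in range(start_step, 1000):
    loss = model(torch.randn(4, 8)).pow(2).mean()  # dummy objective
    optim.zero_grad()
    loss.backward()
    optim.step()
    if step % 100 == 0:           # save periodically, so an interrupted
        torch.save({"model": model.state_dict(),   # run loses at most
                    "optim": optim.state_dict(),   # 100 steps of work
                    "step": step}, CKPT)
```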
@@ -223,7 +224,6 @@ git clone https://github.com/jingyaogong/minimind.git
deepspeed --master_port 29500 --num_gpus=N 3-full_sft.py
```
# 📌 Data sources
- 🤖 Tokenizer: In NLP, a Tokenizer is similar to a dictionary, mapping words from natural language to numbers like 0, 1,
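The dictionary analogy is easy to see in code. A sketch using Hugging Face `transformers`; the local tokenizer path is an assumption about the repo layout:

```python
from transformers import AutoTokenizer

# Text -> integer ids -> text, like looking a word up in a dictionary
# both ways. The path assumes the repo ships its tokenizer under
# ./model/minimind_tokenizer; adjust if the layout differs.
tokenizer = AutoTokenizer.from_pretrained("./model/minimind_tokenizer")

ids = tokenizer.encode("hello world")
print(ids)                    # a short list of integers
print(tokenizer.decode(ids))  # back to (roughly) the original text
```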
@@ -264,7 +264,7 @@ git clone https://github.com/jingyaogong/minimind.git
---
- 📙 **[Pretrain Data](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)**:
-  The [Seq-Monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)
+  The [Seq-Monkey General Text Dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md) / [Baidu](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666)
is a collection of data from various public sources such as websites, encyclopedias, blogs, open-source code, books,
etc. It has been compiled, cleaned, and organized into a unified JSONL format, with rigorous filtering and
deduplication to ensure data comprehensiveness, scale, reliability, and high quality. The total amount is
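For pretraining, such a JSONL corpus typically gets tokenized and chunked into fixed-length blocks. A hedged sketch of that step, complementing the record-peeking snippet earlier; the `text` field, block size, and `tokenize` callable are assumptions, not MiniMind's actual dataset code:

```python
import json
import torch
from torch.utils.data import Dataset

class JsonlPretrainDataset(Dataset):
    """Cut tokenized JSONL documents into fixed-length next-token blocks."""

    def __init__(self, path, tokenize, block_size=512):
        self.blocks = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                ids = tokenize(json.loads(line)["text"])  # "text" assumed
                # non-overlapping blocks, one extra id for the shifted target
                for i in range(0, len(ids) - block_size, block_size):
                    self.blocks.append(ids[i:i + block_size + 1])

    def __len__(self):
        return len(self.blocks)

    def __getitem__(self, idx):
        block = torch.tensor(self.blocks[idx], dtype=torch.long)
        return block[:-1], block[1:]  # inputs, next-token targets
```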
@@ -307,7 +307,7 @@ git clone https://github.com/jingyaogong/minimind.git
| MiniMind Training Dataset | Download Link |
|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| **[tokenizer Data]** | [HuggingFace](https://huggingface.co/datasets/jingyaogong/minimind_dataset/tree/main) / [Baidu](https://pan.baidu.com/s/1yAw1LVTftuhQGAC1Y9RdYQ?pwd=6666) |
-| **[Pretrain Data]**       | [Seq-Monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY)                                                                          |
+| **[Pretrain Data]**       | [Seq-Monkey General Text Dataset](http://share.mobvoi.com:5000/sharing/O91blwPkY) / [Baidu](https://pan.baidu.com/s/114F1k3eksiWCOQLvaT3RYQ?pwd=6666)       |
| **[SFT Data]** | [Jiangshu Large Model SFT Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data/resolve/master/sft_data_zh.jsonl) |
| **[DPO Data]** | [Huozi Dataset 1](https://huggingface.co/datasets/Skepsun/huozi_rlhf_data_json) |
| **[DPO Data]** | [Huozi Dataset 2](https://huggingface.co/datasets/beyond/rlhf-reward-single-round-trans_chinese) |