update readme format

2024-08-28 16:50:40 +08:00 · 4d1d4fae0a
commit 4d1d4fae0a
parent 8be42693f6
2 changed files with 10 additions and 13 deletions
--- a/README.md
+++ b/README.md
@@ -9,6 +9,10 @@
 </div>
 <div align="center">
  <h3>"The Greatest Path is the Simplest"</h3>
 </div>
 <div align="center">
 中文 | [English](./README_en.md)
@@ -16,12 +20,6 @@
 </div>
 <p align="center">
  <span style="font-size: 2em; font-weight: bold;">
    "The Greatest Path is the Simplest"<br/>
  </span>
 </p>
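 * This open-source project aims to train **MiniMind**, a tiny language model of only 26M, entirely from scratch.
 * **MiniMind** is extremely lightweight, about $\frac{1}{7000}$ the size of GPT3, and aims to make fast inference, and even training, feasible on a CPU.
 * **MiniMind** is derived from the DeepSeek-V2 and Llama3 architectures. The project covers every stage of data processing, pretrain, sft, and dpo, and includes a Mixture-of-Experts (MoE) model.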
@@ -182,8 +180,7 @@ python 2-eval.py
 ---
-
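+- 📙【Pretrain data】: [seq-monkey general text dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)
 📙【Pretrain data】: [seq-monkey general text dataset](https://github.com/mobvoi/seq-monkey-data/blob/main/docs/pretrain_open_corpus.md)
 is aggregated and cleaned from a variety of public sources (web pages, encyclopedias, blogs, open-source code, books, and so on).
 It is organized into a unified JSONL format and has been strictly filtered and deduplicated to ensure the data is comprehensive, large-scale, trustworthy, and high-quality.
 The total is roughly 10B tokens, suitable for pretraining Chinese large language models.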
--- a/README_en.md
+++ b/README_en.md
@@ -8,17 +8,17 @@
 [![Collection](https://img.shields.io/badge/🤗-MiniMind%20%20Collection-blue)](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)
 </div>
 <div align="center">
  <h3>"The Greatest Path is the Simplest"</h3>
 </div>
 <div align="center">
 [中文](./README.md) | English
 </div>
 <p align="center">
  <span style="font-size: 1.5em; font-weight: bold;">
    "The Greatest Path is the Simplest"<br/>
  </span>
 </p>
 * This open-source project aims to train a miniature language model **MiniMind** from scratch, with a size of just 26MB.
 * **MiniMind** is extremely lightweight, approximately $\frac{1}{7000}$ the size of GPT-3, designed to enable fast