update lr

gongjy 2025-02-11 23:52:40 +08:00
parent fea5b0eafc
commit d2f5ef4355
4 changed files with 59 additions and 43 deletions

File 1 of 4: the Chinese README.

````diff
@@ -209,22 +209,26 @@ git clone https://github.com/jingyaogong/minimind.git
 ## Test an existing model
-### 1. Download the model
+### 1. Set up the environment
+```bash
+pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
+```
+### 2. Download the model
 ```bash
-# step 1
 git clone https://huggingface.co/jingyaogong/MiniMind2
 ```
-### 2. Command-line Q&A
+### 3. Command-line Q&A
 ```bash
-# step 2
-# load=1: load from transformers-hf model
+# load=0: load from pytorch model, load=1: load from transformers-hf model
 python eval_model.py --load 1
 ```
-### 3. Or launch the WebUI
+### 4. Or launch the WebUI
 ```bash
 # may need `python>=3.10`; install `pip install streamlit`
````
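The new `load` comment distinguishes two checkpoint formats. Below is a minimal sketch of what such a switch typically does; the `MiniMindLM` import path, tokenizer directory, and checkpoint filename are assumptions for illustration, not necessarily the repo's real names, and the actual eval_model.py may be organized differently.

```python
# Hypothetical sketch of the --load semantics described in the diff above.
import argparse

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

parser = argparse.ArgumentParser()
parser.add_argument("--load", type=int, default=0,
                    help="0: native PyTorch checkpoint, 1: transformers-hf directory")
args = parser.parse_args()

if args.load == 1:
    # transformers-format weights, e.g. the cloned MiniMind2 directory
    tokenizer = AutoTokenizer.from_pretrained("./MiniMind2")
    model = AutoModelForCausalLM.from_pretrained("./MiniMind2", trust_remote_code=True)
else:
    # a raw state_dict saved via torch.save(model.state_dict(), ...);
    # class name and paths are placeholders, not the repo's actual ones
    from model.model import MiniMindLM  # assumed module path
    tokenizer = AutoTokenizer.from_pretrained("./model/minimind_tokenizer")
    model = MiniMindLM()
    model.load_state_dict(torch.load("./out/full_sft.pth", map_location="cpu"))

model.eval()
```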
````diff
@@ -323,26 +327,29 @@ python eval_model.py --model_mode 1 # default 0: test the pretrain model
 How to launch training on a single machine with N GPUs (DDP; multi-machine multi-GPU clusters are supported)
 ```bash
-torchrun --nproc_per_node 3 train_xxx.py
+torchrun --nproc_per_node N train_xxx.py
 ```
 <details style="color:rgb(128,128,128)">
 <summary>Note: miscellaneous</summary>
-* Launch training on a single machine with N GPUs (DeepSpeed)
+Launch training on a single machine with N GPUs (DeepSpeed)
 ```bash
 deepspeed --master_port 29500 --num_gpus=N train_xxx.py
 ```
-* Optionally enable wandb to record the training run
+Optionally enable wandb to record the training run
 ```bash
 # login required: wandb login
 torchrun --nproc_per_node N train_xxx.py --use_wandb
 # and
 python train_xxx.py --use_wandb
 ```
 Adding the `--use_wandb` flag records the training run, which can be reviewed on the wandb website after training finishes. The `wandb_project` and `wandb_run_name` parameters set the project name and run name.
 </details>
````
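For readers new to the launch commands: `torchrun --nproc_per_node N` spawns N worker processes and exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to each of them. A generic sketch of the bootstrap a `train_xxx.py` performs under it (not minimind's exact code; the model is a placeholder) looks like this:

```python
# Generic DDP bootstrap for a script launched via `torchrun --nproc_per_node N`.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def init_distributed() -> int:
    # torchrun exports WORLD_SIZE / RANK / LOCAL_RANK for every worker process
    if int(os.environ.get("WORLD_SIZE", "1")) > 1:
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        return local_rank
    return 0  # plain `python train_xxx.py` run, single device

local_rank = init_distributed()
device = local_rank if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)  # placeholder for the real model
if dist.is_initialized():
    model = DDP(model, device_ids=[local_rank])
```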

File 2 of 4: the English README.

````diff
@@ -221,22 +221,28 @@ git clone https://github.com/jingyaogong/minimind.git
 ## Test Pre-trained Model
-### 1. Download the Model
+### 1. Environment Setup
+```bash
+pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
+```
+### 2. Download the Model
 ```bash
-# step 1
 git clone https://huggingface.co/jingyaogong/MiniMind2
 ```
-### 2. Command-line Q&A
+### 3. Command-line Q&A
 ```bash
-# step 2
-# load=1: load from transformers-hf model
+# load=0: load from pytorch model, load=1: load from transformers-hf model
 python eval_model.py --load 1
 ```
-### 3. Or Start WebUI
+### 4. Or Start WebUI
 ```bash
 # You may need `python>=3.10` and install `pip install streamlit`.
````
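Since the `--load 1` path implies the cloned MiniMind2 directory loads through transformers, a short usage sketch follows. The `trust_remote_code` flag and the chat-template call are assumptions about how the published weights are packaged:

```python
# Sketch: chat with the cloned MiniMind2 weights via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./MiniMind2")
model = AutoModelForCausalLM.from_pretrained("./MiniMind2", trust_remote_code=True)

# build a single-turn chat prompt (assumes the tokenizer ships a chat template)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello, please introduce yourself."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```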
````diff
@@ -347,27 +353,30 @@ SFT-Chat model, 2: RLHF-Chat model, 3: Reason model.
 Start training with N GPUs on a single machine (DDP, supports multi-node, multi-GPU clusters):
 ```bash
-torchrun --nproc_per_node 3 train_xxx.py
+torchrun --nproc_per_node N train_xxx.py
 ```
 <details style="color:rgb(128,128,128)">
 <summary>Note: Others</summary>
-* Start training with N GPUs on a single machine (DeepSpeed):
+Start training with N GPUs on a single machine (DeepSpeed):
 ```bash
 deepspeed --master_port 29500 --num_gpus=N train_xxx.py
 ```
-* Enable wandb to record the training process if needed:
+Enable wandb to record the training process if needed:
 ```bash
 # Need to log in: wandb login
 torchrun --nproc_per_node N train_xxx.py --use_wandb
 # and
 python train_xxx.py --use_wandb
 ```
 By adding the `--use_wandb` parameter, the training process will be recorded, and after training, you can view the process on the wandb website. Modify the `wandb_project` and `wandb_run_name` parameters to specify project and run names.
 </details>
````
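As a minimal sketch of how such flags are usually wired together (the project and run-name defaults below are illustrative, not minimind's actual values):

```python
# Minimal sketch: gating wandb logging behind a --use_wandb flag.
import argparse

import wandb

parser = argparse.ArgumentParser()
parser.add_argument("--use_wandb", action="store_true")
parser.add_argument("--wandb_project", type=str, default="MiniMind")
parser.add_argument("--wandb_run_name", type=str, default="full-sft")
args = parser.parse_args()

run = None
if args.use_wandb:
    # requires a prior `wandb login`
    run = wandb.init(project=args.wandb_project, name=args.wandb_run_name)

for step in range(100):  # stand-in for the real training loop
    loss = 1.0 / (step + 1)
    if run is not None:
        run.log({"loss": loss}, step=step)
```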

File 3 of 4: the full SFT training script.

````diff
@@ -123,7 +123,7 @@ if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="MiniMind Full SFT")
     parser.add_argument("--out_dir", type=str, default="out")
     parser.add_argument("--epochs", type=int, default=6)
-    parser.add_argument("--batch_size", type=int, default=128)
+    parser.add_argument("--batch_size", type=int, default=32)
     parser.add_argument("--learning_rate", type=float, default=5e-5)
     parser.add_argument("--device", type=str, default="cuda:0" if torch.cuda.is_available() else "cpu")
     parser.add_argument("--dtype", type=str, default="bfloat16")
````

File 4 of 4: the reasoning ("zero") training script.

````diff
@@ -120,7 +120,7 @@ if __name__ == "__main__":
     parser.add_argument("--out_dir", type=str, default="out")
     # To reach a "zero"-style model as fast as possible, set epochs to 1; otherwise use the limited data for 2~6 epochs.
     parser.add_argument("--epochs", type=int, default=1)
-    parser.add_argument("--batch_size", type=int, default=128)
+    parser.add_argument("--batch_size", type=int, default=32)
     parser.add_argument("--learning_rate", type=float, default=5e-4)
     parser.add_argument("--device", type=str, default="cuda:0" if torch.cuda.is_available() else "cpu")
     parser.add_argument("--dtype", type=str, default="bfloat16")
````