Model | Size | Default module | Template |
---|---|---|---|
Baichuan2 | 7B/13B | W_pack | baichuan2 |
BLOOM | 560M/1.1B/1.7B/3B/7.1B/176B | query_key_value | - |
BLOOMZ | 560M/1.1B/1.7B/3B/7.1B/176B | query_key_value | - |
ChatGLM3 | 6B | query_key_value | chatglm3 |
DeepSeek (MoE) | 7B/16B/67B | q_proj,v_proj | deepseek |
Falcon | 7B/40B/180B | query_key_value | falcon |
Gemma | 2B/7B | q_proj,v_proj | gemma |
InternLM2 | 7B/20B | wqkv | intern2 |
LLaMA | 7B/13B/33B/65B | q_proj,v_proj | - |
LLaMA-2 | 7B/13B/70B | q_proj,v_proj | llama2 |
Mistral | 7B | q_proj,v_proj | mistral |
Mixtral | 8x7B | q_proj,v_proj | mistral |
Phi-1.5/2 | 1.3B/2.7B | q_proj,v_proj | - |
Qwen | 1.8B/7B/14B/72B | c_attn | qwen |
Qwen1.5 | 0.5B/1.8B/4B/7B/14B/72B | q_proj,v_proj | qwen |
XVERSE | 7B/13B/65B | q_proj,v_proj | xverse |
Yi | 6B/34B | q_proj,v_proj | yi |
Yuan | 2B/51B/102B | q_proj,v_proj | yuan |
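To use this table when writing your own commands, the "Default module" column is the value for `--lora_target` and the "Template" column is the value for `--template` (the commands below use `default` for base models marked `-`). As a rough illustration only (the model path, dataset, and output directory are placeholders, not taken from this article), an SFT run on Baichuan2 would look roughly like this:

```bash
# Sketch only: --lora_target and --template come from the table above;
# the model path, dataset, and output directory are placeholder values.
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path path_to_baichuan2_model \
    --dataset alpaca_gpt4_zh \
    --template baichuan2 \
    --finetuning_type lora \
    --lora_target W_pack \
    --output_dir path_to_sft_checkpoint \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --fp16
```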
[!IMPORTANT]
If you are training the model on multiple GPUs, please skip ahead to the multi-GPU distributed training section.
Pre-training:

```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage pt \
    --do_train \
    --model_name_or_path path_to_llama_model \
    --dataset wiki_demo \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_pt_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16
```
Supervised fine-tuning:

```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path path_to_llama_model \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16
```
Reward model training:

```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage rm \
    --do_train \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_sft_checkpoint \
    --create_new_adapter \
    --dataset comparison_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_rm_checkpoint \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-6 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16
```
PPO training:

```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage ppo \
    --do_train \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_sft_checkpoint \
    --create_new_adapter \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --reward_model path_to_rm_checkpoint \
    --output_dir path_to_ppo_checkpoint \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --top_k 0 \
    --top_p 0.9 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16
```
These commands cover the different single-GPU training stages: pre-training, supervised instruction fine-tuning, reward model training, and PPO training. The `--stage` argument selects the stage: `pt` is pre-training, `sft` is supervised instruction fine-tuning, `rm` is reward model training, and `ppo` is PPO training.

For example, pre-training qwen/Qwen-14B with LoRA on a single GPU:

```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage pt \
    --do_train \
    --model_name_or_path qwen/Qwen-14B \
    --dataset wiki_demo \
    --finetuning_type lora \
    --lora_target c_attn \
    --output_dir path_to_pt_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16
```
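After pre-training finishes, the resulting LoRA adapter can be reused in later stages via `--adapter_name_or_path`, the same pattern the rm and ppo commands above use with the SFT checkpoint. A hedged sketch (the dataset choice and hyperparameters here are illustrative, not from this article):

```bash
# Sketch: continue with SFT on top of the pre-trained LoRA adapter.
# --adapter_name_or_path loads the adapter produced by the pre-training run;
# --create_new_adapter trains a fresh adapter on top of it
# (same pattern as the rm/ppo commands above).
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path qwen/Qwen-14B \
    --adapter_name_or_path path_to_pt_checkpoint \
    --create_new_adapter \
    --dataset alpaca_gpt4_zh \
    --template qwen \
    --finetuning_type lora \
    --lora_target c_attn \
    --output_dir path_to_sft_checkpoint \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --fp16
```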
Here we can see that LLaMA Factory's pre-training is also LoRA-based.
GPU memory usage: 38 GB.
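If 38 GB is more memory than your GPU has, a common way to shrink it is QLoRA-style training, i.e. loading the base model quantized while still training LoRA adapters. A hedged sketch, assuming the `--quantization_bit` flag of train_bash.py from this era and an installed bitsandbytes:

```bash
# Sketch: the same pre-training command with the base model loaded in 4-bit (QLoRA).
# Assumes train_bash.py supports --quantization_bit and bitsandbytes is installed.
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage pt \
    --do_train \
    --model_name_or_path qwen/Qwen-14B \
    --dataset wiki_demo \
    --finetuning_type lora \
    --lora_target c_attn \
    --quantization_bit 4 \
    --output_dir path_to_pt_checkpoint \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --fp16
```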
Next, let's try multi-GPU LoRA pre-training of qwen/Qwen-14B.
First, configure accelerate. The only command you type is `accelerate config`; everything after that is answering its interactive prompts.
```text
accelerate config
In which compute environment are you running? This machine
Which type of machine are you using? multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: yes
Do you wish to optimize your script with torch dynamo? [yes/NO]: yes
Which dynamo backend would you like to use? tensorrt
Do you want to customize the defaults sent to torch.compile? [yes/NO]:
Do you want to use DeepSpeed? [yes/NO]: NO
Do you want to use FullyShardedDataParallel? [yes/NO]: NO
Do you want to use Megatron-LM ? [yes/NO]: yes
What is the Tensor Parallelism degree/size? [1]: 1
What is the Pipeline Parallelism degree/size? [1]: 1
Do you want to enable selective activation recomputation? [YES/no]: YES
Do you want to use distributed optimizer which shards optimizer state and gradients across data parallel ranks? [YES/no]: YES
What is the gradient clipping value based on global L2 Norm (0 to disable)? [1.0]: 1
How many GPU(s) should be used for distributed training? [1]: 3
Do you wish to use FP16 or BF16 (mixed precision)? bf16
accelerate configuration saved at /home/ca2/.cache/huggingface/accelerate/default_config.yaml
```
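Before launching, it can be worth checking what was actually written. A small sanity check, assuming a standard accelerate install (the config path is the one printed above):

```bash
# Print the saved accelerate configuration and the environment accelerate sees.
cat /home/ca2/.cache/huggingface/accelerate/default_config.yaml
accelerate env
```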
accelerate is now configured for a multi-GPU training environment. A brief summary of what the options above mean:

- torch dynamo is enabled to optimize the PyTorch code, which can improve performance.
- tensorrt is chosen as the dynamo backend; it is typically used in production settings and can produce optimized code.
- FullyShardedDataParallel, a PyTorch library for sharded data-parallel distributed training, is not used.
- Megatron-LM, a PyTorch extension for large-scale language model training, is enabled.
- Make sure the environment variable (CUDA_VISIBLE_DEVICES) is set correctly so that accelerate can detect and use the GPUs you specified.

Then launch the same pre-training job through accelerate:

```bash
accelerate launch src/train_bash.py \
    --stage pt \
    --do_train \
    --model_name_or_path qwen/Qwen-14B \
    --dataset wiki_demo \
    --finetuning_type lora \
    --lora_target c_attn \
    --output_dir path_to_pt_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16
```
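If you later want to pin specific GPUs or change the number of processes without re-running `accelerate config`, `accelerate launch` also accepts overrides on the command line. A hedged sketch (flag names assume a recent accelerate release):

```bash
# Sketch: pin three GPUs and override the process count at launch time
# instead of editing the saved default_config.yaml.
CUDA_VISIBLE_DEVICES=0,1,2 accelerate launch \
    --config_file /home/ca2/.cache/huggingface/accelerate/default_config.yaml \
    --num_processes 3 \
    src/train_bash.py \
    --stage pt \
    --do_train \
    --model_name_or_path qwen/Qwen-14B \
    --dataset wiki_demo \
    --finetuning_type lora \
    --lora_target c_attn \
    --output_dir path_to_pt_checkpoint \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 3.0 \
    --fp16
```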
Training completes successfully.