Base environment
# Create the environment
conda create -n llama_factory python=3.10
conda activate llama_factory

# Pick the CUDA build that matches your setup
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install the fine-tuning toolkit
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -r requirements.txt

# Install the distributed training library
pip install deepspeed
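Before moving on, it is worth confirming that the CUDA build of PyTorch is the one actually installed (a quick sanity check, not part of the original steps):

# Verify that PyTorch was built with CUDA and can see the GPUs
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"
nvidia-smi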
Download the model
# Mirror: https://hf-mirror.com/
conda activate base
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download --local-dir-use-symlinks False 01-ai/Yi-34B-Chat --local-dir Yi-34B-Chat
# If an interrupted download leaves files that fail checksum verification, re-download just those files with --include or --exclude (multiple patterns separated by spaces)
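For example, to re-fetch only the weight shards and JSON files after an interrupted download (the pattern flags are available in recent huggingface_hub releases; the patterns themselves are only illustrative):

# Re-download only the weight shards and config files
huggingface-cli download --resume-download --local-dir-use-symlinks False 01-ai/Yi-34B-Chat \
    --include "*.safetensors" "*.json" \
    --local-dir Yi-34B-Chat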
DeepSpeed configuration

# Reference: https://github.com/hiyouga/LLaMA-Factory/issues/256
vi ds_config_lora.json

{
  "bfloat16": {
    "enabled": false
  },
  "fp16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_fp16_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 1e5,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
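Because the file is written by hand, a quick validity check before launching training can save a failed run (a minimal sanity check, not something from the original workflow):

# Fail fast if the hand-edited config is not valid JSON
python -c "import json; json.load(open('ds_config_lora.json')); print('ds_config_lora.json is valid JSON')"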
Training script
deepspeed --include localhost:0,1,2,3,4,5,6,7 src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path /path/to/model/Yi-34B-Chat/ \
    --dataset alpaca_gpt4_zh \
    --template yi \
    --finetuning_type lora \
    --lora_target k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --output_dir ./yi_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16 \
    --deepspeed "./ds_config_lora.json"
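With these flags the effective global batch size works out to per_device_train_batch_size × gradient_accumulation_steps × number of GPUs = 4 × 4 × 8 = 128. The Transformers/DeepSpeed integration fills this value into the "auto" train_batch_size field of ds_config_lora.json, and train_micro_batch_size_per_gpu resolves to 4.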
ZeRO-1 partitions the optimizer states; ZeRO-2 partitions optimizer states and gradients; ZeRO-3 partitions optimizer states, gradients, and parameters.
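As a rough breakdown, for a model with Ψ parameters trained with mixed-precision Adam on N GPUs, the ZeRO paper (arXiv:1910.02054, also referenced below) puts the per-GPU memory for model states at approximately:

no ZeRO:  2Ψ + 2Ψ + 12Ψ        (fp16 weights + fp16 gradients + fp32 optimizer states)
ZeRO-1:   2Ψ + 2Ψ + 12Ψ/N
ZeRO-2:   2Ψ + 14Ψ/N
ZeRO-3:   16Ψ/N

Only ZeRO-3 makes the parameter copy itself shrink with the number of GPUs, which is why stage 3 is used here to spread the 34B weights across the 8 cards.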
Quantized training
--quantization_bit 4
# ValueError: DeepSpeed ZeRO-3 is incompatible with quantization.
# Note: do not leave comment lines between the argument lines of the launch command; any arguments that follow a comment are silently ignored.
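If 4-bit quantization is needed, a commonly used workaround is to switch the DeepSpeed config from ZeRO-3 to ZeRO-2, which does not partition or offload parameters and is therefore not hit by the error above. A minimal sketch of the changed zero_optimization block, assuming the rest of ds_config_lora.json stays as shown earlier (verify against the LLaMA-Factory and DeepSpeed versions you actually run):

"zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto"
}

The offload_param block and the stage3_* keys apply only to ZeRO-3 and should be removed along with the stage change.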
Since you are offloading both parameters and optimizer state to CPU, you would need roughly 18 bytes per model parameter. That means for a 7B model you would need ~126 GB of CPU memory. Please see page 3 of https://arxiv.org/pdf/1910.02054.pdf for a discussion of the memory breakdown.
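Applying the same rule of thumb to the 34B model used here gives roughly 34 × 10⁹ × 18 bytes ≈ 612 GB of CPU memory for full-parameter fine-tuning with both parameters and optimizer state offloaded. With LoRA, only the small adapter matrices carry Adam optimizer states, so the actual CPU footprint is dominated by the offloaded fp16 base weights, roughly 2 bytes × 34 × 10⁹ ≈ 68 GB, plus a modest amount for the adapters.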
With DeepSpeed ZeRO-3 enabled, the model is partitioned and loaded into GPU memory in shards, so memory usage stays low during both initialization and training.