
Yi-34B Fine-Tuning (yi-34b-chat)


Environment Setup

  • Base environment

# Create the environment
conda create -n llama_factory python=3.10
conda activate llama_factory
# Pick the CUDA build that matches your system
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install the fine-tuning toolkit
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -r requirements.txt
# Install the distributed training acceleration library
pip install deepspeed
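
Before moving on, it can be worth verifying that PyTorch sees the GPUs and that DeepSpeed imports cleanly. A minimal sanity check, assuming it is run inside the llama_factory environment:

# check_env.py -- quick sanity check of the freshly installed environment
import torch
import deepspeed

print("torch", torch.__version__, "| CUDA build", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available(), "| visible GPUs:", torch.cuda.device_count())
print("deepspeed", deepspeed.__version__)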
  • Download the model

# Mirror: https://hf-mirror.com/
conda activate base
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download --local-dir-use-symlinks False 01-ai/Yi-34B-Chat --local-dir Yi-34B-Chat
# If an interrupted download leaves files that fail checksum verification, re-fetch specific files with --include or --exclude (multiple patterns separated by spaces)
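
The same download can also be scripted with huggingface_hub, which resumes partially downloaded files automatically. A sketch; the allow_patterns below are an assumption about which files failed and should be adjusted:

import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # must be set before importing huggingface_hub

from huggingface_hub import snapshot_download

# Re-fetch only selected files from the repository
snapshot_download(
    repo_id="01-ai/Yi-34B-Chat",
    local_dir="Yi-34B-Chat",
    allow_patterns=["*.bin", "*.json", "*.model"],  # assumption: narrow to the files that failed
)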

Training Procedure

# Reference: https://github.com/hiyouga/LLaMA-Factory/issues/256
vi ds_config_lora.json
{
    "bfloat16": {
        "enabled": false
    },
    "fp16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
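
When this config is used through LLaMA-Factory (which launches via the transformers Trainer), every field set to "auto" is filled in from the matching command-line argument, e.g. lr from --learning_rate and train_micro_batch_size_per_gpu from --per_device_train_batch_size. A small sketch that parses the file, confirms the JSON is valid, and lists those fields (assuming it is saved as ds_config_lora.json):

import json

def auto_fields(node, path=""):
    """Recursively yield the dotted path of every "auto" entry."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from auto_fields(value, f"{path}.{key}" if path else key)
    elif node == "auto":
        yield path

with open("ds_config_lora.json") as f:
    config = json.load(f)  # raises if the JSON is malformed

print("\n".join(auto_fields(config)))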
  • Training script

deepspeed --include localhost:0,1,2,3,4,5,6,7 src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path /path/to/model/Yi-34B-Chat/ \
    --dataset alpaca_gpt4_zh \
    --template yi \
    --finetuning_type lora \
    --lora_target k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --output_dir ./yi_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16 \
    --deepspeed "./ds_config_lora.json"
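
For reference, DeepSpeed requires train_batch_size to equal the per-device micro-batch size times the gradient accumulation steps times the number of GPUs, which is what the "auto" entry in the config resolves to for this launch command:

# Global batch size implied by the launch command above
per_device_batch = 4   # --per_device_train_batch_size
grad_accum = 4         # --gradient_accumulation_steps
num_gpus = 8           # --include localhost:0,1,2,3,4,5,6,7
print(per_device_batch * grad_accum * num_gpus)  # 128 -> "train_batch_size"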

ZeRO-1 partitions the optimizer states; ZeRO-2 partitions optimizer states and gradients; ZeRO-3 partitions optimizer states, gradients, and parameters.
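
To see why the stage matters at 34B scale, the ZeRO paper's mixed-precision-Adam accounting gives 2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter; each stage shards one more of these across GPUs. A back-of-the-envelope sketch for full-parameter training (LoRA keeps gradients and optimizer state only for the small adapter, so its footprint is far lower):

def gib_per_gpu(stage, num_gpus, num_params):
    """Model-state memory per GPU in GiB, ignoring activations."""
    p, g, o = 2, 2, 12  # bytes/param: fp16 params, fp16 grads, fp32 Adam state
    if stage == 1:
        per_param = p + g + o / num_gpus
    elif stage == 2:
        per_param = p + (g + o) / num_gpus
    else:  # stage 3
        per_param = (p + g + o) / num_gpus
    return per_param * num_params / 2**30

for stage in (1, 2, 3):
    print(f"ZeRO-{stage}: ~{gib_per_gpu(stage, 8, 34e9):.0f} GiB per GPU")
# ZeRO-1: ~174 GiB, ZeRO-2: ~119 GiB, ZeRO-3: ~63 GiB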

  • Quantized training

--quantization_bit 4
# Appending this flag to the training script enables 4-bit (QLoRA) training, but it fails under the ZeRO-3 config above:
# ValueError: DeepSpeed ZeRO-3 is incompatible with quantization.
# Note: do not leave comments between the backslash-continued argument lines of the launch command; every argument after such a comment is dropped
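
A common workaround for the ValueError is to run the quantized job under ZeRO-2 instead. A sketch that derives a ZeRO-2 variant from the config above (the output file name is an assumption); pass it via --deepspeed together with --quantization_bit 4:

import json

with open("ds_config_lora.json") as f:
    config = json.load(f)

zero = config["zero_optimization"]
zero["stage"] = 2
zero.pop("offload_param", None)  # parameter offload exists only in ZeRO-3
for key in [k for k in list(zero) if k.startswith("stage3_")] + ["sub_group_size"]:
    zero.pop(key, None)          # drop the ZeRO-3-specific settings

with open("ds_config_lora_zero2.json", "w") as f:
    json.dump(config, f, indent=4)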

Common Issues

Extremely high host-memory usage

"Since you are offloading both parameters and optimizer state to CPU you would need roughly 18 bytes per model parameter. That means for a 7B model you would need ~126GB of CPU memory. Please see page 3 of https://arxiv.org/pdf/1910.02054.pdf for a discussion of the memory breakdown."

Reference: "How to calculate the cpu memory required for DeepSpeedZeRoOffload initialization?" (microsoft/DeepSpeed issue #3606 on GitHub)
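
Plugging the figure from that issue into the config above: with both parameters and optimizer state offloaded to CPU, roughly 18 bytes of host RAM are needed per parameter, which becomes prohibitive at 34B scale:

# ~18 bytes of CPU memory per parameter when offloading params + optimizer state
for name, num_params in [("7B", 7e9), ("Yi-34B", 34e9)]:
    print(f"{name}: ~{18 * num_params / 1e9:.0f} GB host RAM")
# 7B -> ~126 GB (matches the issue), Yi-34B -> ~612 GB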

With DeepSpeed ZeRO-3 enabled, the model can instead be sharded and loaded directly into GPU memory, keeping host-memory usage low throughout initialization and training.
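
Concretely, that means removing the two offload sections from the config so ZeRO-3 shards the model across GPUs rather than spilling it to host RAM; per-GPU memory rises accordingly. A sketch (file names are assumptions):

import json

with open("ds_config_lora.json") as f:
    config = json.load(f)

# Keep ZeRO-3 sharding but drop CPU offload to keep host-memory usage low
config["zero_optimization"].pop("offload_optimizer", None)
config["zero_optimization"].pop("offload_param", None)

with open("ds_config_lora_no_offload.json", "w") as f:
    json.dump(config, f, indent=4)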

