赞
踩
ZeRO
论文:《ZeRO:Memory Optimizations Toward Training Trillion Parameter Models》ZeRO-Offload
论文:《ZeRO-Offload:Democratizing Billion-Scale Model Training.》NVMe
技术论文:《 ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning》- DeepSpeed Github、DeepSpeed 官网
- 《Hugging Face高效训练技术二:大模型分布式训练策略——ZeRO、FSDP》、《Hugging Face高效训练技术三:huggingface DeepSpeed文档》
ZeRO
(Zero Redundancy Optimizer)是一种用于优化大规模深度学习模型训练的技术。它的主要目标是降低训练期间的内存占用、通信开销和计算负载,从而使用户能够训练更大的模型并更高效地利用硬件资源。
ZERO
论文首先分析了模型训练中内存主要消耗在两个方面:
model states
:模型状态,包括包括优化器参数(例如Adam的动量和方差)、梯度、模型参数residual states
:剩余状态,包括包括激活函数、临时缓冲区、内存碎片参数解释:
Baseline
:未优化的基线Ψ
:模型大小,上图假设模型参数为Ψ=75亿K
:存储优化器状态要消耗的内存倍数,对于混合精度的Adam优化器而言,K=12
- N d N_d Nd:数据并行度。基于Adam优化器的混合精度训练,数据并行度为Nd=64(即64个GPU)
ZeRO
分别使用ZeRO-DP
和ZeRO-R
来优化model states
和residual states
。如上图所示,ZeRO-DP
包括三个阶段。
ZeRO-Infinity
是ZeRO
的一个扩展版本,它允许将模型参数存储在CPU内存或NVMe存储上,而不是全部存储在GPU内存中,最终在有限资源下能够训练前所未有规模的模型(在单个NVIDIA DGX-2节点上微调具有1万亿参数的模型),而无需对模型代码进行重构。
NVMe
协议是专门为固态硬盘设计的,以满足高性能、低延迟和并发读写的需求。相较于SATA和AHCI,NVMe
具有更高的数据传输速率和更低的通信延迟。
DeepSpeed实现了上述ZeRO论文中描述的所有内容。目前,它完全支持以下功能:
详细原理可参考ZeRO论文,或者我之前的帖子《Hugging Face高效训练技术二:大模型分布式训练策略——ZeRO、FSDP》
- 本章参考《 DeepSpeed(Accelerate)》
- 有关Accelerate库的使用,可参考我的博客《Accelerate 0.24.0文档 一:两万字极速入门》
DeepSpeed ZeRO-2主要仅用于训练,因为推理时不需要优化器和梯度。DeepSpeed ZeRO-3也可用于推断,因为它允许将庞大的模型加载到多个GPU上(参数分区)。
通过2种选项,Accelerate集成了DeepSpeed:
通过在 accelerate config
中使用deepspeed config file
集成DeepSpeed功能。您只需提供自定义配置文件或使用我们的模板。这支持DeepSpeed的所有核心功能,并为用户提供了很大的灵活性,用户只需要根据配置做一些更改。
通过deepspeed_plugin
集成。这种集成方式支持DeepSpeed的一部分功能,对于其余配置使用默认选项。这种方式无需进行复杂的代码修改,集成更加简便,适用于对DeepSpeed的大多数默认设置满意的用户。
无论哪种方式,使用前先安装DeepSpeed,安装方式有两种:pip安装和本地构建。
pip安装
# 两种方式任选其一
pip install deepspeed # 安装deepspeed库
pip install transformers[deepspeed] # 通过transformers的extras选项安装
本地构建
pip安装通常会使用默认配置,适合绝大多数用户。如果您需要自定义DeepSpeed的配置,比如修改全局配置文件或在代码中进行相应的配置更改,可以克隆DeepSpeed项目到本地来自定义构建。
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build # 移除旧的构建目录
# 针对所需GPU架构进行本地构建:(需替换相应GPU架构)
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
--global-option="build_ext" --global-option="-j8" --no-cache -v \
--disable-pip-version-check 2>&1 | tee build.log
有关安装部分更详细内容,请查看官方文档。
首先运行accelerate config
,这将启动一个配置向导,询问您是否要使用DeepSpeed的配置文件。您应该回答"no",然后继续回答后续问题,以生成一个基本的DeepSpeed配置(包含一系列默认选项)。
运行以下命令,使用生成的DeepSpeed配置文件(yaml格式)启动训练脚本:
accelerate launch my_script.py --args_to_my_script
下面介绍使用 DeepSpeed Plugin
运行 NLP的示例(examples/nlp_example.py)
accelerate配置文件:
compute_environment: LOCAL_MACHINE deepspeed_config: gradient_accumulation_steps: 1 gradient_clipping: 1.0 offload_optimizer_device: none offload_param_device: none zero3_init_flag: true zero_stage: 2 distributed_type: DEEPSPEED fsdp_config: {} machine_rank: 0 main_process_ip: null main_process_port: null main_training_function: main mixed_precision: fp16 num_machines: 1 num_processes: 2 use_cpu: false
启动:
accelerate launch examples/nlp_example.py --mixed_precision fp16
accelerate 配置文件:
compute_environment: LOCAL_MACHINE deepspeed_config: gradient_accumulation_steps: 1 gradient_clipping: 1.0 offload_optimizer_device: cpu offload_param_device: cpu zero3_init_flag: true zero3_save_16bit_model: true zero_stage: 3 distributed_type: DEEPSPEED fsdp_config: {} machine_rank: 0 main_process_ip: null main_process_port: null main_training_function: main mixed_precision: fp16 num_machines: 1 num_processes: 2 use_cpu: false
启动:
accelerate launch examples/nlp_example.py --mixed_precision fp16
Accelerate 支持通过CLI进行以下配置:
`zero_stage`: [0] Disabled, [1] optimizer state partitioning, [2] optimizer+gradient state partitioning and [3] optimizer+gradient+parameter partitioning
`gradient_accumulation_steps`: Number of training steps to accumulate gradients before averaging and applying them.
`gradient_clipping`: Enable gradient clipping with value.
`offload_optimizer_device`: [none] Disable optimizer offloading, [cpu] offload optimizer to CPU, [nvme] offload optimizer to NVMe SSD. Only applicable with ZeRO >= Stage-2.
`offload_param_device`: [none] Disable parameter offloading, [cpu] offload parameters to CPU, [nvme] offload parameters to NVMe SSD. Only applicable with ZeRO Stage-3.
`zero3_init_flag`: Decides whether to enable `deepspeed.zero.Init` for constructing massive models. Only applicable with ZeRO Stage-3.
`zero3_save_16bit_model`: Decides whether to save 16-bit model weights when using ZeRO Stage-3.
`mixed_precision`: `no` for FP32 training, `fp16` for FP16 mixed-precision training and `bf16` for BF16 mixed-precision training.
为了能更灵活的配置DeepSpeed功能,推荐使用DeepSpeed config file。运行accelerate config
,这将启动一个配置向导,询问您是否要使用DeepSpeed的配置文件。您应该回答"yes"并提供 deepspeed 配置文件的路径,回答完毕后,这将生成一个配置文件来正确设置默认选项。
accelerate launch my_script.py --args_to_my_script
下面介绍如何使用 DeepSpeed 配置文件运行 NLP 示例(examples/by_feature/deepspeed_with_config_support.py )
compute_environment: LOCAL_MACHINE
deepspeed_config:
deepspeed_config_file: /home/ubuntu/accelerate/examples/configs/deepspeed_config_templates/zero_stage2_config.json
zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
{ "fp16": { "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "weight_decay": "auto", "torch_adam": true, "adam_w_mode": true } }, "scheduler": { "type": "WarmupDecayLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto", "total_num_steps": "auto" } }, "zero_optimization": { "stage": 2, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": "auto", "contiguous_gradients": true }, "gradient_accumulation_steps": 1, "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }
accelerate launch examples/by_feature/deepspeed_with_config_support.py \
--config_name "gpt2-large" \
--tokenizer_name "gpt2-large" \
--dataset_name "wikitext" \
--dataset_config_name "wikitext-2-raw-v1" \
--block_size 128 \
--output_dir "./clm/clm_deepspeed_stage2_accelerate" \
--learning_rate 5e-4 \
--per_device_train_batch_size 24 \
--per_device_eval_batch_size 24 \
--num_train_epochs 3 \
--with_tracking \
--report_to "wandb"\
compute_environment: LOCAL_MACHINE
deepspeed_config:
deepspeed_config_file: /home/ubuntu/accelerate/examples/configs/deepspeed_config_templates/zero_stage3_offload_config.json
zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
{ "fp16": { "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupDecayLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto", "total_num_steps": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "sub_group_size": 1e9, "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": "auto" }, "gradient_accumulation_steps": 1, "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }
accelerate launch examples/by_feature/deepspeed_with_config_support.py \
--config_name "gpt2-large" \
--tokenizer_name "gpt2-large" \
--dataset_name "wikitext" \
--dataset_config_name "wikitext-2-raw-v1" \
--block_size 128 \
--output_dir "./clm/clm_deepspeed_stage3_offload_accelerate" \
--learning_rate 5e-4 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 32 \
--num_train_epochs 3 \
--with_tracking \
--report_to "wandb"\
在《Hugging Face高效训练技术三:huggingface DeepSpeed文档(Trainer)》6.2节优化器章节介绍过,当不启用offload_optimizer(优化器卸载)功能时,可以混合使用HF和DS的优化器和迭代器,组合情况如下:(只是不能同时使用deepspeed 优化器和huggingface的调度器)
组合 | HF Scheduler | DS Scheduler |
---|---|---|
HF Optimizer | Yes | Yes |
DS Optimizer | No | Yes |
DeepSpeed 的主要优化器是 Adam、AdamW、OneBitAdam 和 Lamb。 这些已通过 ZeRO 进行了彻底测试,建议使用。但如果您有特殊需求,也可以从 torch 导入其他优化器,详见文档。
本文介绍的是Accelerate集成DeepSpeed 的情况,情况也是类似:
accelerate.utils.DummyOptim
和 accelerate.utils.DummyScheduler
来替换代码中的 PyTorch/自定义优化器和调度器。以下是 examples/by_feature/deepspeed_with_config_support.py 中显示这一情况的代码片段:from accelerate.utils import DummyOptim, DummyScheduler, set_seed ... # 如果 config file定义了Optimizer则创建Dummy Optimizer,否则创建Adam Optimizer optimizer_cls = ( torch.optim.AdamW if accelerator.state.deepspeed_plugin is None or "optimizer" not in accelerator.state.deepspeed_plugin.deepspeed_config else DummyOptim ) optimizer = optimizer_cls(optimizer_grouped_parameters, lr=args.learning_rate) # 如果 config file定义了scheduler则创建Dummy Scheduler,否则创建`args.lr_scheduler_type` Scheduler if ( accelerator.state.deepspeed_plugin is None or "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config ): lr_scheduler = get_scheduler( name=args.lr_scheduler_type, optimizer=optimizer, num_warmup_steps=args.num_warmup_steps, num_training_steps=args.max_train_steps, ) else: lr_scheduler = DummyScheduler( optimizer, total_num_steps=args.max_train_steps, warmup_num_steps=args.num_warmup_steps ) model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare( model, optimizer, train_dataloader, eval_dataloader, lr_scheduler) ...
Custom Optim + Custom Scheduler
:DeepSpeed 配置文件中没有定义optimizer 和 scheduler,则代码保持不变,通过 DeepSpeed Plugin使用集成时就是这种情况。Custom Optim + DS Scheduler
:DeepSpeed 配置文件中仅定义 scheduler的情况,此时,用户必须使用 accelerate.utils.DummyScheduler
替换代码中的 PyTorch/Custom Scheduler。DS Optim + Custom Scheduler
:DeepSpeed 配置文件中仅定义 optimizer 的情况,此时会报错。上述内容展开来说就是:
DeepSpeed Config File
中定义了优化器,就会被替换为DummyOptim
。DummyOptim
对象进行替换。DummyOptim
是一个虚拟的优化器对象,不执行任何实际的优化运算,仅作为一个占位符存在,所以不会对模型参数执行优化。DummyScheduler
是虚拟的学习率调度器,其用法一样Accelerate
不会自动帮用户代码中原有的Optimizer
和 Scheduler
替换成 Dummy,需要用户自己手动替换,比如:import accelerate
# 原优化器
optimizer = torch.optim.Adam(model.parameters())
optimizer = accelerate.DistributedOptimizer(optimizer, name='adam')
from accelerate.utils import DummyOptim
# 使用 DummyOptim 替换 Accelerate 封装的优化器
optimizer = DummyOptim()
另外需要注意,上述示例的DeepSpeed配置文件中的 auto 值,会根据传递给 prepare 方法的模型、数据加载器、虚拟优化器和虚拟调度器自动处理,其余的必须由用户显式指定。
在使用 deepspeed_config_file 时,如果某些变量在 accelerate 中也进行了配置,那么这些变量的配置可能会发生冲突或重复,为了避免这种情况,最好将它们全部配置在 deepspeed_config_file 中。以下是可能会产生冲突的变量列表:
gradient_accumulation_steps
gradient_clipping
zero_stage
offload_optimizer_device
offload_param_device
zero3_save_16bit_model
mixed_precision
下面是一个更具体的例子:
from accelerate import Accelerator
from accelerate.state import AcceleratorState
def main():
accelerator = Accelerator()
accelerator.print(f"{AcceleratorState()}")
if __name__ == "__main__":
main()
command_file: null commands: null compute_environment: LOCAL_MACHINE deepspeed_config: gradient_accumulation_steps: 1 gradient_clipping: 1.0 offload_optimizer_device: 'cpu' offload_param_device: 'cpu' zero3_init_flag: true zero3_save_16bit_model: true zero_stage: 3 deepspeed_config_file: 'ds_config.json' distributed_type: DEEPSPEED downcast_bf16: 'no' dynamo_backend: 'NO' fsdp_config: {} gpu_ids: null machine_rank: 0 main_process_ip: null main_process_port: null main_training_function: main megatron_lm_config: {} num_machines: 1 num_processes: 2 rdzv_backend: static same_network: true tpu_name: null tpu_zone: null use_cpu: false
{ "bf16": { "enabled": true }, "zero_optimization": { "stage": 3, "stage3_gather_16bit_weights_on_model_save": false, "offload_optimizer": { "device": "none" }, "offload_param": { "device": "none" } }, "gradient_clipping": 1.0, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "gradient_accumulation_steps": 10, "steps_per_print": 2000000 }
此时运行accelerate launch test.py
,会产生以下报错:
ValueError: When using `deepspeed_config_file`, the following accelerate config variables will be ignored:
['gradient_accumulation_steps', 'gradient_clipping', 'zero_stage', 'offload_optimizer_device', 'offload_param_device',
'zero3_save_16bit_model', 'mixed_precision'].
Please specify them appropriately in the DeepSpeed config file.
If you are using an accelerate config file, remove others config variables mentioned in the above specified list.
The easiest method is to create a new config following the questionnaire via `accelerate config`.
It will only ask for the necessary config variables when using `deepspeed_config_file`.
为了解决这个问题,建议通过运行 accelerate config
来创建一个新的配置文件。此命令将通过一系列问题询问用户,仅在使用 deepspeed_config_file
时要求用户提供必要的配置变量,以确保配置的一致性。
$ accelerate config
-------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
-------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]:
Do you wish to optimize your script with torch dynamo?[yes/NO]:
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: yes
Please enter the path to the json DeepSpeed config file: ds_config.json
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: yes
How many GPU(s) should be used for distributed training? [1]:4
accelerate configuration saved at ds_config_sample.yaml
accelerate config:
compute_environment: LOCAL_MACHINE deepspeed_config: deepspeed_config_file: ds_config.json zero3_init_flag: true distributed_type: DEEPSPEED downcast_bf16: 'no' dynamo_backend: 'NO' fsdp_config: {} machine_rank: 0 main_training_function: main megatron_lm_config: {} num_machines: 1 num_processes: 4 rdzv_backend: static same_network: true use_cpu: false
此时运行accelerate launch test.py
,会输出:
Distributed environment: DEEPSPEED Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: bf16
ds_config: {'bf16': {'enabled': True}, 'zero_optimization': {'stage': 3, 'stage3_gather_16bit_weights_on_model_save': False, 'offload_optimizer': {'device': 'none'}, 'offload_param': {'device': 'none'}}, 'gradient_clipping': 1.0, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 10, 'steps_per_print': inf, 'fp16': {'enabled': False}}
{ "bf16": { "enabled": "auto" }, "zero_optimization": { "stage": "auto", "stage3_gather_16bit_weights_on_model_save": "auto", "offload_optimizer": { "device": "auto" }, "offload_param": { "device": "auto" } }, "gradient_clipping": "auto", "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "gradient_accumulation_steps": "auto", "steps_per_print": 2000000 }
accelerate launch test.py
--mixed_precision="fp16" \
--zero_stage=3 \
--gradient_accumulation_steps=5 \
--gradient_clipping=1.0 \
--offload_param_device="cpu" \
--offload_optimizer_device="nvme" \
--zero3_save_16bit_model="true" \
Distributed environment: DEEPSPEED Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
ds_config: {'bf16': {'enabled': False}, 'zero_optimization': {'stage': 3, 'stage3_gather_16bit_weights_on_model_save': True, 'offload_optimizer': {'device': 'nvme'}, 'offload_param': {'device': 'cpu'}}, 'gradient_clipping': 1.0, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 5, 'steps_per_print': inf, 'fp16': {'enabled': True, 'auto_cast': True}}
注意:
accelerator.prepare()
调用中处理,也就是由代码逻辑自动确定。例如 当gradient_accumulation_steps
被设置为"auto"时,通过Accelerator(gradient_accumulation_steps=k)
创建加速器对象时传入的 gradient_accumulation_steps
参数才会生效在 ZeRO Stage-1 和 Stage-2 下,保存和加载模型的方式不变。在 ZeRO Stage-3 下,由于模型权重被分区到多个GPU上,state_dict 只包含空占位符。Stage-3 有两种保存方式:
model.load_state_dict(torch.load(pytorch_model.bin))
加载。unwrapped_model = accelerator.unwrap_model(model) # 解包模型
# Saves the whole/unpartitioned fp16 model when in ZeRO Stage-3 to the output directory if
# `stage3_gather_16bit_weights_on_model_save` is True in DeepSpeed Config file or
# `zero3_save_16bit_model` is True in DeepSpeed Plugin.
# For Zero Stages 1 and 2, models are saved as usual in the output directory.
# The model name saved is `pytorch_model.bin`
unwrapped_model.save_pretrained(
args.output_dir,
is_main_process=accelerator.is_main_process,
save_function=accelerator.save,
state_dict=accelerator.get_state_dict(model),
)
args.output_dir
为保存目录,is_main_process
表示只在主进程上执行保存,accelerator.save
为保存函数,最后通过accelerator.get_state_dict(model)
获取模型的状态字典。这样就可以保存下来完整的 16 位权重模型,之后直接加载使用。首先通过model.save_checkpoint()
保存检查点。这将在检查点目录下创建分区后的ZeRO模型和优化器,以及zero_to_fp32.py脚本。可以使用该脚本进行脱机合并得到完整的32位权重。
合并脚本简单易用,不需要DeepSpeed的配置文件,还可以在CPU上运行
success = model.save_checkpoint(PATH, ckpt_id, checkpoint_state_dict)
status_msg = "checkpointing: PATH={}, ckpt_id={}".format(PATH, ckpt_id)
if success:
logging.info(f"Success {status_msg}")
else:
logging.warning(f"Failure {status_msg}")
$ cd /path/to/checkpoint_dir
$ ./zero_to_fp32.py . pytorch_model.bin
Processing zero checkpoint at global_step1
Detected checkpoint of type zero stage 3, world_size: 2
Saving fp32 state dict to pytorch_model.bin (total_numel=60506624)
如果只需要得到state_dict,可以使用如下代码:
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
如果要加载此32位权重进行推理,可以使用如下代码:
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
unwrapped_model = accelerator.unwrap_model(model)
fp32_model = load_state_dict_from_zero_checkpoint(unwrapped_model, checkpoint_dir)
这些操作大约需要原始检查点大小2倍的内存
ZeRO Inference 支持 ZeRO stage 3 和 ZeRO-Infinity,它使用与训练相同的 ZeRO 协议,但不使用优化器和学习率调度器,所以只有 ZeRO stage 3 对推理有用。使用时,你只需要按照下面所示准备模型和数据加载器:
model, eval_dataloader = accelerator.prepare(model, eval_dataloader)
注意事项:
最后,请记住,
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。