Model parallelism (MP, https://zhuanlan.zhihu.com/p/366906920): splits the model horizontally or vertically, partitioning the computation and parameters of every layer across multiple devices, which requires heavy communication between layers. It works well inside a single node, where inter-GPU bandwidth is high, but becomes much slower across nodes.
ZeRO: also spreads the work across multiple devices by placing model states and parameters on different devices, but with a much smaller communication volume.
SGD: has no notion of momentum.
Adam optimizer: on top of SGD, Adam adds a first moment (momentum) and a second moment (variance) for every parameter's gradient.
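To make the extra optimizer state concrete, here is a minimal sketch of a single Adam step (a toy illustration, not the torch.optim.Adam implementation; the function name adam_step and the hyperparameters are made up for this example). Note the two buffers m and v, each the same size as the parameter itself; these are exactly the states that ZeRO-1 shards.

import torch

def adam_step(p, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    # First moment (momentum) and second moment (variance) updates.
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    # Bias correction, then the parameter update.
    m_hat = m / (1 - betas[0] ** step)
    v_hat = v / (1 - betas[1] ** step)
    p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)

p = torch.randn(1000)
m, v = torch.zeros_like(p), torch.zeros_like(p)   # two extra buffers per parameter
adam_step(p, torch.randn_like(p), m, v, step=1)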
For storing the model states, the optimization ZeRO uses is sharding: each GPU keeps only 1/N of the model states, so the system as a whole maintains exactly one copy of them.
ZeRO-1: shards the optimizer states
ZeRO-2: shards the optimizer states + gradients
ZeRO-3: shards the optimizer states + gradients + model parameters
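As a rough orientation, the per-GPU model-state memory can be estimated with the formula from the ZeRO paper for mixed-precision Adam training: roughly 2Ψ bytes of fp16 parameters, 2Ψ bytes of fp16 gradients, and 12Ψ bytes of fp32 optimizer state for a model with Ψ parameters. The sketch below is a back-of-the-envelope calculator (the function name is my own; activations, buffers and fragmentation are ignored), evaluated here for a 1.5B-parameter model on 2 GPUs.

def zero_model_state_gb(num_params, num_gpus, stage):
    params, grads, opt_state = 2 * num_params, 2 * num_params, 12 * num_params  # bytes
    if stage == 0:        # plain data parallel: everything replicated on every GPU
        total = params + grads + opt_state
    elif stage == 1:      # ZeRO-1: optimizer state sharded
        total = params + grads + opt_state / num_gpus
    elif stage == 2:      # ZeRO-2: optimizer state + gradients sharded
        total = params + (grads + opt_state) / num_gpus
    else:                 # ZeRO-3: optimizer state + gradients + parameters sharded
        total = (params + grads + opt_state) / num_gpus
    return total / 1024 ** 3

for stage in range(4):
    print(f'stage {stage}: ~{zero_model_state_gb(1.5e9, 2, stage):.1f} GB of model states per GPU')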
1.3.1 ZeRO-1
vs. DDP
Problem addressed: redundancy of the optimizer state.
ZeRO-1 shards neither the model itself nor the gradients; only the optimizer state is sharded.
Animation of the process: https://zhuanlan.zhihu.com/p/394064174 (original article)
Animation link (may expire): https://vdn6.vzuu.com/SD/bd03b0bc-ef95-11eb-8ee1-ce96bf022449.mp4?pkey=AAWvE2ChU9kHMLO_n5M8CJeSDHqfRZRy2dMrPU4eOnfzHOWOaGYoGlxJIEAzhzJT4Fgsk-wW1oESc3ngsFZcCFRM&c=avc.0.0&f=mp4&pu=078babd7&bu=078babd7&expiration=1660109505&v=ks6
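To see what ZeRO-1 changes relative to DDP, here is a single-process simulation of one training step (a sketch only: plain SGD-with-momentum stands in for Adam to keep it short, and Python lists stand in for ranks and collectives). Gradients are still fully replicated after the all-reduce, but each rank keeps and applies only its 1/N slice of the optimizer state.

import torch

world_size, dim = 2, 8
lr, beta = 0.1, 0.9
param = torch.zeros(dim)                                  # full copy on every rank (as in DDP)
grads = [torch.randn(dim) for _ in range(world_size)]     # each rank's local gradient

# Step 1: all-reduce -- every rank ends up with the same averaged gradient (same as DDP).
avg_grad = torch.stack(grads).mean(0)

# Step 2: each rank owns the optimizer state for only its 1/N slice of the parameters.
shard = dim // world_size
momentum = [torch.zeros(shard) for _ in range(world_size)]

for rank in range(world_size):
    sl = slice(rank * shard, (rank + 1) * shard)
    momentum[rank] = beta * momentum[rank] + avg_grad[sl]  # update only the owned state shard
    param[sl] -= lr * momentum[rank]                       # update only the owned parameter slice

# Step 3: in a real run each rank now broadcasts its updated slice so that the
# replicated parameter copies stay identical across ranks.
print(param)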
1.3.2 ZeRO-2
Problem addressed: redundancy of the gradients.
To save further memory by removing gradient redundancy, ZeRO-2 adds gradient sharding, which FairScale calls Sharded Data Parallel (SDP). Compared with ZeRO-1, ZeRO-2 shards the gradients in addition to the optimizer state.
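The gradient side of ZeRO-2 boils down to replacing the all-reduce with a reduce-scatter. A small simulated comparison (no real collectives; lists stand in for ranks):

import torch

world_size, dim = 2, 8
shard = dim // world_size
grads = [torch.randn(dim) for _ in range(world_size)]   # each rank's local gradient

# DDP / ZeRO-1: all-reduce -> every rank keeps the full averaged gradient (dim values each).
full_avg = torch.stack(grads).mean(0)

# ZeRO-2: reduce-scatter -> rank r keeps only slice r of the averaged gradient.
owned = [full_avg[r * shard:(r + 1) * shard].clone() for r in range(world_size)]
print([g.numel() for g in owned])   # dim / world_size gradient values per rank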
1.3.3 ZeRO-3
Problem addressed: sharding of the model parameters.
To save even more memory, ZeRO-3 shards the model parameters as well. As with the two kinds of sharding above, each rank is responsible only for its own slice of the model parameters. Sharding the parameters is feasible mainly because a layer's full parameters are only needed transiently, right around that layer's forward and backward computation, as the sketch below illustrates.
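A single-process sketch of the ZeRO-3 idea (a simulation, not the actual FSDP code path): each rank permanently stores only its shard of a layer's weight, the full weight is materialized by an all-gather right before the layer is used, and the gathered copy is dropped immediately afterwards, so only one layer's full parameters need to be resident at a time.

import torch

world_size = 2
full_weight = torch.randn(4, 4)                           # conceptual full weight of one layer
shards = list(full_weight.flatten().chunk(world_size))    # rank r persistently stores only shards[r]

def forward_with_gather(x):
    # All-gather the shards into a temporary full weight just for this layer's computation...
    w = torch.cat(shards).view_as(full_weight)
    out = x @ w.t()
    # ...then drop the gathered copy; only the local shard stays resident.
    del w
    return out

print(forward_with_gather(torch.randn(1, 4)).shape)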
Question 2: please analyze the communication volume of MP. @方恺齐
https://github.com/pytorch/pytorch/blob/b91ff5e361623685799b8ef725a91b756685a9ae/torch/distributed/fsdp/fully_sharded_data_parallel.py#L462
Code questions:
1. How does FSDP operate on and store the model parameters, and how does this differ from ordinary DDP?
It flattens the parameters into a single tensor and records each parameter's size (see the sketch after question 3 below).
2. In which functions does FSDP shard and reassemble the model parameters?
_sharded_parameters
3. What strategy does FSDP use when the model parameters cannot be split evenly?
Padding.
https://github.com/pytorch/pytorch/blob/e81664449559f95d0b8d0fe57d66544a0ab84fe8/torch/distributed/fsdp/fully_sharded_data_parallel.py#L3237
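The answers to questions 1 and 3 can be illustrated together with a small simulation of the idea (not the actual FSDP code path): flatten all parameters into one 1-D tensor, record each parameter's shape and numel so the views can be rebuilt later, pad the flat tensor so it divides evenly by the world size, and let each rank keep one chunk.

import torch
import torch.nn as nn
import torch.nn.functional as F

world_size = 2
module = nn.Linear(4, 3)   # 4*3 weights + 3 biases = 15 elements, not divisible by 2

# Question 1: flatten every parameter into one 1-D tensor and remember the metadata
# needed to rebuild the original views later.
shapes = [p.shape for p in module.parameters()]
numels = [p.numel() for p in module.parameters()]
flat = torch.cat([p.detach().reshape(-1) for p in module.parameters()])

# Question 3: pad the flat tensor so it splits evenly, then give each rank one chunk.
pad = (-flat.numel()) % world_size
flat_padded = F.pad(flat, (0, pad))
shards = list(flat_padded.chunk(world_size))

print(flat.numel(), pad, [s.numel() for s in shards])   # 15 1 [8, 8]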
import argparse
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from fairscale.optim.oss import OSS
from torch.nn.parallel import DistributedDataParallel as DDP
from fairscale.nn.data_parallel import ShardedDataParallel as SDP
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from torch.cuda.amp import autocast
from torch.cuda.amp import GradScaler

scaler = GradScaler()


class MyModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, inner_dim, hidden_dim, num_choices, nlayers=2):
        super().__init__()
        self.nlayers = nlayers
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.linear = nn.Linear(embed_dim, hidden_dim)
        self.fn = nn.Sequential(nn.Linear(hidden_dim, inner_dim), nn.ReLU(), nn.Linear(inner_dim, hidden_dim))
        self.drop = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_dim, num_choices)

    def forward(self, input_ids):
        embed = self.embed(input_ids)
        v = self.linear(embed)
        v = self.fn(v)
        last_token_hidden = v[:, -1]
        last_token_hidden = self.drop(last_token_hidden)
        logits = self.classifier(last_token_hidden)
        return logits


def initialize_distributed(args):
    """Initialize torch.distributed."""
    # Manually set the device ids.
    device = args.rank % torch.cuda.device_count()
    print(f'rank = {args.rank} || local_rank = {args.local_rank}')
    if args.local_rank is not None:
        device = args.local_rank
    torch.cuda.set_device(device)
    init_method = 'tcp://'
    master_ip = os.getenv('MASTER_ADDR', 'localhost')
    master_port = os.getenv('MASTER_PORT', '6000')
    init_method += master_ip + ':' + master_port
    print(f'init_method = {init_method}')
    dist.init_process_group(backend='nccl', world_size=args.world_size,
                            rank=args.rank, init_method=init_method)
    dist.all_reduce(torch.zeros(1).cuda())


parser = argparse.ArgumentParser(description='PyTorch Transformer Language Model')
parser.add_argument('--autocast', action='store_true', help='Run in pytorch autocast (mixed precision) mode.')
parser.add_argument('--zero3', action='store_true', help='Run in ZeRO-3 mode (FairScale FSDP).')
parser.add_argument('--zero2', action='store_true', help='Run in ZeRO-2 mode (FairScale SDP).')
parser.add_argument('--zero1', action='store_true', help='Run in ZeRO-1 mode (FairScale OSS).')
parser.add_argument('--local_rank', type=int, default=None, help='local rank passed from distributed launcher')
args = parser.parse_args()
args.rank = int(os.getenv('RANK', '0'))
args.world_size = int(os.getenv('WORLD_SIZE', '1'))
initialize_distributed(args)

total_steps = 10000000
batch_size = 1
vocab_size = 20000
data_len = 512
embed_dim = 10000
inner_dim = 10000
hidden_dim = 20000
num_choices = 10000
loss_fn = nn.CrossEntropyLoss()

device = "cuda:{}".format(torch.cuda.current_device())
model = MyModel(vocab_size, embed_dim, inner_dim, hidden_dim, num_choices)
model.to(device)
n_all_param = sum([p.nelement() for p in model.parameters()])
if args.rank == 0:
    print(f'n_all_param: {n_all_param}')
    for k, v in model.named_parameters():
        print(f'rank: {args.rank} --- {k} shape: {v.shape}')

if args.zero1:
    # ZeRO-1: shard only the optimizer state (OSS); the model itself stays in plain DDP.
    base_optimizer_arguments = {'lr': 0.05}
    base_optimizer = torch.optim.Adam
    optimizer = OSS(params=model.parameters(), optim=base_optimizer, **base_optimizer_arguments)
    model = DDP(model)
elif args.zero2:
    # ZeRO-2: shard optimizer state + gradients (SDP wraps the model around the OSS optimizer).
    base_optimizer_arguments = {'lr': 0.05}
    base_optimizer = torch.optim.Adam
    optimizer = OSS(params=model.parameters(), optim=base_optimizer, **base_optimizer_arguments)
    model = SDP(model, optimizer)
    if args.autocast:
        from fairscale.optim.grad_scaler import ShardedGradScaler
        scaler = ShardedGradScaler()
elif args.zero3:
    # ZeRO-3: shard optimizer state + gradients + parameters (FSDP).
    # Wrap with FSDP first: FSDP flattens and shards the parameters, so the
    # optimizer must be built from the wrapped model's parameters.
    print(f'zero3 ----')
    model = FSDP(model, mixed_precision=True)
    optimizer = optim.Adam(model.parameters(), lr=0.05)
else:
    # Baseline: plain DDP, nothing is sharded.
    optimizer = optim.Adam(model.parameters(), lr=0.05)
    model = DDP(model)

if args.rank == 0:
    for k, v in model.named_parameters():
        print(f'rank: {args.rank} --- {k} shape: {v.shape}')

for step in range(total_steps):
    model.zero_grad()
    data = torch.randint(vocab_size, (batch_size, data_len)).to(device)
    labels = torch.randint(2, [batch_size]).to(device)
    if args.autocast:
        with autocast():
            logits = model(data)
            loss = loss_fn(logits, labels)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        scaler.step(optimizer)
        scaler.update()
    else:
        logits = model(data)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
    if args.rank == 0:
        print(f'step: {step} loss: {loss.item()}')
Run the script above:
CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=3 python -W ignore -m torch.distributed.launch --nproc_per_node 2 --master_addr 127.0.0.1 --master_port 8927 fairscale_fsdp.py
Results:
Optional question:
For GPT-2 (1.5B parameters) under ZeRO-1, ZeRO-2 and ZeRO-3 with 2 GPUs, how much memory does each GPU use?
ZeRO-1
from fairscale.optim.oss import OSS
from torch.nn.parallel import DistributedDataParallel as DDP
base_optimizer_arguments = {'lr':0.05}
base_optimizer = torch.optim.Adam
optimizer = OSS(
    params=model.parameters(),
    optim=base_optimizer,
    **base_optimizer_arguments)
model = DDP(model)
ZeRO-2
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as SDP
base_optimizer_arguments = {'lr':0.05}
base_optimizer = torch.optim.Adam
optimizer = OSS(
    params=model.parameters(),
    optim=base_optimizer,
    **base_optimizer_arguments)
model = SDP(model, optimizer)
ZeRO-3
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
import torch.optim as optim
# Wrap with FSDP first: FSDP flattens and shards the parameters, so the optimizer
# must be built from the wrapped model's parameters.
model = FSDP(model, mixed_precision=True)
optimizer = optim.Adam(model.parameters(), lr=0.05)
The experiments below are based on the experiment code in Section 1.6.1 above.
Wrapping the whole model in a single FSDP instance:
model = FSDP(model)
Wrapping each submodule in its own FSDP instance:
model.embed = FSDP(model.embed)
model.linear = FSDP(model.linear)
model.fn = FSDP(model.fn)
model.classifier = FSDP(model.classifier)
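The manual per-module wrapping above can be generalized with a small helper (a sketch of my own, not a FairScale API; FairScale also ships auto-wrapping utilities whose exact interface is not reproduced here). It wraps every direct child that actually has trainable parameters in its own FSDP instance.

from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

def wrap_children_in_fsdp(model, **fsdp_kwargs):
    # Wrap every direct child that has trainable parameters in its own FSDP instance;
    # parameter-less modules such as Dropout are left untouched.
    for name, child in model.named_children():
        if any(p.requires_grad for p in child.parameters()):
            setattr(model, name, FSDP(child, **fsdp_kwargs))
    return model

# For MyModel this is equivalent to the four manual lines above:
# model = wrap_children_in_fsdp(model, mixed_precision=True)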
Optional question: how can the full (unsharded) model parameters be restored?
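One conceptual answer, continuing the flatten/pad/chunk sketch from the code questions above (the names shards, numels, shapes and pad refer to that sketch): gather all the chunks, strip the padding, then split the flat tensor back into the recorded shapes. Real FSDP provides its own mechanisms for materializing full parameters; the snippet below only illustrates the bookkeeping.

import torch

def unshard(shards, numels, shapes, pad):
    # Concatenate the per-rank chunks (in a real run this would be an all-gather),
    # drop the padding, then split and reshape back into the original parameters.
    flat = torch.cat(shards)
    if pad:
        flat = flat[:-pad]
    return [piece.view(shape) for piece, shape in zip(flat.split(numels), shapes)]

# Continuing the flatten/pad/chunk sketch above:
# full_params = unshard(shards, numels, shapes, pad)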