
DeepSpeed with the Hugging Face Trainer: Multi-Node Distributed Training

So far my work has only used single-node multi-GPU fine-tuning. To improve training efficiency, this post experiments with multi-node, multi-GPU distributed training.

I. Environment Preparation

This experiment uses two machines (manager and worker) running Ubuntu 22.04, each with 4 GPUs.

To keep the installation and configuration identical on both machines, Docker containers are used; installing Docker itself is not covered here.

1. Network configuration: create a shared overlay network

Initialize the swarm by running the following on the manager machine:

docker swarm init
# Output:
Swarm initialized: current node (k4ehuhg4a2umpjoo7yovy1caf) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-686yitjn5p5twd3b3pzezqofd8dlk1wm6juqo3xb5bj4xzztvh-15obj4grc8p8mqul8qvmupkdi 192.168.11.11:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.

Join the swarm by running the following on the worker machine:

docker swarm join --token SWMTKN-1-686yitjn5p5twd3b3pzezqofd8dlk1wm6juqo3xb5bj4xzztvh-15obj4grc8p8mqul8qvmupkdi 192.168.11.11:2377

Create the overlay network on the manager:

docker network create --driver=overlay --attachable test-net

Run docker network ls to check the current networks; the last line shows that test-net has been created:

NETWORK ID     NAME                      DRIVER    SCOPE
ec8c853e521d   bridge                    bridge    local
72574615b63f   docker_gwbridge           bridge    local
9fbe2f6c3b22   freeaskinternet_default   bridge    local
b8273bdcc836   host                      host      local
ii71ul2agult   ingress                   overlay   swarm
eadcc6c24a81   none                      null      local
fxnzpd6r1hr0   sharednet                 overlay   swarm
wdoj2fcw29np   test-net                  overlay   swarm

2. Install docker-compose

sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/bin/docker-compose
sudo chmod +x /usr/bin/docker-compose
docker-compose --version

3. Create a working directory named work

mkdir work
cd work

4. Create a Dockerfile in work

# Dockerfile
FROM nvcr.io/nvidia/cuda:12.2.0-devel-ubuntu22.04
# Update system packages and install build dependencies
RUN apt-get update && apt-get install -y git build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libsqlite3-dev libreadline-dev libffi-dev liblzma-dev libbz2-dev curl wget net-tools iputils-ping pdsh
# Build and install Python
WORKDIR /home/user
RUN wget https://www.python.org/ftp/python/3.10.6/Python-3.10.6.tgz && \
    tar -zvxf Python-3.10.6.tgz && cd Python-3.10.6 && \
    ./configure --enable-optimizations && make -j 4 && make install

5. Create docker-compose.yml in work

version: "3"
services:
  llmtrain:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: llmtrain
    tty: true
    restart: always
    ulimits:
      memlock: -1
      stack: 67108864
    shm_size: 40G
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    volumes:
      - ./code:/home/user/code:cached
    networks:
      - test-net
networks:
  test-net:
    external: true

6. Build and start the container

sudo docker-compose up -d --build

7. Enter the container

sudo docker exec -it <container ID> /bin/bash

8. Check the container's network interfaces

ifconfig -a
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.0.1.14  netmask 255.255.255.0  broadcast 10.0.1.255
        ether 02:42:0a:00:01:0e  txqueuelen 0  (Ethernet)
        RX packets 2170444797  bytes 11730029590467 (11.7 TB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1371803017  bytes 11419623920546 (11.4 TB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.18.0.3  netmask 255.255.0.0  broadcast 172.18.255.255
        ether 02:42:ac:12:00:03  txqueuelen 0  (Ethernet)
        RX packets 74646  bytes 395241942 (395.2 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 44728  bytes 3336632 (3.3 MB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 161709  bytes 15509786 (15.5 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 161709  bytes 15509786 (15.5 MB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
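Here eth0 (10.0.1.14, MTU 1450) is the attachable overlay network created earlier; it is the interface NCCL will need to use later. If you prefer inspecting the interfaces from Python rather than ifconfig, a minimal sketch (assuming psutil is installed; the filename is made up) could look like this:

# list_ifaces.py -- print the IPv4 address of each interface visible inside
# the container, to help decide which NIC NCCL should bind to later.
import socket

import psutil  # assumed to be installed (pip install psutil)

for name, addrs in psutil.net_if_addrs().items():
    for addr in addrs:
        if addr.family == socket.AF_INET:
            print(f"{name}: {addr.address}")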

9. Verify network connectivity

Enter each of the two containers, note its IP address, and ping the other container to confirm the overlay network works.

10. Install the libraries the project needs

pip3 install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install deepspeed
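The requirements.txt file is not shown in this post; it is assumed to pull in torch and transformers. A quick sanity check (a sketch, not part of the original project; the filename is arbitrary) confirms that the stack imports cleanly inside the container:

# check_env.py -- minimal sanity check of the training stack; run on each node.
import torch
import transformers
import deepspeed

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("deepspeed:", deepspeed.__version__)
print("visible GPUs:", torch.cuda.device_count())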

Note: steps 2-10 must also be performed on the other machine (worker).

11. Configure passwordless SSH

Install the openssh-server service

First, install and start openssh-server inside the containers on both the manager and worker nodes:

# Install the SSH service
apt-get install openssh-server -y
# Start the SSH service
/etc/init.d/ssh start

Set up passwordless login

Note: all of the following steps are performed inside the containers on the manager and worker nodes.

In the containers on both the manager and worker nodes, run ssh-keygen -t rsa and press Enter through all the prompts.

ssh-keygen -t rsa
  • Append the contents of ~/.ssh/id_rsa.pub from the manager node to ~/.ssh/authorized_keys on both the manager and worker nodes.
  • Append the contents of ~/.ssh/id_rsa.pub from the worker node to ~/.ssh/authorized_keys on both the manager and worker nodes.

When copying the key contents, be careful not to introduce stray line breaks.

Next, add hostname mappings to /etc/hosts on both the manager and worker nodes:

10.0.1.14 worker
10.0.1.16 manager

Finally, test that the containers can log in to each other; if everything is configured correctly, no password prompt should appear:

ssh manager
ssh worker

12. Configure NCCL environment variables

Add the following to ~/.bashrc:

# NCCL must be told which NIC to use for communication; set this according to
# your machine. Here it is eth0, the overlay interface (check with ifconfig -a).
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO

Don't forget to run source ~/.bashrc so the settings take effect; do the same on the worker node. Before moving on, the sketch below can be used to confirm that NCCL really works across the two nodes.
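The following is a minimal smoke test (my own sketch; the filename is made up). It is launched with the same hostfile-based deepspeed launcher used in the next section, which exports the RANK/WORLD_SIZE/MASTER_ADDR environment variables that init_process_group reads:

# nccl_smoke_test.py -- minimal cross-node NCCL check (a sketch).
# Launch with: deepspeed --hostfile hostfile nccl_smoke_test.py
import os

import torch
import torch.distributed as dist


def main():
    # The deepspeed launcher sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, ...
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Every rank contributes a tensor of ones; after all_reduce each rank
    # should print the world size (8 for 2 nodes x 4 GPUs).
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()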

II. Distributed Training

This experiment fine-tunes the bloom-7B model.

Because the Hugging Face Trainer integrates tightly with DeepSpeed, only a single configuration argument is needed (see the finetune.py sketch under step 2 below):

1. Prepare the configuration files: hostfile and ds_config_1.json

# slots is the number of GPUs available on each machine
manager slots=4
worker slots=4

The contents of ds_config_1.json, the file passed to the deepspeed argument:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bfloat16": {
    "enabled": false
  },
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8
  },
  "steps_per_print": 10,
  "wall_clock_breakdown": false,
  "checkpoint": {
    "use_node_local_storage": true
  }
}

This experiment uses ZeRO stage 1. In practice, choose the stage that suits your model size and GPU memory.

For details, see the DeepSpeed documentation: Zero Redundancy Optimizer - DeepSpeed.
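As a side note, TrainingArguments also accepts the DeepSpeed configuration as an already-loaded dict instead of a file path, which makes it easy to try other ZeRO stages from code. A hedged sketch (the stage-2 CPU-offload settings below are illustrative, not what this experiment used):

# Sketch: pass the DeepSpeed config as a dict and switch the ZeRO stage in code.
from transformers import TrainingArguments

ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,                               # try stage 2 (or 3) for bigger models
        "offload_optimizer": {"device": "cpu"},   # optional: ZeRO-Offload to CPU
        "reduce_bucket_size": 5e8,
    },
}

training_args = TrainingArguments(
    output_dir="./output",
    deepspeed=ds_config,  # a dict works here just like a json path
)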

2. Launch training

deepspeed --hostfile hostfile finetune.py --deepspeed ./ds_config_1.json
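finetune.py itself is not listed in this post. For reference, here is a minimal sketch of how such a script can pick up the --deepspeed flag through HfArgumentParser; the model id bigscience/bloom-7b1, the toy dataset, and the extra --output_dir flag are assumptions for illustration only:

# finetune_sketch.py -- a hypothetical, minimal stand-in for finetune.py.
# Launch (note the extra --output_dir, which TrainingArguments requires):
#   deepspeed --hostfile hostfile finetune_sketch.py \
#       --deepspeed ./ds_config_1.json --output_dir ./output
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    HfArgumentParser,
    Trainer,
    TrainingArguments,
)


def main():
    # --deepspeed (and --local_rank injected by the launcher) are parsed
    # straight into TrainingArguments; Trainer then drives DeepSpeed itself.
    parser = HfArgumentParser(TrainingArguments)
    (training_args,) = parser.parse_args_into_dataclasses()

    model_name = "bigscience/bloom-7b1"  # assumed checkpoint for the bloom-7B run
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Toy dataset so the sketch runs end to end; replace with real data.
    texts = ["Hello, DeepSpeed!"] * 8
    enc = tokenizer(texts, padding=True, return_tensors="pt")
    train_dataset = [
        {"input_ids": ids, "attention_mask": mask, "labels": ids}
        for ids, mask in zip(enc["input_ids"], enc["attention_mask"])
    ]

    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
    trainer.train()


if __name__ == "__main__":
    main()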

3. Results

root@b50557cdc89c:/home/user/code# nvidia-smi
Wed May 29 02:08:43 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06    Driver Version: 525.125.06    CUDA Version: 12.2   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   56C    P0    83W / 300W | 79577MiB / 81920MiB  |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80G...  Off  | 00000000:35:00.0 Off |                    0 |
| N/A   54C    P0    78W / 300W | 80555MiB / 81920MiB  |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A800 80G...  Off  | 00000000:9D:00.0 Off |                    0 |
| N/A   62C    P0    99W / 300W | 80379MiB / 81920MiB  |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A800 80G...  Off  | 00000000:9E:00.0 Off |                    0 |
| N/A   59C    P0    91W / 300W | 80763MiB / 81920MiB  |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

III. Conclusion

When NCCL has to communicate over sockets, multi-node DeepSpeed training is very slow, slower even than single-node multi-GPU training. InfiniBand is recommended instead, but it requires the corresponding hardware.

IV. nccl-test

An nccl-tests run shows that socket-based multi-node NCCL bandwidth is roughly a tenth of the single-node figure: about 0.3 GB/s across the two nodes versus about 4 GB/s within one node.

# Install NCCL (available directly from the package repositories)
apt install libnccl2 libnccl-dev
# Check that it installed correctly
ldconfig -p | grep libnccl
# Install MPICH
apt-get install mpich
# Install nccl-tests
# Download from https://github.com/nvidia/nccl-tests or clone:
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1
# Single-node test
# via MPI
mpirun -np 4 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# Multi-node test via MPI
mpirun -np 8 -hosts manager,worker -map-by slot -env NCCL_DEBUG INFO -env NCCL_SOCKET_IFNAME eth0 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
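For a rough cross-check without nccl-tests, a torch.distributed timing probe can be launched with the same deepspeed launcher. This is my own sketch (hypothetical filename, approximate numbers only), not part of the original experiment:

# bw_probe.py -- rough all_reduce bandwidth probe (a sketch, not nccl-tests).
# Launch with: deepspeed --hostfile hostfile bw_probe.py
import os
import time

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    numel = 128 * 1024 * 1024 // 4        # 128 MB of fp32, like `-e 128M` above
    x = torch.ones(numel, device="cuda")

    for _ in range(5):                    # warm-up
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    avg = (time.time() - start) / iters

    if dist.get_rank() == 0:
        size_gb = numel * 4 / 1e9
        print(f"all_reduce 128MB: {avg * 1000:.1f} ms/iter, "
              f"~{size_gb / avg:.2f} GB/s (algorithm bandwidth)")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()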

