So far my work has only used single-machine multi-GPU fine-tuning, so to improve training efficiency this experiment tries multi-machine multi-GPU distributed training.
The experiment uses two machines (manager and worker), each running Ubuntu 22.04 with 4 GPUs.
To keep the installation and configuration identical on both machines, Docker containers are used; installing Docker itself is not covered here.
Initialize the Swarm cluster by running the following on the manager machine:
docker swarm init

# Output:
Swarm initialized: current node (k4ehuhg4a2umpjoo7yovy1caf) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-686yitjn5p5twd3b3pzezqofd8dlk1wm6juqo3xb5bj4xzztvh-15obj4grc8p8mqul8qvmupkdi 192.168.11.11:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
Join the cluster by running the following on the worker machine:
docker swarm join --token SWMTKN-1-686yitjn5p5twd3b3pzezqofd8dlk1wm6juqo3xb5bj4xzztvh-15obj4grc8p8mqul8qvmupkdi 192.168.11.11:2377
On the manager, create an overlay network:
docker network create --driver=overlay --attachable test-net
Run docker network ls to check the network list; the last line shows that test-net has been created:
NETWORK ID     NAME                      DRIVER    SCOPE
ec8c853e521d   bridge                    bridge    local
72574615b63f   docker_gwbridge           bridge    local
9fbe2f6c3b22   freeaskinternet_default   bridge    local
b8273bdcc836   host                      host      local
ii71ul2agult   ingress                   overlay   swarm
eadcc6c24a81   none                      null      local
fxnzpd6r1hr0   sharednet                 overlay   swarm
wdoj2fcw29np   test-net                  overlay   swarm
Install docker-compose (on both machines):

sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/bin/docker-compose
sudo chmod +x /usr/bin/docker-compose
docker-compose --version
Create a working directory and write the Dockerfile for the training image:

mkdir work
cd work

# Dockerfile

FROM nvcr.io/nvidia/cuda:12.2.0-devel-ubuntu22.04

# Update system packages and install build dependencies
RUN apt-get update && apt-get install -y git build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libsqlite3-dev libreadline-dev libffi-dev liblzma-dev libbz2-dev curl wget net-tools iputils-ping pdsh

# Build and install Python 3.10 from source
WORKDIR /home/user

RUN wget https://www.python.org/ftp/python/3.10.6/Python-3.10.6.tgz && \
    tar -zvxf Python-3.10.6.tgz && cd Python-3.10.6 && \
    ./configure --enable-optimizations && make -j 4 && make install
In the same directory, create a docker-compose.yml that builds the image, reserves the GPUs, mounts the code directory, and attaches the container to the test-net overlay network:

version: "3"
services:
  llmtrain:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: llmtrain
    tty: true
    restart: always
    ulimits:
      memlock: -1
      stack: 67108864
    shm_size: 40G
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    volumes:
      - ./code:/home/user/code:cached
    networks:
      - test-net

networks:
  test-net:
    external: true
sudo docker-compose up -d --build
sudo docker exec -it <container ID> /bin/bash
ifconfig -a

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.0.1.14  netmask 255.255.255.0  broadcast 10.0.1.255
        ether 02:42:0a:00:01:0e  txqueuelen 0  (Ethernet)
        RX packets 2170444797  bytes 11730029590467 (11.7 TB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1371803017  bytes 11419623920546 (11.4 TB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.18.0.3  netmask 255.255.0.0  broadcast 172.18.255.255
        ether 02:42:ac:12:00:03  txqueuelen 0  (Ethernet)
        RX packets 74646  bytes 395241942 (395.2 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 44728  bytes 3336632 (3.3 MB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 161709  bytes 15509786 (15.5 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 161709  bytes 15509786 (15.5 MB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
After entering the containers on both machines and checking their IP addresses, ping each other to verify that the overlay network works.
Install the Python dependencies inside the container:

pip3 install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

pip3 install deepspeed
Note: steps 2-10 above also need to be executed on the other machine (worker).
First, install and start the openssh-server service in the containers on both the manager and worker nodes:
# Install the SSH service
apt-get install openssh-server -y
# Start the SSH service
/etc/init.d/ssh start
Note: all of the following operations are performed inside the containers on the manager and worker nodes.
In the containers on both the manager and worker nodes, run ssh-keygen -t rsa and simply press Enter through all the prompts.
ssh-keygen -t rsa
Then copy the content of each node's ~/.ssh/id_rsa.pub into the other node's ~/.ssh/authorized_keys; when copying the file contents, watch out for stray carriage-return/line-feed characters.
Next, add the following host mappings to /etc/hosts in the containers on both the manager and worker nodes:
10.0.1.14 worker
10.0.1.16 manager
Finally, test whether the containers can log in to each other without a password; if everything is configured correctly, no password prompt should appear.
ssh manager

ssh worker
Add the following to the ~/.bashrc file:
# Pay attention to the NCCL settings: specify the network interface NCCL should use for communication
# according to your machine; here it is eth0, which can be confirmed with ifconfig -a

export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO
Then do not forget to run source ~/.bashrc to make the settings take effect; perform the same steps on the worker node.
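Before launching a full fine-tuning run, it can be worth checking that NCCL actually works across the two containers. The script below is a minimal sketch of my own (the file name check_nccl.py and its contents are not from the original article); once the hostfile described below is in place, it can be launched with deepspeed --hostfile hostfile check_nccl.py.

# check_nccl.py -- minimal multi-node NCCL sanity check (a sketch, not part of
# the original article). The DeepSpeed launcher sets RANK, LOCAL_RANK,
# WORLD_SIZE, MASTER_ADDR and MASTER_PORT for every process it starts.
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")            # reads the env:// variables
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Every rank contributes its global rank; after all_reduce every GPU should
    # hold 0 + 1 + ... + (world_size - 1).
    world_size = dist.get_world_size()
    x = torch.tensor([float(dist.get_rank())], device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    expected = world_size * (world_size - 1) / 2
    print(f"rank {dist.get_rank()}: all_reduce -> {x.item()} (expected {expected})")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

With NCCL_DEBUG=INFO the log will also show which interface (eth0) and transport NCCL selected, which makes it obvious whether it fell back to plain sockets.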
This experiment fine-tunes the bloom-7B model.
Since the Hugging Face Trainer has very good DeepSpeed support, a single configuration argument is all that is needed (a minimal sketch follows the hostfile below). The DeepSpeed launcher also needs a hostfile listing each node and the number of GPUs it may use:
# slots is the number of GPUs available on the corresponding machine
manager slots=4
worker slots=4
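The original article does not show finetune.py, so the following is only a minimal sketch of how the single configuration argument is wired in, assuming a standard Hugging Face Trainer script: passing deepspeed="./ds_config_1.json" to TrainingArguments is that one argument. The checkpoint name, placeholder data and hyperparameters below are my own assumptions, not the author's code.

# finetune.py -- minimal sketch only (placeholder data and hyperparameters)
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bigscience/bloom-7b1"          # assumed checkpoint for "bloom-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny placeholder dataset so the sketch is self-contained.
texts = ["DeepSpeed multi-node fine-tuning test sentence."] * 64

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=64,
                    padding="max_length")
    out["labels"] = out["input_ids"].copy()  # causal LM: labels = inputs
    return out

train_dataset = Dataset.from_dict({"text": texts}).map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    fp16=True,                               # matches "fp16": {"enabled": "auto"}
    deepspeed="./ds_config_1.json",          # the single DeepSpeed argument
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()

In a real script you would typically accept --deepspeed and --local_rank through HfArgumentParser instead of hard-coding the config path, and load your own dataset.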
The contents of the corresponding DeepSpeed config file ds_config_1.json:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,

  "bfloat16": {
    "enabled": false
  },
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8
  },
  "steps_per_print": 10,
  "wall_clock_breakdown": false,
  "checkpoint": {
    "use_node_local_storage": true
  }
}
This experiment uses ZeRO stage 1; in practice, choose a stage appropriate for your model size and GPU memory.
For details, see: Zero Redundancy Optimizer - DeepSpeed
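As a rough guide for choosing a stage, below is a back-of-the-envelope estimate, based on the memory breakdown in the ZeRO paper rather than on anything in the original article, of the per-GPU model-state memory for a 7B-parameter model trained in fp16 with Adam on 2 x 4 GPUs; activations, buffers and fragmentation come on top of these numbers.

# Per-GPU model-state memory under ZeRO (ZeRO paper formulas):
# fp16 weights (2 B/param) + fp16 grads (2 B/param) + Adam states in fp32
# (fp32 copy + momentum + variance = 12 B/param), with different parts
# sharded across N GPUs depending on the stage.
params = 7e9          # ~7B parameters
n_gpus = 8            # 2 machines x 4 GPUs

def gib(x):
    return x / 2**30

stage0 = params * (2 + 2 + 12)                # no sharding
stage1 = params * (2 + 2 + 12 / n_gpus)       # shard optimizer states
stage2 = params * (2 + (2 + 12) / n_gpus)     # + shard gradients
stage3 = params * ((2 + 2 + 12) / n_gpus)     # + shard parameters

for name, b in [("ZeRO-0", stage0), ("ZeRO-1", stage1),
                ("ZeRO-2", stage2), ("ZeRO-3", stage3)]:
    print(f"{name}: ~{gib(b):.1f} GiB of model states per GPU")

This comes out to roughly 104 GiB without sharding versus about 36 GiB with ZeRO-1, which is why stage 1 already fits on the 80 GB A800s used here; for larger models or smaller GPUs, stage 2 or 3 (possibly with offloading) becomes necessary.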
Launch training from the container on the manager node:

deepspeed --hostfile hostfile finetune.py --deepspeed ./ds_config_1.json
During training, nvidia-smi on one of the nodes shows all four GPUs fully utilized:

root@b50557cdc89c:/home/user/code# nvidia-smi
Wed May 29 02:08:43 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   56C    P0    83W / 300W | 79577MiB / 81920MiB  |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80G...  Off  | 00000000:35:00.0 Off |                    0 |
| N/A   54C    P0    78W / 300W | 80555MiB / 81920MiB  |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A800 80G...  Off  | 00000000:9D:00.0 Off |                    0 |
| N/A   62C    P0    99W / 300W | 80379MiB / 81920MiB  |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A800 80G...  Off  | 00000000:9E:00.0 Off |                    0 |
| N/A   59C    P0    91W / 300W | 80763MiB / 81920MiB  |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
If NCCL uses socket communication, DeepSpeed multi-machine distributed training is very slow, even slower than single-machine multi-GPU training. IB (InfiniBand) communication is recommended instead, but it requires the corresponding hardware.
An nccl-tests run shows that multi-machine NCCL over sockets reaches only about one tenth of the single-machine speed: roughly 0.3 GB/s across machines versus 4 GB/s within a single machine.
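To put those numbers in perspective, here is a back-of-the-envelope sketch of my own (assuming the quoted figures are nccl-tests bus-bandwidth values, which is not stated in the original) of how long one gradient all_reduce of bloom-7B would take at each bandwidth:

# Rough estimate only; assumes fp16 gradients and the nccl-tests busbw
# definition (busbw = algbw * 2*(n-1)/n for all_reduce).
params = 7e9                                   # bloom-7B parameter count
grad_bytes = params * 2                        # fp16 gradients, 2 bytes each

cases = [("multi-machine over sockets", 0.3e9, 8),   # 2 machines x 4 GPUs
         ("single machine", 4e9, 4)]                 # 4 GPUs
for label, busbw, n in cases:
    factor = 2 * (n - 1) / n                   # all_reduce traffic factor
    seconds = grad_bytes * factor / busbw
    print(f"{label}: ~{seconds:.0f} s of gradient all_reduce per optimizer step")

That is roughly 80 s versus 5 s of pure communication per step, which matches the observation that socket-based multi-node training can end up slower than staying on a single machine. The nccl-tests setup used for the measurement is below.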
# Install NCCL (it is now available from the package repositories)
apt install libnccl2 libnccl-dev

# Check that the installation succeeded
ldconfig -p | grep libnccl

# Install MPICH
apt-get install mpich

# Install nccl-tests
# Download from https://github.com/nvidia/nccl-tests or clone:
git clone https://github.com/NVIDIA/nccl-tests.git

cd nccl-tests
make MPI=1

# Single-machine test
# via MPI (one GPU per rank)
mpirun -np 4 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# or a single process driving all 4 GPUs
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

# Multi-machine test via MPI (8 ranks, one GPU per rank)
mpirun -np 8 -hosts manager,worker -map-by slot -env NCCL_DEBUG INFO -env NCCL_SOCKET_IFNAME eth0 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1