Suppose your server has 4 GPUs.
First, make sure the accelerate command is installed. If it is not, run:
pip install accelerate
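Second, note that CUDA_VISIBLE_DEVICES is an environment variable read by CUDA, not a command, so there is nothing to install; you only need a shell that lets you set it.
Third, select the GPU directly on the command line.
Assign task 1 to GPU 0: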
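To make the effect of CUDA_VISIBLE_DEVICES concrete, here is a minimal Python sketch of how the variable renumbers devices inside a process. The helper name visible_gpu_map is hypothetical, and the sketch assumes the variable holds plain comma-separated integer indices (CUDA also accepts device UUIDs, which this does not handle).

```python
import os


def visible_gpu_map(environ=None):
    """Map in-process device index -> physical GPU index.

    Hypothetical helper for illustration. Assumes CUDA_VISIBLE_DEVICES,
    when set, is a comma-separated list of integer indices.
    Returns None when the variable is unset (all GPUs are visible).
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    physical = [int(tok) for tok in raw.split(",") if tok.strip()]
    # Inside the process, the listed GPUs are renumbered starting from 0.
    return dict(enumerate(physical))
```

For example, visible_gpu_map({"CUDA_VISIBLE_DEVICES": "2,3"}) gives {0: 2, 1: 3}: the script sees two devices, cuda:0 and cuda:1, which are physical GPUs 2 and 3.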
CUDA_VISIBLE_DEVICES=0 nohup accelerate launch a.py >log.txt &
Assign task 2 to GPU 1:
CUDA_VISIBLE_DEVICES=1 nohup accelerate launch --main_process_port 20655 a.py >log.txt &
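This approach runs successfully.
Here nohup lets the process keep running after you log out (it ignores the hangup signal), > redirects standard output to a log file, and & runs the command in the background. Both commands above write to log.txt, so use a different log file name for each task if they run from the same directory.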
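The same two launches can be sketched from Python with subprocess. This is only an illustration under assumptions: build_job and start_in_background are hypothetical helper names, accelerate is assumed to be on PATH, and start_new_session=True (POSIX only) is used as a rough stand-in for nohup.

```python
import os
import subprocess


def build_job(gpu_ids, script, port=None):
    """Return (command, environment) for one `accelerate launch` job.

    Hypothetical helper: gpu_ids is a list of physical GPU indices and
    port, when given, is passed through --main_process_port.
    """
    cmd = ["accelerate", "launch"]
    if port is not None:
        cmd += ["--main_process_port", str(port)]
    cmd.append(script)
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    return cmd, env


def start_in_background(cmd, env, log_path):
    """Start the job detached from the terminal, logging to log_path."""
    with open(log_path, "w") as log:
        # start_new_session=True puts the job in its own session, which
        # plays roughly the role nohup plays in the shell.
        return subprocess.Popen(
            cmd,
            env=env,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,
        )
```

Calling start_in_background(*build_job([0], "a.py"), "log_gpu0.txt") and then start_in_background(*build_job([1], "a.py", port=20655), "log_gpu1.txt") would mirror the two shell commands, with one log file per task.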
====================================================
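The config-file approach below raised errors when originally tested. Note that accelerate launch takes a config file through the --config_file option; a bare file name after launch is treated as the script to run.
Third, create a default run configuration file, default_config.yaml: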
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 2
Next, create a second run configuration file, second_config.yaml:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: 20655
main_training_function: main
num_machines: 1
num_processes: 2
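Since the two files differ only in main_process_port, they could be generated from one template. The sketch below is an assumption-laden illustration: render_config is a hypothetical helper, and the keys (including fp16) are copied as-is from the files above; newer accelerate releases may expect different keys.

```python
def render_config(num_processes=2, main_process_port=None):
    """Render an accelerate config as YAML text (hypothetical helper).

    The keys mirror the two files above; only num_processes and
    main_process_port are parameterised.
    """
    port = "null" if main_process_port is None else str(main_process_port)
    lines = [
        "compute_environment: LOCAL_MACHINE",
        "distributed_type: MULTI_GPU",
        "fp16: false",
        "machine_rank: 0",
        "main_process_ip: null",
        f"main_process_port: {port}",
        "main_training_function: main",
        "num_machines: 1",
        f"num_processes: {num_processes}",
    ]
    return "\n".join(lines) + "\n"
```

Writing render_config() to default_config.yaml and render_config(main_process_port=20655) to second_config.yaml reproduces the two files.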
Fourth, run the training script:
CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file default_config.yaml train_script.py
or
CUDA_VISIBLE_DEVICES=0,1 accelerate launch train_script.py
CUDA_VISIBLE_DEVICES=2,3 accelerate launch --config_file second_config.yaml train_script.py
# Alternatively, specify the main process port on the command line
CUDA_VISIBLE_DEVICES=2,3 accelerate launch --main_process_port 20655 train_script.py
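The value 20655 is just a port that happened to be free. As a hedged sketch, a free port can be requested from the operating system instead of hard-coding one; find_free_port is a hypothetical helper, and another process could still claim the port between this check and the actual launch.

```python
import socket


def find_free_port():
    """Ask the OS for a currently unused TCP port (illustrative helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]
```

The returned number can then be passed as the value of --main_process_port for the second job.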
# When selecting a GPU for PyTorch together with nohup, you may hit the error "No such file or directory"
CUDA_VISIBLE_DEVICES=0 nohup python -u main.py >log.txt &
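# Note: CUDA_VISIBLE_DEVICES must come before nohup. nohup treats its first argument as the program to run, so an assignment placed after it is looked up as a file name.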
The same script can be run in several configurations: on a single CPU, on a single GPU, with fp16 mixed precision, on multiple GPUs, on multiple machines, and on TPUs.
To run it in each of these modes, use the following commands:
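Single CPU:
From a server without a GPU:
python ./nlp_example.py
From any server, by passing cpu=True to the Accelerator:
python ./nlp_example.py --cpu
From any server, with the Accelerate launcher:
accelerate launch --cpu ./nlp_example.py
Single GPU:
python ./nlp_example.py  # from a server with a GPU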
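With fp16 (mixed precision):
From any server, by passing fp16=True to the Accelerator:
python ./nlp_example.py --fp16
From any server, with the Accelerate launcher:
accelerate launch --fp16 ./nlp_example.py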
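How the --cpu and --fp16 script flags might reach the Accelerator can be sketched as follows. This is not the actual nlp_example.py: parse_flags and accelerator_kwargs are hypothetical helpers, and the sketch assumes a recent accelerate release, where mixed precision is requested through mixed_precision="fp16" rather than the older fp16=True.

```python
import argparse


def parse_flags(argv=None):
    """Parse the two command-line switches used in the commands above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--cpu", action="store_true", help="train on CPU")
    parser.add_argument("--fp16", action="store_true",
                        help="use fp16 mixed precision")
    return parser.parse_args(argv)


def accelerator_kwargs(args):
    """Translate CLI flags into keyword arguments for Accelerator."""
    return {
        "cpu": args.cpu,
        "mixed_precision": "fp16" if args.fp16 else "no",
    }


def main(argv=None):
    # Imported here so the flag handling above works without accelerate.
    from accelerate import Accelerator

    accelerator = Accelerator(**accelerator_kwargs(parse_flags(argv)))
    print(accelerator.device)
```

Calling main(["--cpu"]) would build the Accelerator with cpu=True, so the script stays on the CPU even on a machine with GPUs.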
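Multiple GPUs (using PyTorch distributed mode):
With the Accelerate config and launcher:
accelerate config  # This will create a config file on your server
accelerate launch ./nlp_example.py  # This will run the script on your server
With the traditional PyTorch launcher:
python -m torch.distributed.launch --nproc_per_node 2 --use_env ./nlp_example.py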
Multiple GPUs on multiple machines (using PyTorch distributed mode):
With the Accelerate config and launcher, on each machine:
accelerate config  # This will create a config file on each server
accelerate launch ./nlp_example.py  # This will run the script on each server
With the PyTorch launcher only:
python -m torch.distributed.launch --nproc_per_node 2 \
    --use_env \
    --node_rank 0 \
    --master_addr master_node_ip_address \
    ./nlp_example.py  # On the first server
python -m torch.distributed.launch --nproc_per_node 2 \
    --use_env \
    --node_rank 1 \
    --master_addr master_node_ip_address \
    ./nlp_example.py  # On the second server
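With --use_env, the PyTorch launcher hands each process its local rank through the LOCAL_RANK environment variable instead of a --local_rank argument. A minimal sketch of reading that information inside a training script follows; read_dist_env is a hypothetical helper, and the defaults describe a plain single-process run.

```python
import os


def read_dist_env(environ=None):
    """Read the rank information exported by the PyTorch launcher.

    Falls back to single-process defaults when the variables are unset,
    for example when the script is started with plain `python`.
    """
    environ = os.environ if environ is None else environ
    return {
        "rank": int(environ.get("RANK", 0)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }
```

In a run with two nodes of two processes each, the second process on the second node would see RANK=3, LOCAL_RANK=1 and WORLD_SIZE=4.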
TPUs:
With the Accelerate config and launcher:
accelerate config  # This will create a config file on your TPU server
accelerate launch ./nlp_example.py  # This will run the script on each server
In plain PyTorch: add an xmp.spawn line in your script as you usually do.