(1) bug: Running ptuning/train.sh fails at torch.distributed.TCPStore with RuntimeError: Connection reset by peer
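fix: At the point of failure, the master_addr that had been obtained was the machine's custom hostname rather than localhost, which caused the network connection to fail. I therefore edited line 244 of ~/anaconda3/envs/chatglm2-6b/lib/python3.10/site-packages/torch/distributed/rendezvous.py to force the master_addr variable to 127.0.0.1 instead of the custom hostname. (References: "MASTER_ADDR and MASTER_PORT · Issue #43207 · pytorch/pytorch · GitHub" and "Runtime error: connection reset by peer in init_process_group - distributed - PyTorch Forums".)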
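Before patching a file inside site-packages, it may help to check what the hostname actually resolves to on the machine. The following is a minimal diagnostic sketch using only the Python standard library; the helper names resolve_master_addr and pick_master_addr are hypothetical and not part of PyTorch. It only tests whether the name resolves, so a hostname that resolves to an unreachable address would still need the loopback address. Whether train.sh honors a MASTER_ADDR set in the environment depends on how the job is launched, so treat that as an assumption to verify.

```python
import os
import socket
from typing import Optional


def resolve_master_addr(addr: str) -> Optional[str]:
    """Return the IPv4 address that `addr` resolves to, or None if lookup fails."""
    try:
        return socket.gethostbyname(addr)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised for unresolvable names.
        return None


def pick_master_addr(candidate: str, fallback: str = "127.0.0.1") -> str:
    """Keep `candidate` if it resolves; otherwise suggest the loopback fallback."""
    return candidate if resolve_master_addr(candidate) is not None else fallback


if __name__ == "__main__":
    # Use MASTER_ADDR if it is set, otherwise the machine's own hostname.
    candidate = os.environ.get("MASTER_ADDR", socket.gethostname())
    print("candidate:", candidate, "resolves to:", resolve_master_addr(candidate))
    print("suggested MASTER_ADDR:", pick_master_addr(candidate))
```

If the hostname does not resolve, or resolves to an address that is not reachable locally, pointing the master address at 127.0.0.1 for a single-machine run has the same effect as the edit described above without modifying the installed PyTorch package.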