
[Debug log] Problems encountered while fine-tuning ChatGLM2-6b_torch.distributed.distnetworkerror: connection res

torch.distributed.distnetworkerror: connection reset by peer

(1) Bug: running ptuning/train.sh fails at torch.distributed.TCPStore with RuntimeError: Connection reset by peer.

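Fix: at the point of failure, the master_addr being used turned out to be the machine's custom hostname rather than localhost, which caused the network connection error. I therefore edited line 244 of ~/anaconda3/envs/chatglm2-6b/lib/python3.10/site-packages/torch/distributed/rendezvous.py (the line number is specific to the PyTorch version in this environment) to force the master_addr variable to 127.0.0.1 instead of the custom hostname. (References: "MASTER_ADDR and MASTER_PORT · Issue #43207 · pytorch/pytorch · GitHub" and "Runtime error: connection reset by peer in init_process_group - distributed - PyTorch Forums".)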
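A less invasive variant of the same idea is to leave site-packages untouched and override the rendezvous address from the training script. The sketch below is not the fix applied above. It assumes a single-node run that uses the default `env://` initialization, where PyTorch reads the `MASTER_ADDR` environment variable. The helper name `force_loopback_master_addr` is hypothetical, and whether the override takes effect with a particular launcher is an assumption that needs to be checked on the machine in question.

```python
import os


def force_loopback_master_addr(env=None, loopback="127.0.0.1"):
    """Point MASTER_ADDR at the loopback interface (hypothetical helper).

    Only sensible for single-node training, where every worker runs on
    the same machine. Returns the previous value so it can be logged.
    """
    env = os.environ if env is None else env
    previous = env.get("MASTER_ADDR")  # e.g. a custom hostname, or None
    env["MASTER_ADDR"] = loopback
    return previous


if __name__ == "__main__":
    old = force_loopback_master_addr()
    print(f"MASTER_ADDR: {old!r} -> {os.environ['MASTER_ADDR']!r}")
    # Assumption: the process group is created after this point, e.g. with
    # torch.distributed.init_process_group(backend="nccl"), so that the
    # env:// rendezvous picks up the overridden address.
```

The call has to run before the process group is initialized. If the custom hostname simply does not resolve to a reachable address, mapping it correctly in /etc/hosts is another option worth trying before editing library code.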
