torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Author: IT小白 | 2024-02-28 16:27:09



Problem


Traceback (most recent call last):
  File "/ssd1/miniconda3/envs/pytorch2.1.2/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 65322)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 65323)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 65324)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 65325)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 65326)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 65321)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Solution

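Edit finetune_qlora_ds.sh and set GPUS_PER_NODE to the number of GPUs that are actually available on the machine. In the log above, ranks 2 through 7 failed while ranks 0 and 1 are not listed, which is consistent with more processes being launched than there were usable GPUs.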

GPUS_PER_NODE=2
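
Rather than hard-coding the value, the count could be derived when the script starts. The following is a minimal sketch, not the content of the original finetune_qlora_ds.sh: `count_gpus` is a hypothetical helper introduced here for illustration, and it assumes that "available" means "visible to this shell", either through CUDA_VISIBLE_DEVICES or through `nvidia-smi -L`. It does not check whether a GPU is already busy with another job.

```shell
#!/usr/bin/env bash
# Sketch: derive GPUS_PER_NODE from the environment instead of hard-coding it.

# count_gpus is a hypothetical helper (not part of finetune_qlora_ds.sh).
# It prints the number of GPUs visible to this shell.
count_gpus() {
  if [ -n "${CUDA_VISIBLE_DEVICES+set}" ]; then
    # CUDA_VISIBLE_DEVICES is a comma-separated list such as "0,1".
    # A value that is set but empty hides every GPU, so this prints 0.
    printf '%s\n' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c . || true
  elif command -v nvidia-smi >/dev/null 2>&1; then
    # nvidia-smi -L prints one line starting with "GPU " per device.
    nvidia-smi -L | grep -c '^GPU ' || true
  else
    # No NVIDIA tooling found: report zero GPUs.
    echo 0
  fi
}

GPUS_PER_NODE=$(count_gpus)
echo "GPUS_PER_NODE=${GPUS_PER_NODE}"
```

If only some of the GPUs on the machine should be used, exporting CUDA_VISIBLE_DEVICES (for example `0,1`) before this snippet runs keeps the process count and the visible devices in agreement.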
