torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Problem

Traceback (most recent call last):
  File "/ssd1/miniconda3/envs/pytorch2.1.2/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 65322)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 65323)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 65324)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 65325)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 65326)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-17_14:12:08
  host      : aidev02
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 65321)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
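Every failure entry in the log reports error_file: <N/A> and points to the PyTorch elastic errors page for enabling tracebacks. A minimal sketch of the approach described there is to wrap the entry point of the training script with the record decorator, so that an uncaught exception in a worker is written to the error file that torchrun reads back. The main function below is a hypothetical stand-in for the real body of finetune.py, and the ImportError fallback is an assumption added only so the sketch runs where PyTorch is not installed.

```python
try:
    # Real PyTorch API: records uncaught exceptions from the decorated
    # function so the launcher can show the worker's traceback.
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Assumption for illustration: a no-op fallback so this sketch still
    # runs on a machine without PyTorch.
    def record(fn):
        return fn


@record
def main():
    # Hypothetical placeholder for the actual training code in finetune.py.
    return 0


if __name__ == "__main__":
    main()
```

With the decorator in place, the failure summary should carry the worker's own traceback instead of only exit code 1, which makes the real cause easier to confirm.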

Solution

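Edit finetune_qlora_ds.sh and set GPUS_PER_NODE to the number of GPUs that are actually available. In the log above, local ranks 2 through 7 failed while ranks 0 and 1 are not listed, which is consistent with launching more worker processes than there are usable GPUs. On this machine the value is: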

GPUS_PER_NODE=2
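The mismatch can be sketched in a few lines of Python. This assumes the original launch requested 8 workers per node (the log shows ranks up to 7) on a machine with 2 usable GPUs; both function names are hypothetical helpers, not part of PyTorch. In a real script the available GPU count could come from torch.cuda.device_count().

```python
def ranks_without_gpu(nproc_per_node, available_gpus):
    """Return the local ranks that would have no GPU to bind to."""
    return [rank for rank in range(nproc_per_node) if rank >= available_gpus]


def check_gpus_per_node(nproc_per_node, available_gpus):
    """Fail early with a readable message if too many workers are requested."""
    orphaned = ranks_without_gpu(nproc_per_node, available_gpus)
    if orphaned:
        raise ValueError(
            f"GPUS_PER_NODE={nproc_per_node} but only {available_gpus} GPU(s) "
            f"are available; local ranks {orphaned} would have no device"
        )


if __name__ == "__main__":
    # Assumed original setup: 8 workers, 2 GPUs.
    print(ranks_without_gpu(8, 2))  # prints [2, 3, 4, 5, 6, 7]
    # The corrected setting passes without raising.
    check_gpus_per_node(2, 2)
```

The ranks printed for the assumed setup are the same ones reported as failed in the log. Running a check like this before launching turns an opaque ChildFailedError into an explicit configuration message.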
