- Traceback (most recent call last):
- File "/ssd1/miniconda3/envs/pytorch2.1.2/bin/torchrun", line 33, in <module>
- sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
- File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
- return f(*args, **kwargs)
- File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
- run(args)
- File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
- elastic_launch(
- File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
- return launch_agent(self._config, self._entrypoint, list(args))
- File "/ssd1/miniconda3/envs/pytorch2.1.2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
- raise ChildFailedError(
- torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
- ============================================================
- finetune.py FAILED
- ------------------------------------------------------------
- Failures:
- [1]:
- time : 2024-01-17_14:12:08
- host : aidev02
- rank : 3 (local_rank: 3)
- exitcode : 1 (pid: 65322)
- error_file: <N/A>
- traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
- [2]:
- time : 2024-01-17_14:12:08
- host : aidev02
- rank : 4 (local_rank: 4)
- exitcode : 1 (pid: 65323)
- error_file: <N/A>
- traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
- [3]:
- time : 2024-01-17_14:12:08
- host : aidev02
- rank : 5 (local_rank: 5)
- exitcode : 1 (pid: 65324)
- error_file: <N/A>
- traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
- [4]:
- time : 2024-01-17_14:12:08
- host : aidev02
- rank : 6 (local_rank: 6)
- exitcode : 1 (pid: 65325)
- error_file: <N/A>
- traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
- [5]:
- time : 2024-01-17_14:12:08
- host : aidev02
- rank : 7 (local_rank: 7)
- exitcode : 1 (pid: 65326)
- error_file: <N/A>
- traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
- ------------------------------------------------------------
- Root Cause (first observed failure):
- [0]:
- time : 2024-01-17_14:12:08
- host : aidev02
- rank : 2 (local_rank: 2)
- exitcode : 1 (pid: 65321)
- error_file: <N/A>
- traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
- ============================================================
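Fix: edit finetune_qlora_ds.sh and set GPUS_PER_NODE to the number of GPUs that are actually usable. In the log above only ranks 2 through 7 are reported as failed, which is consistent with the launcher starting more worker processes than there are usable GPUs (here, two).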
GPUS_PER_NODE=2
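
To avoid hard-coding the value, the usable GPU count can be looked up before editing the script. The following is a minimal sketch, not part of the original post: the helper names `parse_visible_devices` and `usable_gpu_count` are hypothetical, and it assumes that either `CUDA_VISIBLE_DEVICES` is set or `nvidia-smi` is on the PATH. It only counts the entries in `CUDA_VISIBLE_DEVICES` and does not check that each index is valid.

```python
import os
import subprocess


def parse_visible_devices(value):
    """Count the non-empty entries in a CUDA_VISIBLE_DEVICES-style string."""
    return len([d for d in value.split(",") if d.strip()])


def usable_gpu_count(env=None):
    """Return the number of GPUs a launcher on this node could use.

    Hypothetical helper: prefers CUDA_VISIBLE_DEVICES when it is set,
    otherwise counts the devices listed by `nvidia-smi -L`.
    """
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is not None:
        # An empty string means no GPU is visible to the process.
        return parse_visible_devices(visible)
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        # nvidia-smi is missing or failed: assume no usable GPU.
        return 0
    # `nvidia-smi -L` prints one line per device, each starting with "GPU ".
    return sum(1 for line in out.splitlines() if line.startswith("GPU "))


if __name__ == "__main__":
    # Print a line that can be pasted into finetune_qlora_ds.sh.
    print(f"GPUS_PER_NODE={usable_gpu_count()}")
```

On a machine where two GPUs are usable, the printed line should match the `GPUS_PER_NODE=2` setting above.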