
MindSpore large-model training error: BrokenPipeError: [Errno 32] Broken pipe, EOFError


1. System environment
Hardware environment (Ascend/GPU/CPU): Ascend
Execution mode: static graph, MindSpore 2.1.1
Python version: 3.7
Operating system: Linux


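2. Error message
  2.1 Problem description

   Running a large model with MindSpore fails with the following errors (the output of several threads is interleaved):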

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/python3.9/lib/python3.9/threading.py", line 973, in _bootstrap_inner
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/local/python3.9/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 91, in run
    self.run()
  File "/usr/local/python3.9/lib/python3.9/threading.py", line 910, in run
    key, func, args, kwargs = self.task_q.get(timeout=TIMEOUT)
  File "<string>", line 2, in get
    self._target(*self._args, **self._kwargs)
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/pool.py", line 513, in _handle_workers
    cls._maintain_pool(ctx, Process, processes, pool, inqueue,
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/pool.py", line 337, in _maintain_pool
    Pool._repopulate_pool_static(ctx, Process, processes, pool,
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod
    w.start()
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/context.py", line 291, in _Popen
    conn.send((self._id, methodname, args, kwds))
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/connection.py", line 211, in send
    return Popen(process_obj)
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/popen_forkserver.py", line 35, in __init__
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
    super().__init__(process_obj)
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._send(header + buf)
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    self._launch(process_obj)
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/popen_forkserver.py", line 58, in _launch
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
    f.write(buf.getbuffer())
BrokenPipeError: [Errno 32] Broken pipe
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/python3.9/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 91, in run
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/python3.9/lib/python3.9/threading.py", line 973, in _bootstrap_inner
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/python3.9/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 91, in run
    key, func, args, kwargs = self.task_q.get(timeout=TIMEOUT)
  File "<string>", line 2, in get
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/managers.py", line 810, in _callmethod
    kind, result = conn.recv()
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/usr/local/python3.9/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError

3. Solution

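The pretrained model is too large, so host memory is exhausted while the model is being loaded. The operating system then terminates some processes to reclaim memory (on Linux this is typically the kernel's OOM killer). The processes that survive lose the other end of their multiprocessing connections, which produces the BrokenPipeError and EOFError shown above.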
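
The link between a vanished peer process and these two exception types can be reproduced without running out of memory. The following is a minimal sketch, assuming a POSIX system: it closes one end of a `multiprocessing.Pipe` to stand in for a killed process, then sends or receives on the other end. The function name `demo_closed_peer` is made up for illustration and is not part of MindSpore or the Ascend toolkit.

```python
import multiprocessing


def demo_closed_peer():
    """Return the names of the exceptions raised once the peer end is gone."""
    errors = []

    # Case 1: the receiving end disappears, then we try to send.
    reader, writer = multiprocessing.Pipe(duplex=False)
    reader.close()  # stands in for the peer process being killed
    try:
        writer.send("task")
    except BrokenPipeError as exc:
        errors.append(type(exc).__name__)
    finally:
        writer.close()

    # Case 2: the sending end disappears, then we try to receive.
    reader, writer = multiprocessing.Pipe(duplex=False)
    writer.close()  # stands in for the peer process being killed
    try:
        reader.recv()
    except EOFError as exc:
        errors.append(type(exc).__name__)
    finally:
        reader.close()

    return errors


if __name__ == "__main__":
    print(demo_closed_peer())
```

Both exceptions are therefore symptoms seen by the surviving side; the process that was actually terminated usually leaves no Python traceback of its own.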

Suggested direction for investigation: check whether host memory was fully used, and find out what consumed it.
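
One way to start that investigation is sketched below, under stated assumptions: the host is Linux and exposes `/proc/meminfo` with a `MemAvailable` field, and the kernel log reports OOM kills with text of the form `Killed process <pid> (<name>)`. The helper names (`parse_meminfo`, `host_memory_used_fraction`, `find_oom_kills`) are hypothetical and the log pattern may need adjusting for a particular kernel version.

```python
import os
import re


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)", line)
        if match:
            info[match.group(1)] = int(match.group(2))
    return info


def host_memory_used_fraction(info):
    """Fraction of host memory in use, from MemTotal and MemAvailable."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return (total - available) / total


def find_oom_kills(kernel_log):
    """Return (pid, process name) pairs for kills found in kernel log text.

    Assumes lines containing 'Killed process <pid> (<name>)'.
    """
    pattern = re.compile(r"Killed process (\d+) \(([^)]+)\)")
    return [(int(pid), name) for pid, name in pattern.findall(kernel_log)]


if __name__ == "__main__":
    # Only meaningful on Linux; skipped elsewhere.
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        print(f"host memory in use: {host_memory_used_fraction(info):.1%}")
```

Sampling the used fraction periodically while the model loads shows whether memory climbs toward 100% just before the failure. Text captured from the kernel log (for example the output of `dmesg`, which may require elevated privileges) can be passed to `find_oom_kills` to see whether a training-related process was killed.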
