
PySpark 3.0: adding log output for Python worker process crashes

org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)

In Spark versions below 3.2.0, if a Python UDF crashes at runtime with a segmentation fault, the Spark executor's error log is very vague and shows only a single line:

Python worker exited unexpectedly (crashed)

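When a process core dumps, it is killed by a signal, so ordinary language-level try/catch exception handling cannot intercept the failure. That makes troubleshooting very unfriendly. The problem was fixed in Spark 3.2; for details see the issue: [SPARK-36062] Try to capture faulthanlder when a Python worker crashes. - ASF JIRA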
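To make the point about try/catch concrete, here is a minimal sketch, assuming a POSIX system where reading address 0 through ctypes raises SIGSEGV (as it does on typical Linux builds). The helper name run_crashing_child is made up for this illustration. The child wraps the crashing call in try/except, yet the handler never runs and the parent only sees a signal exit code.

```python
import signal
import subprocess
import sys

# Source of the child program: a NULL-pointer read wrapped in try/except.
CHILD = """
import ctypes
try:
    ctypes.string_at(0)  # reads address 0; segfaults on typical Linux builds
except BaseException:
    print("caught")      # never reached: the process is killed by the signal
"""


def run_crashing_child():
    """Run CHILD in a fresh interpreter and return (returncode, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    code, out = run_crashing_child()
    # On POSIX a negative return code means "terminated by that signal".
    print("killed by SIGSEGV:", code == -signal.SIGSEGV)
    print("except block ran:", "caught" in out)
```

The parent process is in the same position as the Spark executor here: all it learns is that the worker went away, which is why the default error message carries so little information.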

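On Spark versions below 3.2.0 you can backport this feature. I tried to cherry-pick the change onto Spark 3.0.1 and ran into many conflicts, so to be safe I merged it by hand. With the patch in place, when a Python process crashes again, the executor error log shown above turns into the following, much more detailed log: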

23/02/28 18:44:06 ERROR Executor: Exception in task 2.0 in stage 4.0 (TID 8)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault
Current thread 0x00007f247aafe740 (most recent call first):
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 922 in create_module
  File "<frozen importlib._bootstrap>", line 571 in module_from_spec
  File "<frozen importlib._bootstrap>", line 658 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 684 in _load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/imp.py", line 343 in load_dynamic
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/imp.py", line 243 in load_module
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24 in swig_import_helper
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/__init__.py", line 24 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 5 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/backend/load_backend.py", line 90 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/backend/__init__.py", line 1 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/utils/conv_utils.py", line 9 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/utils/__init__.py", line 6 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/__init__.py", line 3 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 941 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 908 in subimport

This makes troubleshooting much easier: you can clearly see which dependency caused the crash.
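With a trace this long, it can help to pull out just the third-party packages. The following is a small sketch of my own, not part of Spark; the function name third_party_packages and the "/site-packages/" heuristic are assumptions made for illustration. It scans a faulthandler dump and lists the top-level packages in the order they appear, so the first entry is the one closest to the crash.

```python
import re

# Matches faulthandler frame lines such as:
#   File "/env/lib/python3.6/site-packages/keras/__init__.py", line 3 in <module>
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+) in (?P<func>.+)')

# Heuristic (an assumption): third-party code lives under site-packages.
MARKER = "/site-packages/"


def third_party_packages(dump):
    """Return top-level site-packages names, most recent frame first, no duplicates."""
    seen = []
    for match in FRAME_RE.finditer(dump):
        path = match.group("path")
        if MARKER not in path:
            continue  # standard library or frozen importlib frame
        top = path.split(MARKER, 1)[1].split("/", 1)[0]
        if top.endswith(".py"):
            top = top[:-3]  # single-file modules installed directly in site-packages
        if top not in seen:
            seen.append(top)
    return seen
```

Applied to the trace above, the first package reported is tensorflow, followed by keras and cloudpickle, which matches what you see when reading the frames by eye: the crash happens while loading TensorFlow's native extension.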

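The segmentation-fault crash can be captured thanks to the faulthandler module, which was added in Python 3.3. Its functions dump Python tracebacks when a fault occurs, after a timeout, or when a user signal is received.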
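The general idea behind the fix can be sketched in plain Python. This is not Spark's code, only an illustration under stated assumptions: a POSIX system, a crash simulated with ctypes.string_at(0), and hypothetical names (WORKER, run_worker_and_collect). The worker enables faulthandler with a log file named after its own PID, and the parent reads that file once it notices the worker died from a signal.

```python
import os
import subprocess
import sys
import tempfile

# Source of the simulated worker. argv[1] is the directory for crash logs.
WORKER = """
import ctypes, faulthandler, os, sys
log_path = os.path.join(sys.argv[1], str(os.getpid()))
log_file = open(log_path, "w")      # must stay open while faulthandler is enabled
faulthandler.enable(file=log_file)  # fatal signals now dump the traceback here
ctypes.string_at(0)                 # stand-in for a crashing native extension
"""


def run_worker_and_collect(log_dir):
    """Start the worker, wait for it, and return (returncode, crash_dump)."""
    proc = subprocess.Popen([sys.executable, "-c", WORKER, log_dir])
    proc.wait()
    log_path = os.path.join(log_dir, str(proc.pid))
    # A negative return code on POSIX means the process was killed by a signal.
    if proc.returncode < 0 and os.path.exists(log_path):
        with open(log_path) as f:
            return proc.returncode, f.read()
    return proc.returncode, ""


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        code, dump = run_worker_and_collect(log_dir)
    print("worker exit code:", code)
    print(dump)
```

faulthandler writes straight to the file descriptor from inside the signal handler, so the dump reaches disk even though the interpreter never gets a chance to flush or clean up. On Spark 3.2 and later no patching is needed; to my knowledge the feature is off by default and is switched on with the spark.python.worker.faulthandler.enabled setting, so check the configuration documentation for your version.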

A small example from the official Python documentation:

  $ python3 -c "import ctypes; ctypes.string_at(0)"
  Segmentation fault
  $ python3 -q -X faulthandler
  >>> import ctypes
  >>> ctypes.string_at(0)
  Fatal Python error: Segmentation fault
  Current thread 0x00007fb899f39700 (most recent call first):
    File "/home/python/cpython/Lib/ctypes/__init__.py", line 486 in string_at
    File "<stdin>", line 1 in <module>
  Segmentation fault
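Besides the -X faulthandler flag used in that example, the Python documentation also describes the PYTHONFAULTHANDLER environment variable and the faulthandler.enable() call. The sketch below checks the first two from a parent process; the helper name faulthandler_enabled_with is hypothetical, and the code clears related variables from the inherited environment so the baseline is predictable.

```python
import os
import subprocess
import sys

# One-liner run in a child interpreter: report whether faulthandler is active.
CHECK = "import faulthandler; print(faulthandler.is_enabled())"


def faulthandler_enabled_with(extra_args=(), extra_env=None):
    """Start a fresh interpreter and report whether faulthandler is enabled in it."""
    env = dict(os.environ)
    # Drop settings that would switch faulthandler on behind our back.
    env.pop("PYTHONFAULTHANDLER", None)
    env.pop("PYTHONDEVMODE", None)
    env.update(extra_env or {})
    proc = subprocess.run(
        [sys.executable, *extra_args, "-c", CHECK],
        capture_output=True,
        text=True,
        env=env,
    )
    return proc.stdout.strip() == "True"


if __name__ == "__main__":
    print("default:", faulthandler_enabled_with())
    print("-X faulthandler:", faulthandler_enabled_with(["-X", "faulthandler"]))
    print("PYTHONFAULTHANDLER=1:", faulthandler_enabled_with(extra_env={"PYTHONFAULTHANDLER": "1"}))
```

The environment variable is handy when you cannot change how the interpreter is launched, while calling faulthandler.enable() in code lets you choose the output file.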

If you are interested, see: faulthandler (Dump the Python traceback), Python 3.11.2 documentation
