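On Spark versions below 3.2.0, if a Python UDF crashes at runtime with a segmentation fault, the Spark executor's error log is vague and shows only a single line:
Python worker exited unexpectedly (crashed)
A segmentation fault kills the interpreter process with a signal (and usually a core dump), so ordinary language-level try/catch exception handling never sees it, which makes troubleshooting very unfriendly. This problem is fixed in Spark 3.2; for details see the issue: [SPARK-36062] Try to capture faulthanlder when a Python worker crashes. - ASF JIRA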
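For readers already on Spark 3.2.0 or later, note that this capture is switched off by default and is controlled by the `spark.python.worker.faulthandler.enabled` setting. The sketch below only assembles a `spark-submit` command line that turns it on; the helper `spark_submit_args` and the script name `my_job.py` are hypothetical names used for illustration.

```python
# Config key available since Spark 3.2.0; it defaults to false.
FAULTHANDLER_CONF = {"spark.python.worker.faulthandler.enabled": "true"}


def spark_submit_args(app_path, conf=FAULTHANDLER_CONF):
    """Build a spark-submit argument list that passes each setting via --conf."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)  # hypothetical application script
    return args


print(" ".join(spark_submit_args("my_job.py")))
```

The same key can also be placed in `spark-defaults.conf` or passed to `SparkSession.builder.config(...)` before the session is created.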
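On Spark versions below 3.2.0, this feature can be backported. I tried to cherry-pick the change onto Spark 3.0.1 and ran into many conflicts, so to be safe I merged it by hand. With the backport in place, when the Python process crashes again, the executor error log shown above turns into the following, much more detailed log: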
- 23/02/28 18:44:06 ERROR Executor: Exception in task 2.0 in stage 4.0 (TID 8)
- org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault
-
- Current thread 0x00007f247aafe740 (most recent call first):
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap_external>", line 922 in create_module
- File "<frozen importlib._bootstrap>", line 571 in module_from_spec
- File "<frozen importlib._bootstrap>", line 658 in _load_unlocked
- File "<frozen importlib._bootstrap>", line 684 in _load
- File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/imp.py", line 343 in load_dynamic
- File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/imp.py", line 243 in load_module
- File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24 in swig_import_helper
- File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28 in <module>
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap_external>", line 678 in exec_module
- File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
- File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
- File "<frozen importlib._bootstrap>", line 971 in _find_and_load
- File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58 in <module>
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap_external>", line 678 in exec_module
- File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
- File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
- File "<frozen importlib._bootstrap>", line 971 in _find_and_load
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
- File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49 in <module>
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap_external>", line 678 in exec_module
- File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
- File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
- File "<frozen importlib._bootstrap>", line 971 in _find_and_load
- File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/__init__.py", line 24 in <module>
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap_external>", line 678 in exec_module
- File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
- File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
- File "<frozen importlib._bootstrap>", line 971 in _find_and_load
- File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 5 in <module>
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap_external>", line 678 in exec_module
- File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
- File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
- File "<frozen importlib._bootstrap>", line 971 in _find_and_load
- File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/backend/load_backend.py", line 90 in <module>
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap_external>", line 678 in exec_module
- File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
- File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
- File "<frozen importlib._bootstrap>", line 971 in _find_and_load
- File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/backend/__init__.py", line 1 in <module>
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap_external>", line 678 in exec_module
- File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
- File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
- File "<frozen importlib._bootstrap>", line 971 in _find_and_load
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
- File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/utils/conv_utils.py", line 9 in <module>
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap_external>", line 678 in exec_module
- File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
- File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
- File "<frozen importlib._bootstrap>", line 971 in _find_and_load
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
- File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/utils/__init__.py", line 6 in <module>
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap_external>", line 678 in exec_module
- File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
- File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
- File "<frozen importlib._bootstrap>", line 971 in _find_and_load
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
- File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/__init__.py", line 3 in <module>
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap_external>", line 678 in exec_module
- File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
- File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
- File "<frozen importlib._bootstrap>", line 971 in _find_and_load
- File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
- File "<frozen importlib._bootstrap>", line 941 in _find_and_load_unlocked
- File "<frozen importlib._bootstrap>", line 971 in _find_and_load
- File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 908 in subimport
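This makes troubleshooting much easier: you can clearly see which dependency caused the crash. In the log above, the segmentation fault happens while keras imports TensorFlow and TensorFlow loads its native extension (pywrap_tensorflow_internal).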
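When such a dump is long, a small helper can point at the suspect package automatically. The following is a minimal sketch, assuming the traceback uses the `File "...", line N in func` layout shown above and that third-party code lives under a `site-packages/` directory. The function name `first_site_package` and the shortened `/env/...` paths in the sample are made up for illustration.

```python
import re
from typing import Optional

# Matches one faulthandler frame line, for example:
#   File "/path/to/module.py", line 24 in swig_import_helper
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+) in (?P<func>\S+)')

MARKER = "/site-packages/"


def first_site_package(traceback_text: str) -> Optional[str]:
    """Return the top-level package of the most recent site-packages frame."""
    # Frames are listed most recent call first, so the first hit is the
    # third-party code closest to the crash.
    for match in FRAME_RE.finditer(traceback_text):
        path = match.group("path")
        if MARKER in path:
            name = path.split(MARKER, 1)[1].split("/", 1)[0]
            # Single-file modules such as six.py have no package directory.
            return name[:-3] if name.endswith(".py") else name
    return None


# Shortened, hypothetical paths modelled on the log above.
sample = '''Fatal Python error: Segmentation fault

Current thread 0x00007f247aafe740 (most recent call first):
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "/env/lib/python3.6/imp.py", line 343 in load_dynamic
  File "/env/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24 in swig_import_helper
  File "/env/lib/python3.6/site-packages/keras/__init__.py", line 3 in <module>
'''

print(first_site_package(sample))  # prints: tensorflow
```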
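Capturing a segmentation-fault crash is possible thanks to the faulthandler module, which was added in Python 3.3. Its functions can dump Python tracebacks on a fault, after a timeout, or on a user signal.
A small example from the official Python documentation: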
- python3 -c "import ctypes; ctypes.string_at(0)"
- Segmentation fault
-
- python3 -q -X faulthandler
- >>> import ctypes
- >>> ctypes.string_at(0)
- Fatal Python error: Segmentation fault
-
- Current thread 0x00007fb899f39700 (most recent call first):
- File "/home/python/cpython/Lib/ctypes/__init__.py", line 486 in string_at
- File "<stdin>", line 1 in <module>
- Segmentation fault
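The same capture can be done programmatically with `faulthandler.enable(file=...)`, which is the general idea behind reporting a crashed worker's traceback from another process. The following is a simplified sketch of that idea on a POSIX system, not Spark's actual code: a child process stands in for the Python worker, and the names `CHILD_CODE`, `run_and_capture_crash` and `faulthandler.log` are illustrative.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Code run in the child process (stands in for a Python worker).
CHILD_CODE = textwrap.dedent("""
    import ctypes, faulthandler, sys
    # Send fatal-signal tracebacks to the file given on the command line.
    log_file = open(sys.argv[1], "w")
    faulthandler.enable(file=log_file)
    ctypes.string_at(0)  # read address 0 to force a segmentation fault
""")


def run_and_capture_crash():
    """Run the crashing child and return (exit code, faulthandler dump)."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        log_path = os.path.join(tmp_dir, "faulthandler.log")
        proc = subprocess.run([sys.executable, "-c", CHILD_CODE, log_path])
        # The child is dead, but the dump it wrote survives in the file.
        with open(log_path) as f:
            return proc.returncode, f.read()


exit_code, dump = run_and_capture_crash()
print("child exit code:", exit_code)
print(dump)
```

Writing to a file rather than stderr lets the parent pick up the traceback after the child has already died, and the parent can then attach it to whatever error it raises.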