
Problems encountered running RayJob, No. 5: ConnectTimeoutError_connectionerror: ray client connection timeout

connectionerror: ray client connection timeout

Submitting a RayJob with KubeRay

0. Background

The problem occurs with both kuberay-operator 0.4.0 and 0.5.1.

1. Problem

Submit the job:

 kubectl apply -f ray_v1alpha1_rayjob.yaml

The error reported:

  Message:                runtime_env setup failed: Failed to set up runtime environment.
Could not create the actor because its associated runtime env failed to be created.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 358, in _create_runtime_env_with_retry
    runtime_env_setup_task, timeout=setup_timeout_seconds
  File "/home/ray/anaconda3/lib/python3.7/asyncio/tasks.py", line 442, in wait_for
    return fut.result()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 313, in _setup_runtime_env
    runtime_env, plugin, uri_cache, context, per_job_logger
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/plugin.py", line 252, in create_for_plugin_if_needed
    size_bytes = await plugin.create(uri, runtime_env, context, logger=logger)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py", line 473, in create
    return await task
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py", line 458, in _create_for_hash
    logger,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py", line 366, in _run
    logger,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py", line 337, in _install_pip_packages
    await check_output_cmd(pip_install_cmd, logger=logger, cwd=cwd, env=pip_env)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/utils.py", line 102, in check_output_cmd
    proc.returncode, cmd, output=stdout, cmd_index=cmd_index
ray._private.runtime_env.utils.SubprocessCalledProcessError: Run cmd[9] failed with the following details.
Command '['/tmp/ray/session_2023-05-10_05-14-44_955106_9/runtime_resources/pip/8e854faf362176fed606a9d5bd769405d9af9422/virtualenv/bin/python', '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', '-r', '/tmp/ray/session_2023-05-10_05-14-44_955106_9/runtime_resources/pip/8e854faf362176fed606a9d5bd769405d9af9422/requirements.txt']' returned non-zero exit status 1.
Last 50 lines of stdout:
    WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41290>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/
    WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41590>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/
    WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41950>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/
    WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41c90>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/
    WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fada5c090>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/
    ERROR: Could not find a version that satisfies the requirement requests==2.26.0 (from versions: none)
    ERROR: No matching distribution found for requests==2.26.0
	
	
	
	2023-05-10T12:26:28.136Z        INFO    controllers.RayJob      RayJob associated rayCluster found      {"rayjob": "rayjob-sample", "raycluster": "default/rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:26:28.139Z        DEBUG   controllers.RayJob      RayJob information      {"RayJob": "rayjob-sample", "jobInfo": {"status":"FAILED","entrypoint":"python /home/ray/samples/sample_code.py","message":"runtime_env setup failed: Failed to set up runtime environment.\nCould not create the actor because its associated runtime env failed to be created.\nTraceback (most recent call last):\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py\", line 358, in _create_runtime_env_with_retry\n    runtime_env_setup_task, timeout=setup_timeout_seconds\n  File \"/home/ray/anaconda3/lib/python3.7/asyncio/tasks.py\", line 442, in wait_for\n    return fut.result()\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py\", line 313, in _setup_runtime_env\n    runtime_env, plugin, uri_cache, context, per_job_logger\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/plugin.py\", line 252, in create_for_plugin_if_needed\n    size_bytes = await plugin.create(uri, runtime_env, context, logger=logger)\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py\", line 473, in create\n    return await task\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py\", line 458, in _create_for_hash\n    logger,\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py\", line 366, in _run\n    logger,\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py\", line 337, in _install_pip_packages\n    await check_output_cmd(pip_install_cmd, logger=logger, cwd=cwd, env=pip_env)\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/utils.py\", line 102, in check_output_cmd\n    proc.returncode, cmd, output=stdout, 
cmd_index=cmd_index\nray._private.runtime_env.utils.SubprocessCalledProcessError: Run cmd[9] failed with the following details.\nCommand '['/tmp/ray/session_2023-05-10_05-14-44_955106_9/runtime_resources/pip/8e854faf362176fed606a9d5bd769405d9af9422/virtualenv/bin/python', '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', '-r', '/tmp/ray/session_2023-05-10_05-14-44_955106_9/runtime_resources/pip/8e854faf362176fed606a9d5bd769405d9af9422/requirements.txt']' returned non-zero exit status 1.\nLast 50 lines of stdout:\n    WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41290>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/\n    WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41590>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/\n    WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41950>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/\n    WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41c90>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/\n    WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 
'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fada5c090>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/\n    ERROR: Could not find a version that satisfies the requirement requests==2.26.0 (from versions: none)\n    ERROR: No matching distribution found for requests==2.26.0\n","start_time":1683721501109,"end_time":1683721586007}, "rayJobInstance": "PENDING"}
2023-05-10T12:26:28.139Z        INFO    controllers.RayJob      Update status from PENDING to FAILED    {"rayjob": "rayjob-sample-q64fr"}
2023-05-10T12:26:28.139Z        INFO    controllers.RayJob      UpdateState     {"oldJobStatus": "PENDING", "newJobStatus": "FAILED", "oldJobDeploymentStatus": "Running", "newJobDeploymentStatus": "Running"}
2023-05-10T12:26:28.194Z        INFO    controllers.RayJob      reconciling RayJob      {"NamespacedName": "default/rayjob-sample"}
2023-05-10T12:26:28.194Z        INFO    controllers.RayJob      RayJob associated rayCluster found      {"rayjob": "rayjob-sample", "raycluster": "default/rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:26:28.197Z        DEBUG   controllers.RayJob      RayJob information      {"RayJob": "rayjob-sample", "jobInfo": {"status":"FAILED","entrypoint":"python /home/ray/samples/sample_code.py","message":"runtime_env setup failed: Failed to set up runtime environment.\nCould not create the actor because its associated runtime env failed to be created.\nTraceback (most recent call last):\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py\", line 358, in _create_runtime_env_with_retry\n    runtime_env_setup_task, timeout=setup_timeout_seconds\n  File \"/home/ray/anaconda3/lib/python3.7/asyncio/tasks.py\", line 442, in wait_for\n    return fut.result()\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py\", line 313, in _setup_runtime_env\n    runtime_env, plugin, uri_cache, context, per_job_logger\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/plugin.py\", line 252, in create_for_plugin_if_needed\n    size_bytes = await plugin.create(uri, runtime_env, context, logger=logger)\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py\", line 473, in create\n    return await task\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py\", line 458, in _create_for_hash\n    logger,\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py\", line 366, in _run\n    logger,\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py\", line 337, in _install_pip_packages\n    await check_output_cmd(pip_install_cmd, logger=logger, cwd=cwd, env=pip_env)\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/utils.py\", line 102, in check_output_cmd\n    proc.returncode, cmd, output=stdout, 
cmd_index=cmd_index\nray._private.runtime_env.utils.SubprocessCalledProcessError: Run cmd[9] failed with the following details.\nCommand '['/tmp/ray/session_2023-05-10_05-14-44_955106_9/runtime_resources/pip/8e854faf362176fed606a9d5bd769405d9af9422/virtualenv/bin/python', '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', '-r', '/tmp/ray/session_2023-05-10_05-14-44_955106_9/runtime_resources/pip/8e854faf362176fed606a9d5bd769405d9af9422/requirements.txt']' returned non-zero exit status 1.\nLast 50 lines of stdout:\n    WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41290>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/\n    WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41590>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/\n    WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41950>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/\n    WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41c90>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/\n    WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 
'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fada5c090>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/\n    ERROR: Could not find a version that satisfies the requirement requests==2.26.0 (from versions: none)\n    ERROR: No matching distribution found for requests==2.26.0\n","start_time":1683721501109,"end_time":1683721586007}, "rayJobInstance": "FAILED"}


2. Analysis

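Because of company network restrictions, containers in the k8s cluster are not allowed to reach the external network. As a result, pip cannot resolve the package index host (the log shows `Temporary failure in name resolution`), finds no candidate versions, and fails to install `requests==2.26.0`.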
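To confirm this diagnosis from inside a pod, a quick name-resolution check can help. The following is a minimal sketch, not part of the original troubleshooting: the function name `can_resolve` is hypothetical, and the host to test is an assumption, since the log only shows the path `/simple/requests/` and not the index host (pip's default index is `pypi.org`).

```python
import socket


def can_resolve(host, port=443, resolver=socket.getaddrinfo):
    """Return True if `host` resolves to at least one address.

    `resolver` is injectable only so the function can be exercised
    without real network access; by default it is socket.getaddrinfo.
    """
    try:
        return len(resolver(host, port)) > 0
    except OSError:
        # socket.gaierror (e.g. errno -3, "Temporary failure in name
        # resolution") is a subclass of OSError.
        return False
```

Calling `can_resolve("pypi.org")` from a Python shell inside the Ray head pod should return `False` if the container has the same DNS restriction that the pip log reports.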

3. Solutions

Option 1

Open up network access from inside the containers by configuring a proxy for external access.
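One common way to do this is to set the standard proxy environment variables on the Ray containers, since pip honors `HTTP_PROXY` and `HTTPS_PROXY`. The sketch below only builds the Kubernetes container `env` entries. The proxy address, the `NO_PROXY` suffixes, and the helper name `proxy_env_entries` are all hypothetical, and whether the pip step of the runtime environment picks up these variables in a given Ray version is an assumption that should be verified.

```python
import json


def proxy_env_entries(proxy_url, no_proxy):
    """Build Kubernetes container `env` entries for the proxy variables."""
    values = {
        "HTTP_PROXY": proxy_url,
        "HTTPS_PROXY": proxy_url,
        # Hosts that should bypass the proxy, e.g. in-cluster traffic.
        "NO_PROXY": ",".join(no_proxy),
    }
    return [{"name": name, "value": value} for name, value in values.items()]


entries = proxy_env_entries(
    "http://proxy.example.internal:3128",  # hypothetical proxy address
    ["localhost", "127.0.0.1", ".svc", ".cluster.local"],  # assumed suffixes
)
print(json.dumps(entries, indent=2))
```

The printed entries would go under the `env` list of the Ray head and worker containers in the cluster spec. Excluding in-cluster addresses through `NO_PROXY` is intended to keep traffic between Ray pods from being routed through the proxy.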

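Option 2

Remove the step that pip-installs dependencies. This only suits jobs that need no extra packages; if packages are required, Option 1 is still necessary.

Modify the runtime environment to drop the pip package installation. The runtimeEnv value is base64-encoded.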

{
    "pip": [ ],
    "env_vars": {"counter_name": "test_counter"}
}


After base64 encoding:

ewogICAgInBpcCI6IFsgXSwKICAgICJlbnZfdmFycyI6IHsiY291bnRlcl9uYW1lIjogInRlc3RfY291bnRlciJ9Cn0K
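The encoded value can be reproduced and checked with a short script. This is a minimal sketch using only the Python standard library; the helper names `encode_runtime_env` and `decode_runtime_env` are hypothetical. The encoded string depends on the exact whitespace of the JSON text, so the text below uses four-space indentation and a trailing newline.

```python
import base64
import json

# Runtime environment with an empty pip list. Whitespace is significant:
# a different layout yields a different (but equally valid) encoding.
RUNTIME_ENV_TEXT = (
    '{\n'
    '    "pip": [ ],\n'
    '    "env_vars": {"counter_name": "test_counter"}\n'
    '}\n'
)


def encode_runtime_env(text):
    """Validate the JSON text, then return it base64-encoded."""
    json.loads(text)  # raises ValueError if the text is not valid JSON
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_runtime_env(encoded):
    """Decode a base64 runtimeEnv value back into a Python dict."""
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


encoded = encode_runtime_env(RUNTIME_ENV_TEXT)
print(encoded)
print(decode_runtime_env(encoded))
```

Decoding an existing runtimeEnv value the same way is a convenient check of what a RayJob manifest will actually ask Ray to install.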

Where to change it:
[Screenshot of the modified location; image not preserved]

Resubmitting then succeeded:

root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob# kubectl apply -f ray_v1alpha1_rayjob.yaml
rayjob.ray.io/rayjob-sample created
configmap/ray-job-code-sample created