Submitting a RayJob with KubeRay
This problem occurs with both kuberay-operator 0.4.0 and 0.5.1.
kubectl apply -f ray_v1alpha1_rayjob.yaml
Message: runtime_env setup failed: Failed to set up runtime environment.
Could not create the actor because its associated runtime env failed to be created.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 358, in _create_runtime_env_with_retry
    runtime_env_setup_task, timeout=setup_timeout_seconds
  File "/home/ray/anaconda3/lib/python3.7/asyncio/tasks.py", line 442, in wait_for
    return fut.result()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 313, in _setup_runtime_env
    runtime_env, plugin, uri_cache, context, per_job_logger
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/plugin.py", line 252, in create_for_plugin_if_needed
    size_bytes = await plugin.create(uri, runtime_env, context, logger=logger)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py", line 473, in create
    return await task
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py", line 458, in _create_for_hash
    logger,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py", line 366, in _run
    logger,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/pip.py", line 337, in _install_pip_packages
    await check_output_cmd(pip_install_cmd, logger=logger, cwd=cwd, env=pip_env)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env/utils.py", line 102, in check_output_cmd
    proc.returncode, cmd, output=stdout, cmd_index=cmd_index
ray._private.runtime_env.utils.SubprocessCalledProcessError: Run cmd[9] failed with the following details.
Command '['/tmp/ray/session_2023-05-10_05-14-44_955106_9/runtime_resources/pip/8e854faf362176fed606a9d5bd769405d9af9422/virtualenv/bin/python', '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', '-r', '/tmp/ray/session_2023-05-10_05-14-44_955106_9/runtime_resources/pip/8e854faf362176fed606a9d5bd769405d9af9422/requirements.txt']' returned non-zero exit status 1.
Last 50 lines of stdout:
    WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41290>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/
    WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41590>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/
    WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41950>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/
    WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fadc41c90>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/
    WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f1fada5c090>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/requests/
    ERROR: Could not find a version that satisfies the requirement requests==2.26.0 (from versions: none)
    ERROR: No matching distribution found for requests==2.26.0
2023-05-10T12:26:28.136Z INFO controllers.RayJob RayJob associated rayCluster found {"rayjob": "rayjob-sample", "raycluster": "default/rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:26:28.139Z DEBUG controllers.RayJob RayJob information {"RayJob": "rayjob-sample", "jobInfo": {"status":"FAILED","entrypoint":"python /home/ray/samples/sample_code.py","message":"runtime_env setup failed: Failed to set up runtime environment. [same traceback as above]","start_time":1683721501109,"end_time":1683721586007}, "rayJobInstance": "PENDING"}
2023-05-10T12:26:28.139Z INFO controllers.RayJob Update status from PENDING to FAILED {"rayjob": "rayjob-sample-q64fr"}
2023-05-10T12:26:28.139Z INFO controllers.RayJob UpdateState {"oldJobStatus": "PENDING", "newJobStatus": "FAILED", "oldJobDeploymentStatus": "Running", "newJobDeploymentStatus": "Running"}
2023-05-10T12:26:28.194Z INFO controllers.RayJob reconciling RayJob {"NamespacedName": "default/rayjob-sample"}
2023-05-10T12:26:28.194Z INFO controllers.RayJob RayJob associated rayCluster found {"rayjob": "rayjob-sample", "raycluster": "default/rayjob-sample-raycluster-9jtn6"}
2023-05-10T12:26:28.197Z DEBUG controllers.RayJob RayJob information {"RayJob": "rayjob-sample", "jobInfo": {"status":"FAILED","entrypoint":"python /home/ray/samples/sample_code.py","message":"runtime_env setup failed: Failed to set up runtime environment. [same traceback as above]","start_time":1683721501109,"end_time":1683721586007}, "rayJobInstance": "FAILED"}
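Because of company network restrictions, containers in the k8s cluster are not allowed to reach the external network. pip therefore cannot resolve the package index host, which is what the "Temporary failure in name resolution" warnings in the log indicate.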
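To confirm this diagnosis from inside a Ray container, a small name-resolution check can help. The following is only a sketch: the helper name `can_resolve` is made up for illustration, and the host names are assumed to be pip's default index hosts; replace them with your internal mirror if you use one.

```python
import socket


def can_resolve(host: str, port: int = 443) -> bool:
    """Return True if the host name resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, port)) > 0
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised when DNS lookup
        # fails, e.g. "[Errno -3] Temporary failure in name resolution".
        return False


if __name__ == "__main__":
    # Assumed hosts: pip's default index and its file host.
    for host in ("pypi.org", "files.pythonhosted.org"):
        status = "resolves" if can_resolve(host) else "does NOT resolve"
        print(f"{host}: {status}")
```

If both hosts fail to resolve inside the container but resolve on your workstation, the failure is a cluster network restriction rather than a problem with the RayJob itself.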
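Option 1: Open up the network inside the containers by configuring a proxy for external access.
Option 2: Remove the pip install step for dependencies. This only works when no dependency packages need to be installed; if they are needed, Option 1 is still required.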
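For Option 2, edit the runtime environment to remove the pip dependency installation. The runtimeEnv value is base64-encoded; the JSON to encode is: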
{
    "pip": [ ],
    "env_vars": {"counter_name": "test_counter"}
}
After base64 encoding, it becomes:
ewogICAgInBpcCI6IFsgXSwKICAgICJlbnZfdmFycyI6IHsiY291bnRlcl9uYW1lIjogInRlc3RfY291bnRlciJ9Cn0K
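The encoding step can be reproduced and checked with a few lines of Python. This is a minimal sketch: `encode_runtime_env` and `decode_runtime_env` are hypothetical helper names, and the JSON text is assumed to use a 4-space indent and a trailing newline, which is what the base64 string above decodes to.

```python
import base64
import json

# The JSON text as encoded above (4-space indent, trailing newline).
RUNTIME_ENV_TEXT = (
    '{\n'
    '    "pip": [ ],\n'
    '    "env_vars": {"counter_name": "test_counter"}\n'
    '}\n'
)


def encode_runtime_env(text: str) -> str:
    """Validate the JSON text, then return it base64-encoded."""
    json.loads(text)  # raises ValueError early if the JSON is malformed
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_runtime_env(encoded: str) -> dict:
    """Decode a base64 runtimeEnv string back into a dict for inspection."""
    return json.loads(base64.b64decode(encoded))


encoded = encode_runtime_env(RUNTIME_ENV_TEXT)
print(encoded)
print(decode_runtime_env(encoded))
```

Decoding the value before submitting is a cheap way to catch a typo in the JSON, since whitespace differences change the base64 output but not the decoded dict.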
Where the change was made:
After submitting again, the job was created successfully:
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob# kubectl apply -f ray_v1alpha1_rayjob.yaml
rayjob.ray.io/rayjob-sample created
configmap/ray-job-code-sample created