There are at least three ways to run a Ray job:
Approach 1: start a Ray cluster first, then submit jobs to the running cluster: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/cli.html# (a minimal CLI sketch follows this list)
Approach 2: deploy kuberay-operator, create the RayJob Kubernetes custom resource (CR), and submit a RayJob: https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayjob.md
Approach 3: Ray integrated with Volcano (using Queue and PodGroup): https://github.com/ray-project/kuberay/blob/master/docs/guidance/volcano-integration.md
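For reference, approach 1 boils down to the Ray Jobs CLI. A minimal sketch (the dashboard address, working directory, and script name my_script.py are placeholders, not taken from the docs above):

# start a head node, or port-forward an existing cluster's dashboard to localhost:8265
ray start --head --dashboard-host 0.0.0.0
# submit a job through the Jobs API served by the dashboard
ray job submit --address http://127.0.0.1:8265 --working-dir . -- python my_script.py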
Main code change (PR):
https://github.com/ray-project/kuberay/pull/755/files#diff-58d661db0d2307b4cd362b636f6e8753ac85b0dbc19c1d91df1f41c6ee5e826b
This post mainly walks through the third approach.
PodGroup introduction: https://volcano.sh/zh/docs/podgroup/
Prerequisite: a Kubernetes cluster. This post runs on Huawei Cloud CCE, so Kubernetes is already available and Helm charts, add-ons, etc. are supported.
Install kubectl, helm, and related tools locally.
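One common way to get both tools onto a Linux machine (commands follow the upstream kubectl and Helm install docs; adjust OS/architecture as needed):

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash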
Install the kuberay-operator 0.5.1 Helm chart (image version 0.5.0).
Install the kuberay-apiserver 0.5.1 Helm chart (image version 0.5.0).
The job uses the rayproject/ray:2.4.0 image.
Enable the batch scheduler when installing the operator:
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.1 --set batchScheduler.enabled=true
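For completeness, adding the chart repo and installing the apiserver might look like this (assuming the standard KubeRay Helm repo URL and default release names):

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-apiserver kuberay/kuberay-apiserver --version 0.5.1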
Command:
kubectl apply -f createQueue.yaml
File content (createQueue.yaml):
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: kuberay-test-queue
spec:
  weight: 1
  capability:
    cpu: 4
    memory: 6Gi
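A quick sanity check that the queue exists and is open (just a sketch; the detailed -o yaml output appears further below):

kubectl get queue kuberay-test-queue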
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl apply -f createRayCluster.yaml
raycluster.ray.io/test-cluster-0 created
File content (createRayCluster.yaml):
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: test-cluster-0
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: kuberay-test-queue
spec:
  rayVersion: '2.4.0'
  headGroupSpec:
    rayStartParams: {}
    replicas: 1
    serviceType: "ClusterIP"
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - name: ray-head
          image: swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
          resources:
            limits:
              cpu: "1"
              memory: "2Gi"
            requests:
              cpu: "1"
              memory: "2Gi"
  workerGroupSpecs:
  - groupName: worker
    rayStartParams: {}
    replicas: 2
    minReplicas: 2
    maxReplicas: 2
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - name: ray-head
          image: swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
            requests:
              cpu: "1"
              memory: "1Gi"
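Before inspecting the PodGroup, one way to watch the whole gang of pods come up is to filter on the ray.io/cluster label that KubeRay puts on every pod (visible in the describe output later); the cluster name here assumes the manifest above:

kubectl get pods -l ray.io/cluster=test-cluster-0 -o wide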
Check the PodGroups:
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl get podgroup ray-test-cluster-1-pg -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2023-05-11T07:05:58Z"
  generation: 285
  name: ray-test-cluster-1-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RayCluster
    name: test-cluster-1
    uid: 1fa1e9e8-0e9c-4f36-a10d-9a88abb32853
  resourceVersion: "54132769"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/namespaces/default/podgroups/ray-test-cluster-1-pg
  uid: c12a3abd-1dec-41f2-9a9a-2c85e5a74f7a
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4Gi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2023-05-11T12:15:26Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: f6dd2e6b-b98c-46be-8b0b-57862c2dcd90
    type: Scheduled
  phase: Running
  running: 3
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl get podgroup ray-test-cluster-3-pg -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2023-05-11T12:14:39Z"
  generation: 4
  name: ray-test-cluster-3-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RayCluster
    name: test-cluster-3
    uid: d4878879-5635-459e-8678-ab668abfbd2b
  resourceVersion: "54132373"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/namespaces/default/podgroups/ray-test-cluster-3-pg
  uid: 139d23a7-0260-4eac-9ba9-b8599aab6eab
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4Gi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2023-05-11T12:14:50Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: f30b3387-e00f-4481-995e-b175a424ea47
    type: Scheduled
  phase: Running
  running: 3
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl get podgroup ray-test-cluster-0-pg -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2023-05-11T04:08:53Z"
  generation: 465
  name: ray-test-cluster-0-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RayCluster
    name: test-cluster-0
    uid: a0233819-6e8e-4555-9d3f-13d5d1b2301a
  resourceVersion: "54145929"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/namespaces/default/podgroups/ray-test-cluster-0-pg
  uid: 29585663-7352-44c5-8ae8-7575f1cb6937
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4Gi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2023-05-11T12:34:17Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: 8d0c920b-f503-4f51-afd2-4cd9dca751ae
    type: Scheduled
  phase: Running
  running: 3
Three Ray clusters are now up.
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl get queue kuberay-test-queue -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"scheduling.volcano.sh/v1beta1","kind":"Queue","metadata":{"annotations":{},"name":"kuberay-test-queue"},"spec":{"capability":{"cpu":4,"memory":"6Gi"},"weight":1}}
  creationTimestamp: "2023-05-11T04:04:44Z"
  generation: 1
  name: kuberay-test-queue
  resourceVersion: "54132374"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/queues/kuberay-test-queue
  uid: dfb0d4af-f899-4c48-ac8c-7bcd5d0016b7
spec:
  capability:
    cpu: 4
    memory: 6Gi
  reclaimable: true
  weight: 1
status:
  allocated:
    cpu: "9"
    memory: 12Gi
  reservation: {}
  running: 3
  state: Open
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl describe pod test-cluster-4-worker-worker-kvrzt
Name:         test-cluster-4-worker-worker-kvrzt
Namespace:    default
Priority:     0
Node:         172.18.154.132/172.18.154.132
Start Time:   Thu, 11 May 2023 20:37:09 +0800
Labels:       app.kubernetes.io/created-by=kuberay-operator
              app.kubernetes.io/name=kuberay
              ray.io/cluster=test-cluster-4
              ray.io/cluster-dashboard=test-cluster-4-dashboard
              ray.io/group=worker
              ray.io/identifier=test-cluster-4-worker
              ray.io/is-ray-node=yes
              ray.io/node-type=worker
              volcano.sh/queue-name=kuberay-test-queue
Annotations:  kubernetes.io/psp: psp-global
              ray.io/ft-enabled: false
              ray.io/health-state:
              scheduling.k8s.io/group-name: ray-test-cluster-4-pg
Status:       Pending
IP:
IPs:          <none>
Controlled By:  RayCluster/test-cluster-4
Init Containers:
  wait-gcs-ready:
    Container ID:
    Image:         swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -lc
      --
    Args:
      until ray health-check --address test-cluster-4-head-svc.default.svc.cluster.local:6379 > /dev/null 2>&1; do echo wait for GCS to be ready; sleep 5; done
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      FQ_RAY_IP:  test-cluster-4-head-svc.default.svc.cluster.local
      RAY_IP:     test-cluster-4-head-svc
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-spw4v (ro)
Containers:
  ray-head:
    Container ID:
    Image:         swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
    Image ID:
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
      -lc
      --
    Args:
      ulimit -n 65536; ray start --address=test-cluster-4-head-svc.default.svc.cluster.local:6379 --metrics-export-port=8080 --block --num-cpus=1 --memory=1073741824
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     1
      memory:  1Gi
    Environment:
      FQ_RAY_IP:                       test-cluster-4-head-svc.default.svc.cluster.local
      RAY_IP:                          test-cluster-4-head-svc
      RAY_CLUSTER_NAME:                 (v1:metadata.labels['ray.io/cluster'])
      RAY_PORT:                        6379
      RAY_ADDRESS:                     test-cluster-4-head-svc.default.svc.cluster.local:6379
      RAY_USAGE_STATS_KUBERAY_IN_USE:  1
      REDIS_PASSWORD:
    Mounts:
      /dev/shm from shared-mem (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-spw4v (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  shared-mem:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  1Gi
  default-token-spw4v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-spw4v
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason                 Age  From     Message
  ----    ------                 ---- ----     -------
  Normal  Scheduled              42s  volcano  Successfully assigned default/test-cluster-4-worker-worker-kvrzt to 172.18.154.132
  Normal  SuccessfulMountVolume  41s  kubelet  Successfully mounted volumes for pod "test-cluster-4-worker-worker-kvrzt_default(9ecf1c20-b145-4eae-978b-a6181b5a21c5)"
  Normal  Pulling                41s  kubelet  Pulling image "swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0"
  Normal  Pulled                 16s  kubelet  Successfully pulled image "swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0" in 25.037677108s
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno#
This does not match what I expected. With the queue capability set to 4 CPUs / 6Gi, the second RayCluster should already have gone Pending, yet all three clusters were created with nothing Pending; everything is Running, and the queue status above even shows allocated cpu "9" / memory 12Gi, well beyond the capability. Could this be a Volcano version issue?
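One way to start ruling that out is to check which Volcano image is actually running; this assumes Volcano was installed into the usual volcano-system namespace with the default volcano-scheduler deployment name:

kubectl -n volcano-system get deployment volcano-scheduler -o jsonpath='{.spec.template.spec.containers[0].image}'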
KubeRay's Volcano integration only demonstrates integration with Queue and PodGroup; it does not demonstrate integration with Volcano Job (vcjob).
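For contrast, this is roughly what a standalone Volcano Job (vcjob) submitted to the same queue would look like; it is only a minimal sketch, not something KubeRay generates:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vcjob-sample
spec:
  schedulerName: volcano
  queue: kuberay-test-queue
  minAvailable: 1
  tasks:
  - replicas: 1
    name: test
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: busybox
          image: busybox
          command: ["sh", "-c", "echo hello from vcjob"]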