
Learning and Experiencing Ray on Volcano


1. Preparation

Concepts:

There are at least three ways to run a Ray job:

The first: start a Ray cluster, then submit jobs to the running cluster: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/cli.html#
The second: deploy kuberay-operator, create a RayJob Kubernetes custom resource (CR), and submit the RayJob: https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayjob.md
The third: Ray integrated with Volcano (using Queue and PodGroup): https://github.com/ray-project/kuberay/blob/master/docs/guidance/volcano-integration.md
Main code change:
https://github.com/ray-project/kuberay/pull/755/files#diff-58d661db0d2307b4cd362b636f6e8753ac85b0dbc19c1d91df1f41c6ee5e826b
This post mainly walks through the third approach.
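For reference, the first approach boils down to a single CLI call against a running cluster's dashboard address (a minimal sketch; my_script.py is a placeholder script name):

 ray job submit --address http://127.0.0.1:8265 --working-dir . -- python my_script.py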

Volcano concepts:

Introduction: https://volcano.sh/zh/docs/podgroup/

  • A Queue holds a group of PodGroups and is the basis on which that group of PodGroups is allotted cluster resources.
  • A PodGroup is a set of strongly related pods, mainly used for batch workloads, for example a group of PS and worker pods in TensorFlow. It is a Volcano custom resource type.
  • A Volcano Job (vcjob for short) is Volcano's own Job resource type. Unlike a Kubernetes Job, a vcjob offers more advanced features, such as choosing a scheduler, a minimum number of running pods, tasks, lifecycle management, a target queue, and priority-based scheduling. Volcano Job is better suited to high-performance computing scenarios such as machine learning, big data, and scientific computing (a minimal sketch follows this list).
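For context, a minimal Volcano Job looks roughly like this. This is a sketch based on the Volcano documentation; the job name, image, and command are purely illustrative:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: example-vcjob            # hypothetical name
spec:
  schedulerName: volcano
  minAvailable: 2                # gang scheduling: both pods must be schedulable together
  queue: default                 # Volcano's built-in default queue
  tasks:
    - replicas: 2
      name: worker
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: busybox     # illustrative image
              command: ["sh", "-c", "echo hello from vcjob && sleep 60"]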

Prerequisites:

An existing Kubernetes cluster. This post runs on Huawei Cloud CCE, which already provides Kubernetes and supports Helm-based add-ons.
kubectl, helm, and related tools installed locally.

Plan:

Install the kuberay-operator 0.5.1 Helm chart (image version 0.5.0).
Install the kuberay-apiserver 0.5.1 Helm chart (image version 0.5.0).
Jobs use the rayproject/ray:2.4.0 image.
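The charts can be pulled from the public KubeRay Helm repository. A rough sketch of that setup, assuming the upstream repository (adjust versions to match the plan above; the apiserver is not strictly needed for this walkthrough):

 helm repo add kuberay https://ray-project.github.io/kuberay-helm/
 helm repo update
 helm install kuberay-apiserver kuberay/kuberay-apiserver --version 0.5.1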

2. Install kuberay-operator

Enable the batch scheduler when installing the operator (the chart reference below is assumed to come from the kuberay Helm repo added above):

 helm install kuberay-operator kuberay/kuberay-operator --version 0.5.1 --set batchScheduler.enabled=true
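A quick check that the operator came up before moving on (the pod name is prefixed with the Helm release name):

 kubectl get pods | grep kuberay-operator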

3. Create a Queue

Command:

 kubectl apply -f createQueue.yaml

File contents:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: kuberay-test-queue
spec:
  weight: 1
  capability:
    cpu: 4
    memory: 6Gi
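A quick check that the Queue exists and is open before creating any clusters:

 kubectl get queue kuberay-test-queue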

4. Create a Ray cluster

root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl apply -f createRayCluster.yaml
raycluster.ray.io/test-cluster-0 created

File contents:

Compared with the upstream sample, you need to:

  • replace the image
  • specify the serviceType
  • specify imagePullSecrets
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: test-cluster-0
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: kuberay-test-queue
spec:
  rayVersion: '2.4.0'
  headGroupSpec:
    rayStartParams: {}
    replicas: 1
    serviceType: "ClusterIP"
    template:
      spec:
        imagePullSecrets:
          - name: default-secret
        containers:
        - name: ray-head
          image: swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
          resources:
            limits:
              cpu: "1"
              memory: "2Gi"
            requests:
              cpu: "1"
              memory: "2Gi"
  workerGroupSpecs:
    - groupName: worker
      rayStartParams: {}
      replicas: 2
      minReplicas: 2
      maxReplicas: 2
      template:
        spec:
          imagePullSecrets:
            - name: default-secret
          containers:
          - name: ray-head
            image: swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
            resources:
              limits:
                cpu: "1"
                memory: "1Gi"
              requests:
                cpu: "1"
                memory: "1Gi"
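Once applied, you can watch the cluster pods come up. The selector uses the ray.io/cluster label that kuberay attaches to every pod of the cluster (it is also visible in the pod description further below):

 kubectl get pods -l ray.io/cluster=test-cluster-0 -w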

Check the PodGroups:
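The operator creates one PodGroup per RayCluster, named ray-<cluster-name>-pg. A quick listing first:

 kubectl get podgroup

Then each one can be inspected in detail: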

root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl get podgroup ray-test-cluster-1-pg -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2023-05-11T07:05:58Z"
  generation: 285
  name: ray-test-cluster-1-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RayCluster
    name: test-cluster-1
    uid: 1fa1e9e8-0e9c-4f36-a10d-9a88abb32853
  resourceVersion: "54132769"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/namespaces/default/podgroups/ray-test-cluster-1-pg
  uid: c12a3abd-1dec-41f2-9a9a-2c85e5a74f7a
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4Gi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2023-05-11T12:15:26Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: f6dd2e6b-b98c-46be-8b0b-57862c2dcd90
    type: Scheduled
  phase: Running
  running: 3
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl get podgroup ray-test-cluster-3-pg -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2023-05-11T12:14:39Z"
  generation: 4
  name: ray-test-cluster-3-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RayCluster
    name: test-cluster-3
    uid: d4878879-5635-459e-8678-ab668abfbd2b
  resourceVersion: "54132373"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/namespaces/default/podgroups/ray-test-cluster-3-pg
  uid: 139d23a7-0260-4eac-9ba9-b8599aab6eab
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4Gi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2023-05-11T12:14:50Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: f30b3387-e00f-4481-995e-b175a424ea47
    type: Scheduled
  phase: Running
  running: 3
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno#
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno#
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl get podgroup ray-test-cluster-0-pg -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2023-05-11T04:08:53Z"
  generation: 465
  name: ray-test-cluster-0-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RayCluster
    name: test-cluster-0
    uid: a0233819-6e8e-4555-9d3f-13d5d1b2301a
  resourceVersion: "54145929"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/namespaces/default/podgroups/ray-test-cluster-0-pg
  uid: 29585663-7352-44c5-8ae8-7575f1cb6937
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4Gi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2023-05-11T12:34:17Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: 8d0c920b-f503-4f51-afd2-4cd9dca751ae
    type: Scheduled
  phase: Running
  running: 3

5. Inspect

Check the Queue

By this point three Ray clusters (test-cluster-0, test-cluster-1, and test-cluster-3, as shown by the PodGroups above) have been started against this queue.

root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno#  kubectl get queue kuberay-test-queue -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"scheduling.volcano.sh/v1beta1","kind":"Queue","metadata":{"annotations":{},"name":"kuberay-test-queue"},"spec":{"capability":{"cpu":4,"memory":"6Gi"},"weight":1}}
  creationTimestamp: "2023-05-11T04:04:44Z"
  generation: 1
  name: kuberay-test-queue
  resourceVersion: "54132374"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/queues/kuberay-test-queue
  uid: dfb0d4af-f899-4c48-ac8c-7bcd5d0016b7
spec:
  capability:
    cpu: 4
    memory: 6Gi
  reclaimable: true
  weight: 1
status:
  allocated:
    cpu: "9"
    memory: 12Gi
  reservation: {}
  running: 3
  state: Open

Check a pod

root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl describe pod test-cluster-4-worker-worker-kvrzt
Name:           test-cluster-4-worker-worker-kvrzt
Namespace:      default
Priority:       0
Node:           172.18.154.132/172.18.154.132
Start Time:     Thu, 11 May 2023 20:37:09 +0800
Labels:         app.kubernetes.io/created-by=kuberay-operator
                app.kubernetes.io/name=kuberay
                ray.io/cluster=test-cluster-4
                ray.io/cluster-dashboard=test-cluster-4-dashboard
                ray.io/group=worker
                ray.io/identifier=test-cluster-4-worker
                ray.io/is-ray-node=yes
                ray.io/node-type=worker
                volcano.sh/queue-name=kuberay-test-queue
Annotations:    kubernetes.io/psp: psp-global
                ray.io/ft-enabled: false
                ray.io/health-state:
                scheduling.k8s.io/group-name: ray-test-cluster-4-pg
Status:         Pending
IP:
IPs:            <none>
Controlled By:  RayCluster/test-cluster-4
Init Containers:
  wait-gcs-ready:
    Container ID:
    Image:         swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -lc
      --
    Args:
      until ray health-check --address test-cluster-4-head-svc.default.svc.cluster.local:6379 > /dev/null 2>&1; do echo wait for GCS to be ready; sleep 5; done
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      FQ_RAY_IP:  test-cluster-4-head-svc.default.svc.cluster.local
      RAY_IP:     test-cluster-4-head-svc
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-spw4v (ro)
Containers:
  ray-head:
    Container ID:
    Image:         swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
    Image ID:
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
      -lc
      --
    Args:
      ulimit -n 65536; ray start  --address=test-cluster-4-head-svc.default.svc.cluster.local:6379  --metrics-export-port=8080  --block  --num-cpus=1  --memory=1073741824
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     1
      memory:  1Gi
    Environment:
      FQ_RAY_IP:                       test-cluster-4-head-svc.default.svc.cluster.local
      RAY_IP:                          test-cluster-4-head-svc
      RAY_CLUSTER_NAME:                 (v1:metadata.labels['ray.io/cluster'])
      RAY_PORT:                        6379
      RAY_ADDRESS:                     test-cluster-4-head-svc.default.svc.cluster.local:6379
      RAY_USAGE_STATS_KUBERAY_IN_USE:  1
      REDIS_PASSWORD:
    Mounts:
      /dev/shm from shared-mem (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-spw4v (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  shared-mem:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  1Gi
  default-token-spw4v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-spw4v
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason                 Age   From     Message
  ----    ------                 ----  ----     -------
  Normal  Scheduled              42s   volcano  Successfully assigned default/test-cluster-4-worker-worker-kvrzt to 172.18.154.132
  Normal  SuccessfulMountVolume  41s   kubelet  Successfully mounted volumes for pod "test-cluster-4-worker-worker-kvrzt_default(9ecf1c20-b145-4eae-978b-a6181b5a21c5)"
  Normal  Pulling                41s   kubelet  Pulling image "swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0"
  Normal  Pulled                 16s   kubelet  Successfully pulled image "swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0" in 25.037677108s
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno#
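As an extra check, a quick job can be submitted to one of the clusters through its head service. This assumes the default kuberay head service ports, where 8265 is Ray's dashboard/job-submission port:

 kubectl port-forward svc/test-cluster-0-head-svc 8265:8265
 # in a second terminal:
 ray job submit --address http://127.0.0.1:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"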

The result does not match expectations. The queue's capability is 4 CPU and 6Gi of memory, and each Ray cluster's PodGroup requires a minimum of 3 CPU and 4Gi, so the second cluster should already have been left pending. Yet all three clusters were scheduled and are running, and the queue reports 9 CPU / 12Gi allocated against a 4 CPU / 6Gi capability. Could this be a Volcano version issue?
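One way to check which Volcano version is actually running. This assumes a default Volcano installation with the scheduler deployed as volcano-scheduler in the volcano-system namespace; on CCE the Volcano add-on may be deployed in a different namespace (e.g. kube-system):

 kubectl -n volcano-system get deployment volcano-scheduler -o jsonpath='{.spec.template.spec.containers[0].image}'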

6. Summary

KubeRay's Volcano integration currently only demonstrates the Queue and PodGroup integration; it does not demonstrate integration with Volcano Job (vcjob).
