
GPU Support for Kubernetes Clusters

1. Kubernetes GPU support by version

Kubernetes provides experimental support for managing AMD and NVIDIA GPUs spread across nodes. Support for NVIDIA GPUs was added in v1.6 and has gone through several backwards-incompatible iterations since. Support for AMD GPUs was added in v1.9 via device plugins.

Starting with v1.8, the recommended way to consume GPUs is through device plugins. To enable GPU support via device plugins on versions before v1.10, the DevicePlugins feature gate must be enabled throughout the system: --feature-gates="DevicePlugins=true". From v1.10 onwards this is no longer required.

Then the vendor's GPU driver must be installed on the nodes, and the corresponding device plugin from the GPU vendor (AMD/NVIDIA) must be running.
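For clusters older than v1.10, the gate has to be passed to the kubelet (the DevicePlugins gate is consumed by the kubelet). A minimal sketch, assuming a kubeadm-installed CentOS node where extra kubelet flags are set via /etc/sysconfig/kubelet; the file path and variable name are assumptions, so adjust them to your deployment method:

# /etc/sysconfig/kubelet  (assumed location for kubeadm RPM installs)
KUBELET_EXTRA_ARGS=--feature-gates=DevicePlugins=true

[root@k8s-01 ~]# systemctl daemon-reload && systemctl restart kubelet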

2. Deploying GPU support in a Kubernetes cluster

Kubernetes version: 1.13.5
Docker version: 18.06.1-ce
OS: CentOS 7.5
Kernel: 4.20.13-1.el7.elrepo.x86_64
NVIDIA GPU model: Quadro P4000

2.1 Install the NVIDIA driver

2.1.1 Install gcc

[root@k8s-01 ~]# yum install -y gcc

2.1.2 Download the NVIDIA driver

Download link: NVIDIA DRIVERS Linux x64 (AMD64/EM64T) Display Driver

The version downloaded here is:

[root@k8s-01 ~]# ls NVIDIA-Linux-x86_64-410.93.run -alh
-rw-r--r-- 1 root root 103M Jul 25 17:22 NVIDIA-Linux-x86_64-410.93.run
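If you prefer to fetch the driver from the command line, the direct download URL usually follows the pattern below. The exact URL is an assumption here, so confirm it on the NVIDIA driver download page before relying on it:

[root@k8s-01 ~]# wget https://us.download.nvidia.com/XFree86/Linux-x86_64/410.93/NVIDIA-Linux-x86_64-410.93.run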

2.1.3 Edit /etc/modprobe.d/blacklist.conf to block the nouveau module from loading

[root@k8s-01 ~]# echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist.conf

2.1.4 Rebuild the initramfs image

[root@k8s-01 ~]# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
[root@k8s-01 ~]# dracut /boot/initramfs-$(uname -r).img $(uname -r)
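After rebuilding the initramfs and rebooting the node, it is worth confirming that nouveau is no longer loaded; the command below should print nothing:

[root@k8s-01 ~]# lsmod | grep nouveau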

2.1.5 Run the driver installer

[root@k8s-01 ~]# sh NVIDIA-Linux-x86_64-410.93.run -a -q -s 
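Once the installer finishes, the driver can be verified directly on the host; the Quadro P4000 should show up in the output:

[root@k8s-01 ~]# nvidia-smi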

2.1.6 Install the toolkits

The driver alone is not enough; we also need some toolkits to make the GPU usable, chiefly CUDA and cuDNN.

[root@k8s-01 ~]# cat /etc/yum.repos.d/cuda.repo
[cuda]
name=cuda
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/7fa2af80.pub
[root@k8s-01 ~]#
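With the repository in place, the CUDA toolkit can be installed from it; cuDNN is not in this repository and has to be downloaded separately from NVIDIA's developer site. A minimal sketch, assuming the cuda metapackage (a pinned version such as cuda-10-0 can be used instead):

[root@k8s-01 ~]# yum install -y cuda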

2.2 Install nvidia-docker2

nvidia-docker is a wrapper around Docker that lets containers use GPUs; it is now essentially deprecated.
nvidia-docker2 instead provides a container runtime that integrates much better with Docker.

  • Fetch the nvidia-docker2 yum repository
[root@k8s-01 ~]# distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
[root@k8s-01 ~]# curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
  • List the available nvidia-docker2 versions

We need to install the nvidia-docker2 build that matches docker-18.06.1-ce; any other build is not supported with this Docker version.

[root@k8s-01 ~]# yum list nvidia-docker2 --showduplicates
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirrors.aliyun.com
 * epel: mirror01.idc.hinet.net
 * extras: mirrors.aliyun.com
 * updates: mirrors.163.com
Installed Packages
nvidia-docker2.noarch 2.0.3-1.docker18.06.1.ce @nvidia-docker
Available Packages
nvidia-docker2.noarch 2.0.0-1.docker1.12.6 nvidia-docker
nvidia-docker2.noarch 2.0.0-1.docker17.03.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.0-1.docker17.06.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.0-1.docker17.06.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.0-1.docker17.09.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker1.12.6 nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker1.13.1 nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker17.03.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker17.06.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker17.09.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker17.09.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker1.12.6 nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker1.13.1 nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker17.03.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker17.06.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker17.09.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker17.09.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker17.12.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker1.12.6 nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker1.13.1 nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.03.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.06.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.09.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.09.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.12.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.12.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.03.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.03.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.06.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.06.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.06.2 nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.2 nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.3.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.4.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-2.docker18.06.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-2.docker18.09.5.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-3.docker18.06.3.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-3.docker18.09.5.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-3.docker18.09.6.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-3.docker18.09.7.ce nvidia-docker
nvidia-docker2.noarch 2.1.0-1 nvidia-docker
nvidia-docker2.noarch 2.1.1-1 nvidia-docker
nvidia-docker2.noarch 2.2.0-1 nvidia-docker

Here, installing version 2.0.3-1.docker18.06.1.ce is sufficient.

  • Install nvidia-docker2
[root@k8s-01 ~]# yum install -y nvidia-docker2-2.0.3-1.docker18.06.1.ce
  • Set nvidia as the default Docker runtime
[root@k8s-01 ~]# cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
  • Restart Docker
[root@k8s-01 ~]# systemctl restart docker 
  • Check the Docker info
[root@k8s-01 wf-deploy]# docker info
Containers: 63
 Running: 0
 Paused: 0
 Stopped: 63
Images: 51
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc nvidia
Default Runtime: nvidia
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340-dirty (expected: 69663f0bd4b60df09991c08812a60108003fa340)
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.20.13-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.79GiB
Name: k8s-01
ID: DWPY:P2I4:NWL4:3U3O:UTGC:PLJC:IGTO:7ZXJ:A7CD:SJGT:7WT5:WNGX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 192.168.50.2
 127.0.0.0/8
Live Restore Enabled: false

The output shows that Docker's default runtime is now nvidia.
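As a quick smoke test of the new default runtime, running nvidia-smi inside a CUDA container should show the same GPU as on the host. The image tag below is only an example and may no longer be published; any CUDA image compatible with driver 410.93 (CUDA 10.0 or older) will do:

[root@k8s-01 ~]# docker run --rm nvidia/cuda:10.0-base nvidia-smi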

2.3 Install the device plugin

  • Apply the plugin's latest YAML manifest
[root@k8s-01 ~]# kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
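The manifest deploys the plugin as a DaemonSet in the kube-system namespace. A check like the following confirms that a plugin Pod is running on each GPU node (the resource names come from the upstream manifest and may differ between plugin versions):

[root@k8s-01 ~]# kubectl -n kube-system get daemonset | grep nvidia
[root@k8s-01 ~]# kubectl -n kube-system get pods -o wide | grep nvidia-device-plugin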

2.4 Check which nodes have GPUs

[root@wf-229 ~]# kubectl get node 192.18.1.26 -ojson | jq '.status.allocatable'
{
  "cpu": "48",
  "ephemeral-storage": "258961942919",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "131471388Ki",
  "nvidia.com/gpu": "1",
  "pods": "200"
}
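If jq is not available, kubectl describe shows the same information; the grep here only trims the output to the relevant block:

[root@wf-229 ~]# kubectl describe node 192.18.1.26 | grep -A 8 Allocatable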

2.5 Create a Pod that requests GPU resources

[root@wf-229 gpu]# cat test.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s-app: nginx-pod
  name: nginx-pod
spec:
  containers:
  - image: nginx:latest
    imagePullPolicy: Always
    name: nginx
    ports:
    - containerPort: 80
      name: nginx
      protocol: TCP
    resources:
      limits:
        nvidia.com/gpu: "1"
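Create the Pod from this file and check that it has been scheduled onto the GPU node:

[root@wf-229 gpu]# kubectl apply -f test.yaml
[root@wf-229 gpu]# kubectl get pod nginx-pod -o wide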

2.6 Check the GPU resources allocated in the Pod

[root@wf-229 gpu]# kubectl exec -it nginx-pod bash
root@nginx-pod:/# nvidia-smi
Mon Aug 12 11:39:05 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.93       Driver Version: 410.93       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P4000        Off  | 00000000:3B:00.0 Off |                  N/A |
| 46%   34C    P8     5W / 105W |      0MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

3. CLI overview

  • nvidia-container-cli

nvidia-container-cli is a command-line tool for configuring a Linux container's access to GPU hardware. It supports:
1) list: print the NVIDIA driver libraries and their paths
2) info: print all NVIDIA GPU devices
3) configure: enter the namespaces of a given process and perform the operations needed so that the specified GPUs and capabilities (that is, the specified NVIDIA driver libraries) are usable inside the container. configure is the main command we rely on; it maps the NVIDIA driver .so files and the GPU device information into the container via file mounts.
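For example, info and list can be run directly on a GPU node to inspect what the runtime will expose to containers (output omitted; the exact fields depend on the driver and CLI versions):

[root@k8s-01 ~]# nvidia-container-cli info
[root@k8s-01 ~]# nvidia-container-cli list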

  • Check a node's GPU resources
kubectl get node 192.18.1.26 -ojson | jq '.status.allocatable'