
k8s RoCE Deployment: k8s-rdma-shared-dev-plugin + macvlan CNI



Preface

An introductory write-up for my own reference; I will keep expanding it with more detail on the underlying principles.


1. Create the k8s cluster

There are several ways to create a k8s cluster; you can follow the official documentation at https://kubernetes.io/docs/setup/production-environment/tools/
For a newcomer (like me), I find it helps to understand the cluster-creation process in the context of the k8s architecture.
k8s cluster component
(Image source: https://www.redhat.com/en/topics/containers/kubernetes-architecture)

The following components need to be installed:

  1. container runtime: the service that runs containers; install and start it on every node
  2. kubectl: the user-facing CLI and the main tool for managing cluster resources, deploying containers, and debugging
  3. kubelet: the agent running on each node that ensures pods and containers are started and kept running (swap must be disabled; see the sketch after this list); install and start it on every node
  4. kubeadm: bootstraps and manages the cluster
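
As a minimal per-node preparation sketch (assuming a systemd-based host where containerd, kubelet, kubeadm, and kubectl have already been installed from your distribution's Kubernetes repository), the steps look roughly like this:

# swapoff -a                              ## kubelet requires swap to be off
# sed -i '/ swap / s/^/#/' /etc/fstab     ## keep swap disabled across reboots
# systemctl enable --now containerd       ## the container runtime must run on every node
# systemctl enable --now kubelet          ## kubelet keeps restarting until kubeadm init/join runs; that is expected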

Initialize the control-plane (master) node:

# kubeadm init

This step usually runs into plenty of errors, and Google is the best place to look for solutions. Once initialization succeeds, you will see output like this:

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 10.7.157.30:6443 --token 2rg4l1.n0rhvdp0uvxdrxjv \
        --discovery-token-ca-cert-hash sha256:fd7d661ec35868d036761e844597807a3d076daf3c8b71de6e1b55ee01e66a32
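
If the join command above is lost, or the token expires (the default TTL is 24 hours), a new join command can be printed on the control-plane node at any time:

# kubeadm token create --print-join-command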

At this point, the following pods have been created; all are Running except coredns, which stays Pending:

# export KUBECONFIG=/etc/kubernetes/admin.conf
# kubectl get node -o wide
NAME          STATUS   ROLES           AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION           CONTAINER-RUNTIME
node1         Ready    control-plane   2m50s   v1.24.0   10.7.157.30   <none>        Red Hat Enterprise Linux Server 7.7 (Maipo)   3.10.0-1062.el7.x86_64   containerd://1.6.4

# kubectl get pods --all-namespaces
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE
kube-system   coredns-6d4b75cb6d-752q4              0/1     Pending   0          35s
kube-system   coredns-6d4b75cb6d-7h2g5              0/1     Pending   0          35s
kube-system   etcd-node1                            1/1     Running   5          47s
kube-system   kube-apiserver-node1                  1/1     Running   4          48s
kube-system   kube-controller-manager-node1         1/1     Running   1          47s
kube-system   kube-proxy-px447                      1/1     Running   0          35s
kube-system   kube-scheduler-node1                  1/1     Running   4          48s

2. Enable the primary network

First, the k8s network model.
The core idea of k8s networking is that every pod gets its own unique IP; all containers in a pod share that IP and can communicate with other pods.
The pod subnet is usually configured in kubeadm-config.yaml as a CIDR block, i.e. a range of IP addresses from which pod IPs are allocated:

#### in kubeadm-config.yaml ####
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
kubernetesVersion: v1.24.0
networking:
  podSubnet: 10.244.0.0/16
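
To use this file, pass it to kubeadm when initializing the cluster; the podSubnet above matches flannel's default network, which is convenient for the next step:

# kubeadm init --config kubeadm-config.yaml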

Pod-to-pod communication is usually implemented with veth pairs and a Linux Ethernet bridge:
primary network

cni0 is essentially a Linux bridge; it can send ARP requests and resolve ARP responses.
eno1 is the network interface used for traffic between nodes; with IP forwarding enabled, it forwards received packets to cni0 according to the route table.
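
On a node, these pieces can be checked with standard tools once the CNI is installed and pods have been scheduled, for example:

# sysctl net.ipv4.ip_forward        ## should report 1 on the node
# ip link show cni0                 ## the Linux bridge created by the CNI plugin
# ip route | grep 10.244            ## pod-subnet routes pointing at cni0 and at other nodes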

To enable the k8s primary network, a primary-network CNI must be installed.

There are several choices, such as flannel, Calico, and WeaveNet. This example uses flannel, and the network interface flannel should use needs to be specified:

# yum install -y flannel
# vi /etc/sysconfig/flanneld  ## add additional options:
FLANNEL_OPTIONS="-iface=eno1"
# cp /usr/bin/flanneld /opt/bin
# kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds created

At this point coredns will change to the Running state.

3. Enable the secondary network

The primary network is used for basic communication between pods. A pod usually also needs a secondary network, which serves as a high-performance network for applications:
multi-networking
The following components need to be deployed:

  1. k8s-rdma-shared-dev-plugin
  2. Multus CNI
  3. Secondary CNI
  4. Multi-Network CRD

Among these, Multus CNI can be regarded as a meta plugin: it works together with other CNI plugins to provide multiple network interfaces per pod:
Multus CNI as meta plugin
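
Concretely, after installation Multus places itself first in /etc/cni/net.d/ and delegates the pod's default interface (eth0) to the original primary CNI, while extra interfaces come from NetworkAttachmentDefinitions. A simplified sketch of such a generated Multus configuration (values are illustrative, not copied from this cluster):

{
  "cniVersion": "0.3.1",
  "name": "multus-cni-network",
  "type": "multus",
  "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig",
  "delegates": [
    {
      "name": "cbr0",
      "cniVersion": "0.3.1",
      "plugins": [
        { "type": "flannel", "delegate": { "hairpinMode": true, "isDefaultGateway": true } },
        { "type": "portmap", "capabilities": { "portMappings": true } }
      ]
    }
  ]
}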

k8s-rdma-shared-dev-plugin

Create the configmap

# cat k8s-rdma-shared-dev-plugin-config-map.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
    {
        "periodicUpdateInterval": 300,
        "configList": [{
             "resourceName": "cx5_bond_shared_devices_a",
             "rdmaHcaMax": 1000,
             "selectors": {
               "vendors": ["15b3"],
               "deviceIDs": ["1017"]
             }
           },
           {
             "resourceName": "cx6dx_shared_devices_b",
             "rdmaHcaMax": 500,
             "selectors": {
               "vendors": ["15b3"],
               "deviceIDs": ["101d"]
             }
           }
        ]
    }


# kubectl create -f k8s-rdma-shared-dev-plugin-config-map.yaml
configmap/rdma-devices created
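
The vendors and deviceIDs used in the selectors can be read from the host with lspci (vendor ID 15b3 is Mellanox); the output below is illustrative:

# lspci -nn | grep -i mellanox
12:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
12:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]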

Create the k8s-rdma-shared-dev-plugin daemonset

# kubectl create -f https://raw.githubusercontent.com/Mellanox/k8s-rdma-shared-dev-plugin/master/images/k8s-rdma-shared-dev-plugin-ds.yaml
daemonset.apps/rdma-shared-dp-ds created

If the raw GitHub URL for k8s-rdma-shared-dev-plugin-ds.yaml above is not reachable, you can do the following instead:

# git clone https://github.com/Mellanox/k8s-rdma-shared-dev-plugin.git
# cd k8s-rdma-shared-dev-plugin/
# kubectl create -f deployment/k8s/base/daemonset.yaml
daemonset.apps/rdma-shared-dp-ds created
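
Once the daemonset pod is running, each node should advertise the RDMA resources defined in the configmap. A quick check (the resource names follow this example's configmap; actual output depends on your hardware):

# kubectl describe node node1 | grep rdma/     ## should list rdma/cx5_bond_shared_devices_a etc. under Capacity and Allocatable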

Multus CNI

# kubectl create -f https://raw.githubusercontent.com/intel/multus-cni/master/images/multus-daemonset.yml
customresourcedefinition.apiextensions.k8s.io/network-attachment-definitions.k8s.cni.cncf.io created
clusterrole.rbac.authorization.k8s.io/multus created
clusterrolebinding.rbac.authorization.k8s.io/multus created
serviceaccount/multus created
configmap/multus-cni-config created
daemonset.apps/kube-multus-ds-amd64 created
daemonset.apps/kube-multus-ds-ppc64le created

If the raw GitHub URL for multus-daemonset.yml above is not reachable, you can do the following instead:

# git clone https://github.com/k8snetworkplumbingwg/multus-cni.git
# cd multus-cni/
# kubectl create -f deployments/multus-daemonset.yml
customresourcedefinition.apiextensions.k8s.io/network-attachment-definitions.k8s.cni.cncf.io created
clusterrole.rbac.authorization.k8s.io/multus created
clusterrolebinding.rbac.authorization.k8s.io/multus created
serviceaccount/multus created
configmap/multus-cni-config created
daemonset.apps/kube-multus-ds created
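
After the Multus daemonset is up, it writes a generated configuration into /etc/cni/net.d/ on every node and registers the NetworkAttachmentDefinition CRD. Both can be verified (exact file names may vary between Multus versions):

# ls /etc/cni/net.d/
00-multus.conf  10-flannel.conflist  multus.d
# kubectl get crd network-attachment-definitions.k8s.cni.cncf.io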

Secondary CNI

# mkdir -p /opt/cni/bin
# wget https://github.com/containernetworking/plugins/releases/download/v1.1.1/cni-plugins-linux-amd64-v1.1.1.tgz
# tar Cxzvf /opt/cni/bin cni-plugins-linux-amd64-v1.1.1.tgz

Looking at /opt/cni/bin, you can see that several CNI plugins are now in place:

# ls /opt/cni/bin
bandwidth  bridge  dhcp  firewall  host-device  host-local  ipvlan  loopback  macvlan  portmap  ptp  sbr  static  tuning  vlan  vrf

This example uses the macvlan CNI.

Multi-Network CRD

Create two network attachments for the macvlan CNI. Note that their IP ranges must not overlap with the primary network's IP range:

# cat macvlan_cx6dx.yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-cx6dx-conf
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
     "master": "ens2f0",
        "ipam": {
                "type": "host-local",
                "subnet": "10.56.217.0/24",
                "rangeStart": "10.56.217.171",
                "rangeEnd": "10.56.217.181",
                "routes": [
                        { "dst": "0.0.0.0/0" }
                ],
                "gateway": "10.56.217.1"
        }
  }'

# cat macvlan_cx5_bond.yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-cx5-bond-conf
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
     "master": "bond0",
        "ipam": {
                "type": "host-local",
                "subnet": "10.56.217.0/24",
                "rangeStart": "10.56.217.71",
                "rangeEnd": "10.56.217.81",
                "routes": [
                        { "dst": "0.0.0.0/0" }
                ],
                "gateway": "10.56.217.1"
        }
  }'

# kubectl create -f macvlan_cx6dx.yaml
networkattachmentdefinition.k8s.cni.cncf.io/macvlan-cx6dx-conf created

# kubectl create -f macvlan_cx5_bond.yaml
networkattachmentdefinition.k8s.cni.cncf.io/macvlan-cx5-bond-conf created
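
The two attachments can be listed through the CRD registered by Multus (ages below are illustrative):

# kubectl get network-attachment-definitions
NAME                    AGE
macvlan-cx5-bond-conf   8s
macvlan-cx6dx-conf      15s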

4. Start the pods

This example only uses macvlan-cx5-bond-conf; to use macvlan-cx6dx-conf instead, specify the corresponding annotation and resources in test-xxx-pod.yaml (sketched below):
pod with multi-networking interfaces
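
For reference, the macvlan-cx6dx-conf variant would only differ in the network annotation and the RDMA resource name, roughly as in this (incomplete) fragment:

  annotations:
    k8s.v1.cni.cncf.io/networks: default/macvlan-cx6dx-conf
  ...
    resources:
      limits:
        rdma/cx6dx_shared_devices_b: 1
      requests:
        rdma/cx6dx_shared_devices_b: 1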

# cat test-cx5-bond-pod1.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-cx5-bond-pod1
  annotations:
    k8s.v1.cni.cncf.io/networks: default/macvlan-cx5-bond-conf
spec:
  restartPolicy: OnFailure
  containers:
  - image: mellanox/rping-test
    name: mofed-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/cx5_bond_shared_devices_a: 1
      requests:
        rdma/cx5_bond_shared_devices_a: 1
    command:
    - sh
    - -c
    - |
      ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
      sleep 1000000

# kubectl create -f test-cx5-bond-pod1.yaml
pod/mofed-test-cx5-bond-pod1 created

# cat test-cx5-bond-pod2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-cx5-bond-pod2
  annotations:
    k8s.v1.cni.cncf.io/networks: default/macvlan-cx5-bond-conf
spec:
  restartPolicy: OnFailure
  containers:
  - image: mellanox/rping-test
    name: mofed-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/cx5_bond_shared_devices_a: 1
      requests:
        rdma/cx5_bond_shared_devices_a: 1
    command:
    - sh
    - -c
    - |
      ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
      sleep 1000000

# kubectl create -f test-cx5-bond-pod2.yaml
pod/mofed-test-cx5-bond-pod2 created

5. Run RoCE traffic in the pods

RoCE traffic can now be run in the pods over the secondary network interface (net1 inside the pod):
RoCE traffic via the secondary interface (net1)

# kubectl get pods -A
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE
default       mofed-test-cx5-bond-pod1              1/1     Running   0          3m41s
default       mofed-test-cx5-bond-pod2              1/1     Running   0          32s
default       mofed-test-macvlan-pod                1/1     Running   0          4d9h
kube-system   coredns-6d4b75cb6d-752q4              1/1     Running   0          5d3h
kube-system   coredns-6d4b75cb6d-7h2g5              1/1     Running   0          5d3h
kube-system   etcd-node1                            1/1     Running   5          5d3h
kube-system   kube-apiserver-node1                  1/1     Running   4          5d3h
kube-system   kube-controller-manager-node1         1/1     Running   1          5d3h
kube-system   kube-flannel-ds-xwlr2                 1/1     Running   0          5d3h
kube-system   kube-multus-ds-kqhqn                  1/1     Running   0          5d2h
kube-system   kube-proxy-px447                      1/1     Running   0          5d3h
kube-system   kube-scheduler-node1                  1/1     Running   4          5d3h
kube-system   rdma-shared-dp-ds-vps6x               1/1     Running   0          21m

mofed-test-cx5-bond-pod1

# kubectl exec -it mofed-test-cx5-bond-pod1 bash
[root@mofed-test-cx5-bond-pod1 /]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.244.0.211  netmask 255.255.255.0  broadcast 10.244.0.255
        inet6 fe80::e45d:c4ff:fe4c:f3b3  prefixlen 64  scopeid 0x20<link>
        ether e6:5d:c4:4c:f3:b3  txqueuelen 0  (Ethernet)
        RX packets 12  bytes 1016 (1016.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 612 (612.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

net1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 10.56.217.71  netmask 255.255.255.0  broadcast 10.56.217.255
        ether fa:a4:6e:24:3e:ba  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@mofed-test-cx5-bond-pod1 /]# ib_write_bw -d mlx5_bond_0 -F --report_gbits
************************************
* Waiting for client to connect... *
************************************

mofed-test-cx5-bond-pod2

# kubectl exec -it mofed-test-cx5-bond-pod2 bash
[root@mofed-test-cx5-bond-pod2 /]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.244.0.212  netmask 255.255.255.0  broadcast 10.244.0.255
        inet6 fe80::20d6:7eff:fec0:4e39  prefixlen 64  scopeid 0x20<link>
        ether 22:d6:7e:c0:4e:39  txqueuelen 0  (Ethernet)
        RX packets 12  bytes 1016 (1016.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 612 (612.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

net1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 10.56.217.72  netmask 255.255.255.0  broadcast 10.56.217.255
        ether a6:46:b9:94:b0:31  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@mofed-test-cx5-bond-pod2 /]# ib_write_bw -d mlx5_bond_0 -F --report_gbits 10.56.217.71
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_bond_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 4
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x117c PSN 0xbfdcaf RKey 0x00511b VAddr 0x007fdf469fd000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:56:217:72
 remote address: LID 0000 QPN 0x117d PSN 0x75cbaa RKey 0x004407 VAddr 0x007f65e74dc000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:56:217:71
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      5000             82.62              82.55              0.157445
---------------------------------------------------------------------------------------

Summary

TBD
