
2022-04-18 Some OpenShift test-suite issues: etcd member communication slow


Table of contents

Environment:

Hardware specs:

Cluster 10.253.24.4

Standalone node 10.253.15.56

Build and run:

Run logs:

Cluster serial run:

Cluster parallel run:

Collected failing test cases:

Failing cases to resolve:

1. [sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

2. [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

3. [sig-cli] oc observe works as expected [Suite:openshift/conformance/parallel]


Environment:

Cluster environment: 10.253.24.4 root/123456
Standalone environments: 10.253.15.55, 10.253.15.56, 10.253.15.59

Hardware specs:

Cluster 10.253.24.4

Three nodes:

[root@master0 ~]# kubectl  get nodes -o wide
NAME      STATUS   ROLES           AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION         CONTAINER-RUNTIME
master0   Ready    master,worker   3d3h   v1.22.3+4dd1b5a   10.253.24.4   <none>        CCLinux 2203   5.15.13-0.el9.x86_64   cri-o://1.23.2
master1   Ready    master,worker   3d3h   v1.22.3+4dd1b5a   10.253.24.5   <none>        CCLinux 2203   5.15.13-0.el9.x86_64   cri-o://1.23.2
master2   Ready    master,worker   3d3h   v1.22.3+4dd1b5a   10.253.24.6   <none>        CCLinux 2203   5.15.13-0.el9.x86_64   cri-o://1.23.2

Capacity:
  cpu:                8
  ephemeral-storage:  184230Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32603308Ki
  pods:               250
Allocatable:
  cpu:                7500m
  ephemeral-storage:  173861240545
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             31452332Ki
  pods:               250

processor	: 7
vendor_id	: HygonGenuine
cpu family	: 24
model		: 1
model name	: Hygon C86 7285 32-core Processor
stepping	: 1
microcode	: 0x80901047
cpu MHz		: 1999.999
cache size	: 512 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 14
initial apicid	: 14
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext cpb ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero arat overflow_recov succor
bugs		: sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 3999.99
TLB size	: 2560 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 45 bits physical, 48 bits virtual
power management:

Standalone node 10.253.15.56

Capacity:
  cpu:                  8
  ephemeral-storage:    194465Mi
  example.com/fakecpu:  1k
  hugepages-1Gi:        0
  hugepages-2Mi:        0
  memory:               32603860Ki
  pods:                 250
Allocatable:
  cpu:                  7500m
  ephemeral-storage:    183520198353
  example.com/fakecpu:  1k
  hugepages-1Gi:        0
  hugepages-2Mi:        0
  memory:               31452884Ki
  pods:                 250

processor	: 7
vendor_id	: HygonGenuine
cpu family	: 24
model		: 1
model name	: Hygon C86 7285 32-core Processor
stepping	: 1
microcode	: 0x80901047
cpu MHz		: 1999.999
cache size	: 512 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 14
initial apicid	: 14
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext cpb ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero arat overflow_recov succor
bugs		: sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 3999.99
TLB size	: 2560 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 45 bits physical, 48 bits virtual
power management:

Build and run:

Release test execution guide

Run logs:

Cluster serial run:

Cluster parallel run:

Collected failing test cases:

CCOS 0.0.0 single-node failing cases

CCOS 0.0.0 cluster failing cases

Failing cases to resolve:

1. [sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

Problem description:

The test fails because at least one alert is firing with severity other than info (alertstate="firing", severity!="info"). The offending alerts:

"metric": {
          "__name__": "ALERTS",
          "alertname": "CannotRetrieveUpdates",
          "alertstate": "firing",
          "endpoint": "metrics",
          "instance": "10.255.245.137:9099",
          "job": "cluster-version-operator",
          "namespace": "openshift-cluster-version",
          "pod": "cluster-version-operator-79fd7675bd-nz5hr",
          "prometheus": "openshift-monitoring/k8s",
          "service": "cluster-version-operator",
          "severity": "warning"
        },
        "value": [
          1649754743.841,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "KubeContainerWaiting",
          "alertstate": "firing",
          "container": "machine-config-server",
          "namespace": "default",
          "pod": "bootstrap-machine-config-operator-qqmaster0",
          "prometheus": "openshift-monitoring/k8s",
          "severity": "warning"
        },
        "value": [
          1649754743.841,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "KubePodNotReady",
          "alertstate": "firing",
          "namespace": "default",
          "pod": "bootstrap-machine-config-operator-qqmaster0",
          "prometheus": "openshift-monitoring/k8s",
          "severity": "warning"
        },
        "value": [
          1649754743.841,
          "1"
        ]
      }

started: (0/4/333) "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"

passed: (500ms) 2022-04-18T07:08:20 "[sig-arch][Early] Managed cluster should start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"

skip [github.com/openshift/origin/test/extended/machines/cluster.go:44]: cluster does not have machineset resources

skipped: (500ms) 2022-04-18T07:08:20 "[sig-cluster-lifecycle][Feature:Machines][Early] Managed cluster should have same number of Machines and Nodes [Suite:openshift/conformance/parallel]"

passed: (500ms) 2022-04-18T07:08:20 "[sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel]"

[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1453
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1453
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/test.go:61
[BeforeEach] [sig-instrumentation] Prometheus
  github.com/openshift/origin/test/extended/util/client.go:142
STEP: Creating a kubernetes client
[BeforeEach] [sig-instrumentation] Prometheus
  github.com/openshift/origin/test/extended/prometheus/prometheus.go:250
[It] shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/prometheus/prometheus.go:506
Apr 18 15:08:22.024: INFO: Creating namespace "e2e-test-prometheus-v8jrp"
Apr 18 15:08:22.291: INFO: Waiting for ServiceAccount "default" to be provisioned...
Apr 18 15:08:22.396: INFO: Creating new exec pod
STEP: perform prometheus metric query ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1
Apr 18 15:08:26.433: INFO: Running '/usr/bin/kubectl --server=https://api.ocp4e2e.samuele2e.cn:6443 --kubeconfig=/root/.kube/config --namespace=e2e-test-prometheus-v8jrp exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6Ijc1T01GM2Qtc2NLQjZLb2ZENkFKcEcxLW5USVhkbGpVNUY1cGV5UTUtOVUifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNnB4YngiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYmFlNzhkZTktZGE4NS00OTdmLTkyNzItOWY5ZjI4MWVlZGM3Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.5VmNcWMFRgyXdUZVcUSXQazo-hoA3Tzhyq3zJJn-Zbuqvdxs0c9lb2iobTriV-bwA4Ub6e9pw3dHIJOoaqBBD4nSEmXGRm6RfrHaKeU_t-d_BfHAyP-K4wUsyA6DV0Rpk3JhONz1vFX2OdEvu5aiZXJOyxdHKbOvn4y_caeUDPOj1TKFHkfE81zoG-mpZomYuEW7rudk2yHblTS_jfinSelC9Hdi62czl-omycez6XCyqHvCI4yFwRBQv3o409s5Xj2y5oMik2YVmo__TIgi0bO-VKzYT58KYRSW9uK4UIJMfyvSzDN-6j2eugyfWwfSPCKUh6NmLxq2NDSPMd4yFw' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1"'
Apr 18 15:08:26.633: INFO: stderr: "+ curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6Ijc1T01GM2Qtc2NLQjZLb2ZENkFKcEcxLW5USVhkbGpVNUY1cGV5UTUtOVUifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNnB4YngiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYmFlNzhkZTktZGE4NS00OTdmLTkyNzItOWY5ZjI4MWVlZGM3Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.5VmNcWMFRgyXdUZVcUSXQazo-hoA3Tzhyq3zJJn-Zbuqvdxs0c9lb2iobTriV-bwA4Ub6e9pw3dHIJOoaqBBD4nSEmXGRm6RfrHaKeU_t-d_BfHAyP-K4wUsyA6DV0Rpk3JhONz1vFX2OdEvu5aiZXJOyxdHKbOvn4y_caeUDPOj1TKFHkfE81zoG-mpZomYuEW7rudk2yHblTS_jfinSelC9Hdi62czl-omycez6XCyqHvCI4yFwRBQv3o409s5Xj2y5oMik2YVmo__TIgi0bO-VKzYT58KYRSW9uK4UIJMfyvSzDN-6j2eugyfWwfSPCKUh6NmLxq2NDSPMd4yFw' 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1'\n"
Apr 18 15:08:26.633: INFO: stdout: "{\"status\":\"success\",\"data\":{\"resultType\":\"vector\",\"result\":[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"CannotRetrieveUpdates\",\"alertstate\":\"firing\",\"endpoint\":\"metrics\",\"instance\":\"10.253.24.6:9099\",\"job\":\"cluster-version-operator\",\"namespace\":\"openshift-cluster-version\",\"pod\":\"cluster-version-operator-57f968f56-mv9s8\",\"prometheus\":\"openshift-monitoring/k8s\",\"service\":\"cluster-version-operator\",\"severity\":\"warning\"},\"value\":[1650265706.616,\"1\"]},{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"SystemMemoryExceedsReservation\",\"alertstate\":\"firing\",\"node\":\"master1\",\"prometheus\":\"openshift-monitoring/k8s\",\"severity\":\"warning\"},\"value\":[1650265706.616,\"1\"]}]}}\n"
STEP: perform prometheus metric query ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1
Apr 18 15:08:36.633: INFO: Running '/usr/bin/kubectl --server=https://api.ocp4e2e.samuele2e.cn:6443 --kubeconfig=/root/.kube/config --namespace=e2e-test-prometheus-v8jrp exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6Ijc1T01GM2Qtc2NLQjZLb2ZENkFKcEcxLW5USVhkbGpVNUY1cGV5UTUtOVUifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNnB4YngiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYmFlNzhkZTktZGE4NS00OTdmLTkyNzItOWY5ZjI4MWVlZGM3Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.5VmNcWMFRgyXdUZVcUSXQazo-hoA3Tzhyq3zJJn-Zbuqvdxs0c9lb2iobTriV-bwA4Ub6e9pw3dHIJOoaqBBD4nSEmXGRm6RfrHaKeU_t-d_BfHAyP-K4wUsyA6DV0Rpk3JhONz1vFX2OdEvu5aiZXJOyxdHKbOvn4y_caeUDPOj1TKFHkfE81zoG-mpZomYuEW7rudk2yHblTS_jfinSelC9Hdi62czl-omycez6XCyqHvCI4yFwRBQv3o409s5Xj2y5oMik2YVmo__TIgi0bO-VKzYT58KYRSW9uK4UIJMfyvSzDN-6j2eugyfWwfSPCKUh6NmLxq2NDSPMd4yFw' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1"'
Apr 18 15:08:36.804: INFO: stderr: "+ curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6Ijc1T01GM2Qtc2NLQjZLb2ZENkFKcEcxLW5USVhkbGpVNUY1cGV5UTUtOVUifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNnB4YngiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYmFlNzhkZTktZGE4NS00OTdmLTkyNzItOWY5ZjI4MWVlZGM3Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.5VmNcWMFRgyXdUZVcUSXQazo-hoA3Tzhyq3zJJn-Zbuqvdxs0c9lb2iobTriV-bwA4Ub6e9pw3dHIJOoaqBBD4nSEmXGRm6RfrHaKeU_t-d_BfHAyP-K4wUsyA6DV0Rpk3JhONz1vFX2OdEvu5aiZXJOyxdHKbOvn4y_caeUDPOj1TKFHkfE81zoG-mpZomYuEW7rudk2yHblTS_jfinSelC9Hdi62czl-omycez6XCyqHvCI4yFwRBQv3o409s5Xj2y5oMik2YVmo__TIgi0bO-VKzYT58KYRSW9uK4UIJMfyvSzDN-6j2eugyfWwfSPCKUh6NmLxq2NDSPMd4yFw' 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1'\n"
Apr 18 15:08:36.804: INFO: stdout: "{\"status\":\"success\",\"data\":{\"resultType\":\"vector\",\"result\":[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"CannotRetrieveUpdates\",\"alertstate\":\"firing\",\"endpoint\":\"metrics\",\"instance\":\"10.253.24.6:9099\",\"job\":\"cluster-version-operator\",\"namespace\":\"openshift-cluster-version\",\"pod\":\"cluster-version-operator-57f968f56-mv9s8\",\"prometheus\":\"openshift-monitoring/k8s\",\"service\":\"cluster-version-operator\",\"severity\":\"warning\"},\"value\":[1650265716.792,\"1\"]},{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"SystemMemoryExceedsReservation\",\"alertstate\":\"firing\",\"node\":\"master1\",\"prometheus\":\"openshift-monitoring/k8s\",\"severity\":\"warning\"},\"value\":[1650265716.792,\"1\"]}]}}\n"
Apr 18 15:08:36.804: INFO: promQL query returned unexpected results:
ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1
[
  {
    "metric": {
      "__name__": "ALERTS",
      "alertname": "CannotRetrieveUpdates",
      "alertstate": "firing",
      "endpoint": "metrics",
      "instance": "10.253.24.6:9099",
      "job": "cluster-version-operator",
      "namespace": "openshift-cluster-version",
      "pod": "cluster-version-operator-57f968f56-mv9s8",
      "prometheus": "openshift-monitoring/k8s",
      "service": "cluster-version-operator",
      "severity": "warning"
    },
    "value": [
      1650265706.616,
      "1"
    ]
  },
  {
    "metric": {
      "__name__": "ALERTS",
      "alertname": "SystemMemoryExceedsReservation",
      "alertstate": "firing",
      "node": "master1",
      "prometheus": "openshift-monitoring/k8s",
      "severity": "warning"
    },
    "value": [
      1650265706.616,
      "1"
    ]
  }
]
STEP: perform prometheus metric query ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1
Apr 18 15:08:46.805: INFO: Running '/usr/bin/kubectl --server=https://api.ocp4e2e.samuele2e.cn:6443 --kubeconfig=/root/.kube/config --namespace=e2e-test-prometheus-v8jrp exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6Ijc1T01GM2Qtc2NLQjZLb2ZENkFKcEcxLW5USVhkbGpVNUY1cGV5UTUtOVUifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNnB4YngiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYmFlNzhkZTktZGE4NS00OTdmLTkyNzItOWY5ZjI4MWVlZGM3Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.5VmNcWMFRgyXdUZVcUSXQazo-hoA3Tzhyq3zJJn-Zbuqvdxs0c9lb2iobTriV-bwA4Ub6e9pw3dHIJOoaqBBD4nSEmXGRm6RfrHaKeU_t-d_BfHAyP-K4wUsyA6DV0Rpk3JhONz1vFX2OdEvu5aiZXJOyxdHKbOvn4y_caeUDPOj1TKFHkfE81zoG-mpZomYuEW7rudk2yHblTS_jfinSelC9Hdi62czl-omycez6XCyqHvCI4yFwRBQv3o409s5Xj2y5oMik2YVmo__TIgi0bO-VKzYT58KYRSW9uK4UIJMfyvSzDN-6j2eugyfWwfSPCKUh6NmLxq2NDSPMd4yFw' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1"'
Apr 18 15:08:46.899: INFO: rc: 1
Apr 18 15:08:46.899: INFO: promQL query returned unexpected results:
ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1
[
  {
    "metric": {
      "__name__": "ALERTS",
      "alertname": "CannotRetrieveUpdates",
      "alertstate": "firing",
      "endpoint": "metrics",
      "instance": "10.253.24.6:9099",
      "job": "cluster-version-operator",
      "namespace": "openshift-cluster-version",
      "pod": "cluster-version-operator-57f968f56-mv9s8",
      "prometheus": "openshift-monitoring/k8s",
      "service": "cluster-version-operator",
      "severity": "warning"
    },
    "value": [
      1650265716.792,
      "1"
    ]
  },
  {
    "metric": {
      "__name__": "ALERTS",
      "alertname": "SystemMemoryExceedsReservation",
      "alertstate": "firing",
      "node": "master1",
      "prometheus": "openshift-monitoring/k8s",
      "severity": "warning"
    },
    "value": [
      1650265716.792,
      "1"
    ]
  }
]
STEP: perform prometheus metric query ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1
Apr 18 15:08:56.900: INFO: Running '/usr/bin/kubectl --server=https://api.ocp4e2e.samuele2e.cn:6443 --kubeconfig=/root/.kube/config --namespace=e2e-test-prometheus-v8jrp exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6Ijc1T01GM2Qtc2NLQjZLb2ZENkFKcEcxLW5USVhkbGpVNUY1cGV5UTUtOVUifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNnB4YngiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYmFlNzhkZTktZGE4NS00OTdmLTkyNzItOWY5ZjI4MWVlZGM3Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.5VmNcWMFRgyXdUZVcUSXQazo-hoA3Tzhyq3zJJn-Zbuqvdxs0c9lb2iobTriV-bwA4Ub6e9pw3dHIJOoaqBBD4nSEmXGRm6RfrHaKeU_t-d_BfHAyP-K4wUsyA6DV0Rpk3JhONz1vFX2OdEvu5aiZXJOyxdHKbOvn4y_caeUDPOj1TKFHkfE81zoG-mpZomYuEW7rudk2yHblTS_jfinSelC9Hdi62czl-omycez6XCyqHvCI4yFwRBQv3o409s5Xj2y5oMik2YVmo__TIgi0bO-VKzYT58KYRSW9uK4UIJMfyvSzDN-6j2eugyfWwfSPCKUh6NmLxq2NDSPMd4yFw' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1"'
Apr 18 15:08:56.998: INFO: rc: 1
STEP: perform prometheus metric query ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1
Apr 18 15:09:06.999: INFO: Running '/usr/bin/kubectl --server=https://api.ocp4e2e.samuele2e.cn:6443 --kubeconfig=/root/.kube/config --namespace=e2e-test-prometheus-v8jrp exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6Ijc1T01GM2Qtc2NLQjZLb2ZENkFKcEcxLW5USVhkbGpVNUY1cGV5UTUtOVUifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNnB4YngiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYmFlNzhkZTktZGE4NS00OTdmLTkyNzItOWY5ZjI4MWVlZGM3Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.5VmNcWMFRgyXdUZVcUSXQazo-hoA3Tzhyq3zJJn-Zbuqvdxs0c9lb2iobTriV-bwA4Ub6e9pw3dHIJOoaqBBD4nSEmXGRm6RfrHaKeU_t-d_BfHAyP-K4wUsyA6DV0Rpk3JhONz1vFX2OdEvu5aiZXJOyxdHKbOvn4y_caeUDPOj1TKFHkfE81zoG-mpZomYuEW7rudk2yHblTS_jfinSelC9Hdi62czl-omycez6XCyqHvCI4yFwRBQv3o409s5Xj2y5oMik2YVmo__TIgi0bO-VKzYT58KYRSW9uK4UIJMfyvSzDN-6j2eugyfWwfSPCKUh6NmLxq2NDSPMd4yFw' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1"'
Apr 18 15:09:07.078: INFO: rc: 1
[AfterEach] [sig-instrumentation] Prometheus
  github.com/openshift/origin/test/extended/util/client.go:140
STEP: Collecting events from namespace "e2e-test-prometheus-v8jrp".
STEP: Found 7 events.
Apr 18 15:09:17.092: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for execpod: { } Scheduled: Successfully assigned e2e-test-prometheus-v8jrp/execpod to master1
Apr 18 15:09:17.092: INFO: At 2022-04-18 15:08:24 +0800 CST - event for execpod: {multus } AddedInterface: Add eth0 [21.100.0.244/23] from ovn-kubernetes
Apr 18 15:09:17.092: INFO: At 2022-04-18 15:08:24 +0800 CST - event for execpod: {kubelet master1} Pulled: Container image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest" already present on machine
Apr 18 15:09:17.092: INFO: At 2022-04-18 15:08:24 +0800 CST - event for execpod: {kubelet master1} Created: Created container agnhost-container
Apr 18 15:09:17.092: INFO: At 2022-04-18 15:08:24 +0800 CST - event for execpod: {kubelet master1} Started: Started container agnhost-container
Apr 18 15:09:17.092: INFO: At 2022-04-18 15:08:40 +0800 CST - event for execpod: {taint-controller } TaintManagerEviction: Marking for deletion Pod e2e-test-prometheus-v8jrp/execpod
Apr 18 15:09:17.092: INFO: At 2022-04-18 15:08:40 +0800 CST - event for execpod: {kubelet master1} Killing: Stopping container agnhost-container
Apr 18 15:09:17.094: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
Apr 18 15:09:17.094: INFO: 
Apr 18 15:09:17.097: INFO: skipping dumping cluster info - cluster too large
[AfterEach] [sig-instrumentation] Prometheus
  github.com/openshift/origin/test/extended/util/client.go:141
STEP: Destroying namespace "e2e-test-prometheus-v8jrp" for this suite.
fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:533]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        {
            s: "unable to execute query ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1: unable to execute query host command failed: error running /usr/bin/kubectl --server=https://api.ocp4e2e.samuele2e.cn:6443 --kubeconfig=/root/.kube/config --namespace=e2e-test-prometheus-v8jrp exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6Ijc1T01GM2Qtc2NLQjZLb2ZENkFKcEcxLW5USVhkbGpVNUY1cGV5UTUtOVUifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNnB4YngiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYmFlNzhkZTktZGE4NS00OTdmLTkyNzItOWY5ZjI4MWVlZGM3Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.5VmNcWMFRgyXdUZVcUSXQazo-hoA3Tzhyq3zJJn-Zbuqvdxs0c9lb2iobTriV-bwA4Ub6e9pw3dHIJOoaqBBD4nSEmXGRm6RfrHaKeU_t-d_BfHAyP-K4wUsyA6DV0Rpk3JhONz1vFX2OdEvu5aiZXJOyxdHKbOvn4y_caeUDPOj1TKFHkfE81zoG-mpZomYuEW7rudk2yHblTS_jfinSelC9Hdi62czl-omycez6XCyqHvCI4yFwRBQv3o409s5Xj2y5oMik2YVmo__TIgi0bO-VKzYT58KYRSW9uK4UIJMfyvSzDN-6j2eugyfWwfSPCKUh6NmLxq2NDSPMd4yFw' \"https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1\":\nCommand stdout:\n\nstderr:\nError from server (NotFound): pods \"execpod\" not found\n\nerror:\nexit status 1\n",
        },
    ]
    unable to execute query ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1: unable to execute query host command failed: error running /usr/bin/kubectl --server=https://api.ocp4e2e.samuele2e.cn:6443 --kubeconfig=/root/.kube/config --namespace=e2e-test-prometheus-v8jrp exec execpod -- /bin/sh -x -c curl --retry 15 --max-time 2 --retry-delay 1 -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6Ijc1T01GM2Qtc2NLQjZLb2ZENkFKcEcxLW5USVhkbGpVNUY1cGV5UTUtOVUifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tNnB4YngiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiYmFlNzhkZTktZGE4NS00OTdmLTkyNzItOWY5ZjI4MWVlZGM3Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.5VmNcWMFRgyXdUZVcUSXQazo-hoA3Tzhyq3zJJn-Zbuqvdxs0c9lb2iobTriV-bwA4Ub6e9pw3dHIJOoaqBBD4nSEmXGRm6RfrHaKeU_t-d_BfHAyP-K4wUsyA6DV0Rpk3JhONz1vFX2OdEvu5aiZXJOyxdHKbOvn4y_caeUDPOj1TKFHkfE81zoG-mpZomYuEW7rudk2yHblTS_jfinSelC9Hdi62czl-omycez6XCyqHvCI4yFwRBQv3o409s5Xj2y5oMik2YVmo__TIgi0bO-VKzYT58KYRSW9uK4UIJMfyvSzDN-6j2eugyfWwfSPCKUh6NmLxq2NDSPMd4yFw' "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1":
    Command stdout:
    
    stderr:
    Error from server (NotFound): pods "execpod" not found
    
    error:
    exit status 1
    
occurred

failed: (56.7s) 2022-04-18T07:09:17 "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"

Apr 18 15:08:36.804: INFO: promQL query returned unexpected results:
ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1
[
  {
    "metric": {
      "__name__": "ALERTS",
      "alertname": "CannotRetrieveUpdates",
      "alertstate": "firing",
      "endpoint": "metrics",
      "instance": "10.253.24.6:9099",
      "job": "cluster-version-operator",
      "namespace": "openshift-cluster-version",
      "pod": "cluster-version-operator-57f968f56-mv9s8",
      "prometheus": "openshift-monitoring/k8s",
      "service": "cluster-version-operator",
      "severity": "warning"
    },
    "value": [
      1650265706.616,
      "1"
    ]
  },
  {
    "metric": {
      "__name__": "ALERTS",
      "alertname": "SystemMemoryExceedsReservation",
      "alertstate": "firing",
      "node": "master1",
      "prometheus": "openshift-monitoring/k8s",
      "severity": "warning"
    },
    "value": [
      1650265706.616,
      "1"
    ]
  }
]
STEP: perform prometheus metric query ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1

Problem analysis:

"metric": {
      "__name__": "ALERTS",
      "alertname": "CannotRetrieveUpdates",
      "alertstate": "firing",
      "endpoint": "metrics",
      "instance": "10.253.24.6:9099",
      "job": "cluster-version-operator",
      "namespace": "openshift-cluster-version",
      "pod": "cluster-version-operator-57f968f56-mv9s8",
      "prometheus": "openshift-monitoring/k8s",
      "service": "cluster-version-operator",
      "severity": "warning"
    },

    "metric": {
      "__name__": "ALERTS",
      "alertname": "SystemMemoryExceedsReservation",
      "alertstate": "firing",
      "node": "master1",
      "prometheus": "openshift-monitoring/k8s",
      "severity": "warning"
    },

From the alert labels above we can see that two alerts are in the firing state:

  1. CannotRetrieveUpdates
  2. SystemMemoryExceedsReservation

1953846 – SystemMemoryExceedsReservation alert should consider hugepage reservation

SystemMemoryExceedsReservation alert which is added from OCP 4.6 should consider Hugepage reservation.

The SystemMemoryExceedsReservation alert uses following Prometheus query:

~~~
sum by (node) (container_memory_rss{id="/system.slice"}) > ((sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.9)
~~~

As per the above query, if hugepages are set on a worker node, the right side of the comparison includes the hugepages that are supposed to be allocated by applications, while the left side measures the working memory used by system processes related to the containers running on the node.
In that case the right side is inflated by application memory that is irrelevant to the system-reserved memory, so the alert becomes meaningless.




For example, if a node has 30GiB of hugepages like below:

~~~
$ oc describe node <node-name>

...
Capacity:
cpu:                      80
ephemeral-storage:        2096613Mi
hugepages-1Gi:            30Gi
hugepages-2Mi:            0
memory:                   527977304Ki
openshift.io/dpdk_ext0:   0
openshift.io/f1u:         10
openshift.io/sriov_ext0:  10
pods:                     250

Allocatable:
cpu:                      79500m
ephemeral-storage:        1977538520680
hugepages-1Gi:            30Gi
hugepages-2Mi:            0
memory:                   495369048Ki
openshift.io/dpdk_ext0:   0
openshift.io/f1u:         10
openshift.io/sriov_ext0:  10
pods:                     250
..
~~~

The system-reserved value then includes the 30GiB of hugepages that will actually be allocated by applications:

SystemReserved = kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"}
               = 527977304Ki - 495369048Ki ≈ 31GiB

And container_memory_rss{id="/system.slice"} is unlikely to be larger than the right side, since the underlying system processes rarely use hugepages as far as I know.

I am not sure if my understanding is correct; if I am wrong please let me know.

https://github.com/openshift/machine-config-operator/blob/f86955971533aacbb4bb66f5c7041057d3f33566/install/0000_90_machine-config-operator_01_prometheus-rules.yaml#L53-L60

- name: system-memory-exceeds-reservation
  rules:
    - alert: SystemMemoryExceedsReservation
      expr: |
        sum by (node) (container_memory_rss{id="/system.slice"}) > ((sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.9)
      for: 15m
      labels:
        severity: warning
      annotations:
        message: "System memory usage of {{ $value | humanize }} on {{ $labels.node }} exceeds 90% of the reservation. Reserved memory ensures system processes can function even when the node is fully allocated and protects against workload out of memory events impacting the proper functioning of the node. The default reservation is expected to be sufficient for most configurations and should be increased (https://docs.openshift.com/container-platform/latest/nodes/nodes/nodes-nodes-managing.html) when running nodes with high numbers of pods (either due to rate of change or at steady state)."

Allocating resources for nodes - Working with nodes | Nodes | OpenShift Container Platform 4.6

Managing nodes - Working with nodes | Nodes | OpenShift Container Platform 4.10

Using the standalone node 10.253.15.56 for the calculation:

Capacity:
  cpu:                8
  ephemeral-storage:  194465Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32603860Ki
  pods:               250


Allocatable:
  cpu:                7500m
  ephemeral-storage:  183520198353
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             31452884Ki
  pods:               250


(32603860Ki - 31452884Ki) * 0.9

= 1,150,976Ki * 0.9

= 1,035,878.4 Ki

≈ 1,011.6 MiB

≈ 0.98 GiB

Once the RSS of the system processes under /system.slice on this node exceeds roughly 0.98 GiB, Prometheus starts firing this alert.
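To check how close a node actually is to this threshold, both sides of the alert expression can be evaluated directly against the in-cluster Thanos querier. The following is a minimal sketch, assuming a token-based login (oc whoami -t returns a token) and the default thanos-querier route in openshift-monitoring:

~~~
# Sketch only: evaluate both sides of SystemMemoryExceedsReservation by hand.
# Assumes "oc whoami -t" returns a usable token and that the thanos-querier
# route exists in openshift-monitoring (the default in OCP 4.x).
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')

# Left side: RSS of the system processes (/system.slice) per node
curl -G -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
  --data-urlencode 'query=sum by (node) (container_memory_rss{id="/system.slice"})'

# Right side: 90% of the reservation (capacity - allocatable) per node
curl -G -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
  --data-urlencode 'query=(sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.9'
~~~

If the left side sits near or above roughly 0.98 GiB most of the time on this hardware, the alert will keep firing during the test run.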

Problem resolution:

Option 1: add memory; expand the memory of both the single-node and the cluster OpenShift environments.

The added memory must be large enough that, during the test run, the following expression no longer evaluates to true:

https://github.com/openshift/machine-config-operator/blob/f86955971533aacbb4bb66f5c7041057d3f33566/install/0000_90_machine-config-operator_01_prometheus-rules.yaml#L53-L60

sum by (node) (container_memory_rss{id="/system.slice"}) > ((sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.9)

Option 2: modify the monitoring rule in the OpenShift machine-config-operator component,

https://github.com/openshift/machine-config-operator

https://github.com/openshift/machine-config-operator/blob/f86955971533aacbb4bb66f5c7041057d3f33566/install/0000_90_machine-config-operator_01_prometheus-rules.yaml#L53-L60

for example by changing the rule below so that the alert never fires:

- name: system-memory-exceeds-reservation
  rules:
    - alert: SystemMemoryExceedsReservation
      expr: |
        sum by (node) (container_memory_rss{id="/system.slice"}) > ((sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.9)
      for: 15m
      labels:
        severity: warning
      annotations:
        message: "System memory usage of {{ $value | humanize }} on {{ $labels.node }} exceeds 90% of the reservation. Reserved memory ensures system processes can function even when the node is fully allocated and protects against workload out of memory events impacting the proper functioning of the node. The default reservation is expected to be sufficient for most configurations and should be increased (https://docs.openshift.com/container-platform/latest/nodes/nodes/nodes-nodes-managing.html) when running nodes with high numbers of pods (either due to rate of change or at steady state)."
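Before deciding how to change the rule, it helps to look at the copy that is actually deployed in the cluster. A hedged sketch follows; the namespace is taken from the machine-config-operator manifest above, but the exact PrometheusRule object name should be confirmed with the first command:

~~~
# Sketch only: locate the deployed PrometheusRule carrying
# SystemMemoryExceedsReservation and show its current expression.
oc -n openshift-machine-config-operator get prometheusrules
oc -n openshift-machine-config-operator get prometheusrules -o yaml \
  | grep -B2 -A12 'alert: SystemMemoryExceedsReservation'
# Note: edits made directly to this live object are likely to be reconciled
# away by the operator; the lasting change is to patch the expression in the
# machine-config-operator manifest linked above and redeploy.
~~~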

2. [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

Problem description:

Unexpected alerts fired or pending after the test run:

alert CannotRetrieveUpdates fired for 2313 seconds with labels: {endpoint="metrics", instance="10.255.245.135:9099", job="cluster-version-operator", namespace="openshift-cluster-version", pod="cluster-version-operator-79fd7675bd-8vwqd", service="cluster-version-operator", severity="warning"}
alert KubeContainerWaiting fired for 2313 seconds with labels: {container="machine-config-server", namespace="default", pod="bootstrap-machine-config-operator-master0", severity="warning"}
alert KubePodNotReady fired for 2313 seconds with labels: {namespace="default", pod="bootstrap-machine-config-operator-master0", severity="warning"}
alert etcdGRPCRequestsSlow fired for 60 seconds with labels: {endpoint="etcd-metrics", grpc_method="Status", grpc_service="etcdserverpb.Maintenance", instance="10.255.245.135:9979", job="etcd", namespace="openshift-etcd", pod="etcd-master0", service="etcd", severity="critical"}

[root@master0 zsl]# cat e2e202204014-cluster-serial.log  | grep "sig-instrumentation"
started: (0/4/333) "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"
[BeforeEach] [sig-instrumentation] Prometheus
[BeforeEach] [sig-instrumentation] Prometheus
[AfterEach] [sig-instrumentation] Prometheus
[AfterEach] [sig-instrumentation] Prometheus
failed: (56.7s) 2022-04-18T07:09:17 "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"
started: (7/326/333) "[sig-instrumentation] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present [Late] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"
started: (7/327/333) "[sig-instrumentation][Late] Alerts shouldn't exceed the 500 series limit of total series sent via telemetry from each cluster [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"
started: (7/331/333) "[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"
passed: (7.6s) 2022-04-18T08:00:17 "[sig-instrumentation][Late] Alerts shouldn't exceed the 500 series limit of total series sent via telemetry from each cluster [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"
passed: (7.9s) 2022-04-18T08:00:18 "[sig-instrumentation] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present [Late] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"
[BeforeEach] [sig-instrumentation][Late] Alerts
[BeforeEach] [sig-instrumentation][Late] Alerts
[AfterEach] [sig-instrumentation][Late] Alerts
[AfterEach] [sig-instrumentation][Late] Alerts
failed: (7.7s) 2022-04-18T08:00:18 "[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"
Apr 18 07:08:20.489 I e2e-test/"[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" started
Apr 18 07:08:20.489 - 56s   E e2e-test/"[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" e2e test finished As "Failed"
Apr 18 07:09:17.142 E e2e-test/"[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" finishedStatus/Failed
Apr 18 08:00:10.359 I e2e-test/"[sig-instrumentation] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present [Late] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" started
Apr 18 08:00:10.359 - 7s    I e2e-test/"[sig-instrumentation] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present [Late] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" e2e test finished As "Passed"
Apr 18 08:00:10.359 I e2e-test/"[sig-instrumentation][Late] Alerts shouldn't exceed the 500 series limit of total series sent via telemetry from each cluster [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" started
Apr 18 08:00:10.359 - 7s    I e2e-test/"[sig-instrumentation][Late] Alerts shouldn't exceed the 500 series limit of total series sent via telemetry from each cluster [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" e2e test finished As "Passed"
Apr 18 08:00:10.941 I e2e-test/"[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" started
Apr 18 08:00:10.941 - 7s    E e2e-test/"[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" e2e test finished As "Failed"
Apr 18 08:00:17.979 I e2e-test/"[sig-instrumentation][Late] Alerts shouldn't exceed the 500 series limit of total series sent via telemetry from each cluster [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" finishedStatus/Passed
Apr 18 08:00:18.277 I e2e-test/"[sig-instrumentation] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present [Late] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" finishedStatus/Passed
Apr 18 08:00:18.666 E e2e-test/"[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" finishedStatus/Failed
[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
started: (0/2/333) "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"
skipped: (54.2s) 2022-04-18T08:24:07 "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"
Apr 18 08:23:12.818 I e2e-test/"[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" started
Apr 18 08:23:12.818 - 54s   I e2e-test/"[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" e2e test finished As "Skipped"
Apr 18 08:24:07.055 I e2e-test/"[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" finishedStatus/Skipped

alert HighOverallControlPlaneCPU fired for 180 seconds with labels: {namespace="openshift-kube-apiserver", severity="warning"} (allowed: high CPU utilization during e2e runs is normal)
Apr 18 16:00:18.630: FAIL: Unexpected alerts fired or pending after the test run:

alert CannotRetrieveUpdates fired for 3118 seconds with labels: {endpoint="metrics", instance="10.253.24.6:9099", job="cluster-version-operator", namespace="openshift-cluster-version", pod="cluster-version-operator-57f968f56-mv9s8", service="cluster-version-operator", severity="warning"}
alert SystemMemoryExceedsReservation fired for 2526 seconds with labels: {node="master1", severity="warning"}
alert etcdMemberCommunicationSlow fired for 30 seconds with labels: {To="cbe753567cf13352", endpoint="etcd-metrics", instance="10.253.24.6:9979", job="etcd", namespace="openshift-etcd", pod="etcd-master2", service="etcd", severity="warning"}

Full Stack Trace
github.com/onsi/ginkgo/internal/leafnodes.(*runner).runSync(0xc001983f20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:113 +0xa3
github.com/onsi/ginkgo/internal/leafnodes.(*runner).run(0xc001983f20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:64 +0x15c
github.com/onsi/ginkgo/internal/leafnodes.(*ItNode).Run(0xc001986120, 0x8f27f00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/it_node.go:26 +0x87
github.com/onsi/ginkgo/internal/spec.(*Spec).runSample(0xc0028d6f00, 0x0, 0x8f27f00, 0xc00038a7c0)
	github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/spec/spec.go:215 +0x72f
github.com/onsi/ginkgo/internal/spec.(*Spec).Run(0xc0028d6f00, 0x8f27f00, 0xc00038a7c0)
	github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/spec/spec.go:138 +0xf2
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpec(0xc001fbcc80, 0xc0028d6f00, 0x0)
	github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/specrunner/spec_runner.go:200 +0x111
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpecs(0xc001fbcc80, 0x1)
	github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/specrunner/spec_runner.go:170 +0x147
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).Run(0xc001fbcc80, 0xc002c35398)
	github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/specrunner/spec_runner.go:66 +0x117
github.com/onsi/ginkgo/internal/suite.(*Suite).Run(0xc000352870, 0x8f281c0, 0xc001cfee10, 0x0, 0x0, 0xc000796530, 0x1, 0x1, 0x90018d8, 0xc00038a7c0, ...)
	github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/suite/suite.go:62 +0x426
github.com/openshift/origin/pkg/test/ginkgo.(*TestOptions).Run(0xc001dae7e0, 0xc001372c40, 0x1, 0x1, 0x83256a1, 0x4a53120)
	github.com/openshift/origin/pkg/test/ginkgo/cmd_runtest.go:61 +0x418
main.newRunTestCommand.func1.1()
	github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:426 +0x4e
github.com/openshift/origin/test/extended/util.WithCleanup(0xc001abfc18)
	github.com/openshift/origin/test/extended/util/test.go:168 +0x5f
main.newRunTestCommand.func1(0xc001d91180, 0xc001372c40, 0x1, 0x1, 0x0, 0x0)
	github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:426 +0x333
github.com/spf13/cobra.(*Command).execute(0xc001d91180, 0xc001372be0, 0x1, 0x1, 0xc001d91180, 0xc001372be0)
	github.com/spf13/cobra@v1.1.3/command.go:852 +0x472
github.com/spf13/cobra.(*Command).ExecuteC(0xc001d90780, 0x0, 0x8f30f20, 0xbfdc960)
	github.com/spf13/cobra@v1.1.3/command.go:960 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.1.3/command.go:897
main.main.func1(0xc001d90780, 0x0, 0x0)
	github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:84 +0x94
main.main()
	github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:85 +0x42c
[AfterEach] [sig-instrumentation][Late] Alerts
  github.com/openshift/origin/test/extended/util/client.go:140
STEP: Collecting events from namespace "e2e-test-prometheus-ptvn2".
STEP: Found 5 events.
Apr 18 16:00:18.641: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for execpod: { } Scheduled: Successfully assigned e2e-test-prometheus-ptvn2/execpod to master1
Apr 18 16:00:18.641: INFO: At 2022-04-18 16:00:16 +0800 CST - event for execpod: {multus } AddedInterface: Add eth0 [21.100.1.147/23] from ovn-kubernetes
Apr 18 16:00:18.641: INFO: At 2022-04-18 16:00:16 +0800 CST - event for execpod: {kubelet master1} Pulled: Container image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest" already present on machine
Apr 18 16:00:18.641: INFO: At 2022-04-18 16:00:16 +0800 CST - event for execpod: {kubelet master1} Created: Created container agnhost-container
Apr 18 16:00:18.641: INFO: At 2022-04-18 16:00:16 +0800 CST - event for execpod: {kubelet master1} Started: Started container agnhost-container
Apr 18 16:00:18.643: INFO: POD      NODE     PHASE    GRACE  CONDITIONS
Apr 18 16:00:18.643: INFO: execpod  master1  Running  1s     [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-04-18 16:00:13 +0800 CST  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2022-04-18 16:00:17 +0800 CST  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2022-04-18 16:00:17 +0800 CST  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-04-18 16:00:13 +0800 CST  }]
Apr 18 16:00:18.643: INFO: 
Apr 18 16:00:18.646: INFO: skipping dumping cluster info - cluster too large
[AfterEach] [sig-instrumentation][Late] Alerts
  github.com/openshift/origin/test/extended/util/client.go:141
STEP: Destroying namespace "e2e-test-prometheus-ptvn2" for this suite.
fail [github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Apr 18 16:00:18.630: Unexpected alerts fired or pending after the test run:

alert CannotRetrieveUpdates fired for 3118 seconds with labels: {endpoint="metrics", instance="10.253.24.6:9099", job="cluster-version-operator", namespace="openshift-cluster-version", pod="cluster-version-operator-57f968f56-mv9s8", service="cluster-version-operator", severity="warning"}
alert SystemMemoryExceedsReservation fired for 2526 seconds with labels: {node="master1", severity="warning"}
alert etcdMemberCommunicationSlow fired for 30 seconds with labels: {To="cbe753567cf13352", endpoint="etcd-metrics", instance="10.253.24.6:9979", job="etcd", namespace="openshift-etcd", pod="etcd-master2", service="etcd", severity="warning"}

failed: (7.7s) 2022-04-18T08:00:18 "[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"


Allowed alerts:

Apr 18 16:00:18.630: INFO: Alerts were detected during test run which are allowed:

alert HighOverallControlPlaneCPU fired for 180 seconds with labels: {namespace="openshift-kube-apiserver", severity="warning"} (allowed: high CPU utilization during e2e runs is normal)


 

Disallowed alerts:

Apr 18 16:00:18.630: FAIL: Unexpected alerts fired or pending after the test run:

alert CannotRetrieveUpdates fired for 3118 seconds with labels: {endpoint="metrics", instance="10.253.24.6:9099", job="cluster-version-operator", namespace="openshift-cluster-version", pod="cluster-version-operator-57f968f56-mv9s8", service="cluster-version-operator", severity="warning"}
alert SystemMemoryExceedsReservation fired for 2526 seconds with labels: {node="master1", severity="warning"}
alert etcdMemberCommunicationSlow fired for 30 seconds with labels: {To="cbe753567cf13352", endpoint="etcd-metrics", instance="10.253.24.6:9979", job="etcd", namespace="openshift-etcd", pod="etcd-master2", service="etcd", severity="warning"}
 

Problem analysis:

etcdMemberCommunicationSlow

[root@master0 zsl]# kubectl  -n openshift-etcd get pods -o wide
NAME                                 READY   STATUS      RESTARTS   AGE    IP            NODE      NOMINATED NODE   READINESS GATES
etcd-master0                         4/4     Running     0          3d2h   10.253.24.4   master0   <none>           <none>
etcd-master1                         4/4     Running     0          3d2h   10.253.24.5   master1   <none>           <none>
etcd-master2                         4/4     Running     0          3d2h   10.253.24.6   master2   <none>           <none>
etcd-quorum-guard-6df5f57df7-cx8vf   1/1     Running     0          144m   10.253.24.5   master1   <none>           <none>
etcd-quorum-guard-6df5f57df7-rwsrk   1/1     Running     0          3d2h   10.253.24.4   master0   <none>           <none>
etcd-quorum-guard-6df5f57df7-vpgkv   1/1     Running     0          3d2h   10.253.24.6   master2   <none>           <none>

[root@master0 ~]# kubectl  -n openshift-etcd  get endpoints -o wide
NAME   ENDPOINTS                                                        AGE
etcd   10.253.24.4:2379,10.253.24.5:2379,10.253.24.6:2379 + 3 more...   3d3h
[root@master0 ~]# kubectl  -n openshift-etcd  describe endpoint etcd
error: the server doesn't have a resource type "endpoint"
[root@master0 ~]# kubectl  -n openshift-etcd  describe endpoints etcd
Name:         etcd
Namespace:    openshift-etcd
Labels:       k8s-app=etcd
Annotations:  <none>
Subsets:
  Addresses:          10.253.24.4,10.253.24.5,10.253.24.6
  NotReadyAddresses:  <none>
  Ports:
    Name          Port  Protocol
    ----          ----  --------
    etcd          2379  TCP
    etcd-metrics  9979  TCP

Events:  <none>

Prometheus Cluster Monitoring | Configuring Clusters | OpenShift Container Platform 3.11

EtcdMemberCommunicationSlow (severity: warning): Etcd cluster "Job": member communication with To is taking X_s on etcd instance _Instance.

https://github.com/openshift/cluster-etcd-operator/blob/master/manifests/0000_90_etcd-operator_03_prometheusrule.yaml

    - alert: etcdMemberCommunicationSlow
      annotations:
        description: 'etcd cluster "{{ $labels.job }}": member communication with
          {{ $labels.To }} is taking {{ $value }}s on etcd instance {{ $labels.instance
          }}.'
        summary: etcd cluster member communication is slow.
      expr: |
        histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~".*etcd.*"}[5m]))
        > 0.15
      for: 10m
      labels:
        severity: warning

Histograms and summaries | Prometheus
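To see how far the cluster currently is from the 0.15s threshold, the alert expression can be evaluated by hand. This sketch reuses the TOKEN/HOST variables from the earlier Thanos query example (same assumptions apply):

~~~
# Sketch only: reproduce the etcdMemberCommunicationSlow expression.
curl -G -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~".*etcd.*"}[5m]))'
# Per-peer p99 round-trip times above 0.15s sustained for 10 minutes fire the
# alert; compare the values against raw disk and network latency on the
# masters to judge whether the slowness is disk- or network-bound.
~~~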

etcd handling:

  1. etcd is sensitive to disk I/O, so the hardware baseline (disk and network performance) has to be pinned down before chasing this alert further.

Handling of the other alerts:

CannotRetrieveUpdates:

1927903 – "CannotRetrieveUpdates" - critical error in openshift web console

https://github.com/openshift/cluster-version-operator/blob/master/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L52-L59

- alert: CannotRetrieveUpdates
  annotations:
    summary: Cluster version operator has not retrieved updates in {{ "{{ $value | humanizeDuration }}" }}.
    description: Failure to retrieve updates means that cluster administrators will need to monitor for available updates on their own or risk falling behind on security or other bugfixes. If the failure is expected, you can clear spec.channel in the ClusterVersion object to tell the cluster-version operator to not retrieve updates. Failure reason {{ "{{ with $cluster_operator_conditions := \"cluster_operator_conditions\" | query}}{{range $value := .}}{{if and (eq (label \"name\" $value) \"version\") (eq (label \"condition\" $value) \"RetrievedUpdates\") (eq (label \"endpoint\" $value) \"metrics\") (eq (value $value) 0.0)}}{{label \"reason\" $value}} {{end}}{{end}}{{end}}" }}. {{ "{{ with $console_url := \"console_url\" | query }}{{ if ne (len (label \"url\" (first $console_url ) ) ) 0}} For more information refer to {{ label \"url\" (first $console_url ) }}/settings/cluster/.{{ end }}{{ end }}" }}
  expr: |
    (time()-cluster_version_operator_update_retrieval_timestamp_seconds) >= 3600 and ignoring(condition, name, reason) cluster_operator_conditions{name="version", condition="RetrievedUpdates", endpoint="metrics", reason!="NoChannel"}
  labels:
    severity: warning


https://github.com/openshift/okd/blob/master/KNOWN_ISSUES.md#cannotretrieveupdates-alert
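
If the cluster intentionally has no reachable update source, the alert description above already points at the fix: clear spec.channel on the ClusterVersion object so the cluster-version operator stops trying to retrieve updates. A minimal sketch:

oc patch clusterversion version --type merge -p '{"spec":{"channel":""}}'
# confirm the channel is now empty
oc get clusterversion version -o jsonpath='{.spec.channel}{"\n"}'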

KubeContainerWaiting

1976940 – GCP RT CI failing on firing KubeContainerWaiting due to liveness and readiness probes timing out

    - alert: KubeContainerWaiting
      annotations:
        description: pod/{{ $labels.pod }} in namespace {{ $labels.namespace }} on
          container {{ $labels.container}} has been in waiting state for longer than
          1 hour.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecontainerwaiting
        summary: Pod container waiting longer than 1 hour
      expr: |
        sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"}) > 0
      for: 1h
      labels:
        severity: warning

https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/kubernetesControlPlane-prometheusRule.yaml#L169-L180
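
To see which containers are currently stuck in a waiting state (the raw condition behind this alert), a minimal sketch, assuming jq is installed on the host running oc:

oc get pods -A -o json | jq -r '.items[]
  | .metadata.namespace as $ns | .metadata.name as $pod
  | (.status.containerStatuses // [])[]
  | select(.state.waiting != null)
  | "\($ns)/\($pod) \(.name): \(.state.waiting.reason)"'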

KubePodNotReady

    - alert: KubePodNotReady
      annotations:
        description: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready
          state for longer than 15 minutes.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodnotready
        summary: Pod has been in a non-ready state for more than 15 minutes.
      expr: |
        sum by (namespace, pod) (
          max by(namespace, pod) (
            kube_pod_status_phase{job="kube-state-metrics", phase=~"Pending|Unknown"}
          ) * on(namespace, pod) group_left(owner_kind) topk by(namespace, pod) (
            1, max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})
          )
        ) > 0
      for: 15m
      labels:
        severity: warning


https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/kubernetesControlPlane-prometheusRule.yaml#L26-L42

https://github.com/openshift/cluster-monitoring-operator/issues/72

A pod has been in a non-ready state for more than 15 minutes.
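
A quick way to list the pods currently matching this condition; the two phases mirror the Pending|Unknown selector in the alert expression above:

oc get pods -A --field-selector=status.phase=Pending
oc get pods -A --field-selector=status.phase=Unknown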

Resolution:

三. [sig-cli] oc observe works as expected [Suite:openshift/conformance/parallel]

Problem description:

error: command "/bin/sh" exited with status code 1

error: command "/bin/sh" exited with status code 2

Serial execution on the cluster, no errors:

	Line 2075: started: (3/233/333) "[sig-cli] Kubectl client Kubectl taint [Serial] should update the taint on a node [Suite:openshift/conformance/serial] [Suite:k8s]"
	Line 2077: passed: (5.1s) 2022-04-18T07:30:58 "[sig-cli] Kubectl client Kubectl taint [Serial] should update the taint on a node [Suite:openshift/conformance/serial] [Suite:k8s]"
	Line 2125: started: (3/242/333) "[sig-cli] Kubectl client Kubectl taint [Serial] should remove all the taints with the same key off a node [Skipped:SingleReplicaTopology] [Suite:openshift/conformance/serial] [Suite:k8s]"
	Line 2127: passed: (5.9s) 2022-04-18T07:33:11 "[sig-cli] Kubectl client Kubectl taint [Serial] should remove all the taints with the same key off a node [Skipped:SingleReplicaTopology] [Suite:openshift/conformance/serial] [Suite:k8s]"
	Line 4097045: started: (7/318/333) "[sig-cli] oc adm cluster-role-reapers [Serial] [Suite:openshift/conformance/serial]"
	Line 4097047: passed: (12.4s) 2022-04-18T08:00:03 "[sig-cli] oc adm cluster-role-reapers [Serial] [Suite:openshift/conformance/serial]"
	Line 4102157: Apr 18 07:30:53.319 I e2e-test/"[sig-cli] Kubectl client Kubectl taint [Serial] should update the taint on a node [Suite:openshift/conformance/serial] [Suite:k8s]" started
	Line 4102158: Apr 18 07:30:53.319 - 5s    I e2e-test/"[sig-cli] Kubectl client Kubectl taint [Serial] should update the taint on a node [Suite:openshift/conformance/serial] [Suite:k8s]" e2e test finished As "Passed"
	Line 4102170: Apr 18 07:30:58.468 I e2e-test/"[sig-cli] Kubectl client Kubectl taint [Serial] should update the taint on a node [Suite:openshift/conformance/serial] [Suite:k8s]" finishedStatus/Passed
	Line 4102501: Apr 18 07:33:05.954 I e2e-test/"[sig-cli] Kubectl client Kubectl taint [Serial] should remove all the taints with the same key off a node [Skipped:SingleReplicaTopology] [Suite:openshift/conformance/serial] [Suite:k8s]" started
	Line 4102502: Apr 18 07:33:05.954 - 5s    I e2e-test/"[sig-cli] Kubectl client Kubectl taint [Serial] should remove all the taints with the same key off a node [Skipped:SingleReplicaTopology] [Suite:openshift/conformance/serial] [Suite:k8s]" e2e test finished As "Passed"
	Line 4102687: Apr 18 07:33:11.902 I e2e-test/"[sig-cli] Kubectl client Kubectl taint [Serial] should remove all the taints with the same key off a node [Skipped:SingleReplicaTopology] [Suite:openshift/conformance/serial] [Suite:k8s]" finishedStatus/Passed
	Line 4127678: Apr 18 07:59:51.543 I e2e-test/"[sig-cli] oc adm cluster-role-reapers [Serial] [Suite:openshift/conformance/serial]" started
	Line 4127679: Apr 18 07:59:51.543 - 12s   I e2e-test/"[sig-cli] oc adm cluster-role-reapers [Serial] [Suite:openshift/conformance/serial]" e2e test finished As "Passed"
	Line 4127720: Apr 18 08:00:03.941 I e2e-test/"[sig-cli] oc adm cluster-role-reapers [Serial] [Suite:openshift/conformance/serial]" finishedStatus/Passed

Parallel execution on the cluster, errors observed:

started: (18/2732/2748) "[sig-cli] oc observe works as expected [Suite:openshift/conformance/parallel]"

passed: (11.9s) 2022-04-18T07:55:16 "[sig-imageregistry][Feature:ImageInfo] Image info should display information about images [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"

started: (18/2733/2748) "[sig-auth][Feature:OpenShiftAuthorization] RBAC proxy for openshift authz  RunLegacyLocalRoleEndpoint should succeed [Suite:openshift/conformance/parallel]"

passed: (13.7s) 2022-04-18T07:55:17 "[sig-apps][Feature:DeploymentConfig] deploymentconfigs with failing hook should get all logs from retried hooks [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"

started: (18/2734/2748) "[sig-arch] Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel]"

[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1453
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1453
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/test.go:61
[BeforeEach] [sig-auth][Feature:OAuthServer] [Token Expiration]
  github.com/openshift/origin/test/extended/util/client.go:142
STEP: Creating a kubernetes client
[BeforeEach] [sig-auth][Feature:OAuthServer] [Token Expiration]
  github.com/openshift/origin/test/extended/util/client.go:116
Apr 18 15:54:02.819: INFO: configPath is now "/tmp/configfile055747163"
Apr 18 15:54:02.820: INFO: The user is now "e2e-test-oauth-expiration-m5ws9-user"
Apr 18 15:54:02.820: INFO: Creating project "e2e-test-oauth-expiration-m5ws9"
Apr 18 15:54:02.998: INFO: Waiting on permissions in project "e2e-test-oauth-expiration-m5ws9" ...
Apr 18 15:54:03.002: INFO: Waiting for ServiceAccount "default" to be provisioned...
Apr 18 15:54:03.121: INFO: Waiting for service account "default" secrets (default-dockercfg-svzlh,default-dockercfg-svzlh) to include dockercfg/token ...
Apr 18 15:54:03.261: INFO: Waiting for ServiceAccount "deployer" to be provisioned...
Apr 18 15:54:03.387: INFO: Waiting for ServiceAccount "builder" to be provisioned...
Apr 18 15:54:03.506: INFO: Waiting for RoleBinding "system:image-pullers" to be provisioned...
Apr 18 15:54:03.520: INFO: Waiting for RoleBinding "system:image-builders" to be provisioned...
Apr 18 15:54:03.780: INFO: Waiting for RoleBinding "system:deployers" to be provisioned...
Apr 18 15:54:04.753: INFO: Project "e2e-test-oauth-expiration-m5ws9" has been fully provisioned.
[BeforeEach] [sig-auth][Feature:OAuthServer] [Token Expiration]
  github.com/openshift/origin/test/extended/oauth/expiration.go:30
Apr 18 15:54:04.761: INFO: Running 'oc --namespace=e2e-test-oauth-expiration-m5ws9 --kubeconfig=/root/.kube/config create -f /tmp/fixture-testdata-dir977370604/test/extended/testdata/oauthserver/cabundle-cm.yaml'
configmap/service-ca created
Apr 18 15:54:05.025: INFO: Created resources defined in cabundle-cm.yaml
Apr 18 15:54:05.025: INFO: Running 'oc --namespace=e2e-test-oauth-expiration-m5ws9 --kubeconfig=/root/.kube/config create -f /tmp/fixture-testdata-dir977370604/test/extended/testdata/oauthserver/oauth-sa.yaml'
serviceaccount/e2e-oauth created
Apr 18 15:54:05.205: INFO: Created resources defined in oauth-sa.yaml
Apr 18 15:54:05.205: INFO: Running 'oc --namespace=e2e-test-oauth-expiration-m5ws9 --kubeconfig=/root/.kube/config create -f /tmp/fixture-testdata-dir977370604/test/extended/testdata/oauthserver/oauth-network.yaml'
service/test-oauth-svc created
route.route.openshift.io/test-oauth-route created
Apr 18 15:54:05.497: INFO: Created resources defined in oauth-network.yaml
Apr 18 15:54:05.522: INFO: Created: ClusterRoleBinding e2e-test-oauth-expiration-m5ws9
Apr 18 15:54:05.574: INFO: Created:  /htpasswd
Apr 18 15:54:05.583: INFO: Created: Secret e2e-test-oauth-expiration-m5ws9/session-secret
Apr 18 15:54:05.685: INFO: Created: ConfigMap e2e-test-oauth-expiration-m5ws9/oauth-config
Apr 18 15:54:05.799: INFO: Created: Pod e2e-test-oauth-expiration-m5ws9/test-oauth-server
Apr 18 15:54:05.799: INFO: Waiting for user 'system:serviceaccount:e2e-test-oauth-expiration-m5ws9:e2e-oauth' to be authorized to * the * resource
Apr 18 15:54:05.803: INFO: Waiting for the OAuth server pod to be ready
Apr 18 15:54:05.826: INFO: OAuth server pod is not ready: 
Container statuses: ([]v1.ContainerStatus) <nil>

Apr 18 15:54:06.889: INFO: OAuth server pod is not ready: 
Container statuses: ([]v1.ContainerStatus) (len=1 cap=1) {
 (v1.ContainerStatus) &ContainerStatus{Name:oauth-server,State:ContainerState{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,},Running:nil,Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ImageID:,ContainerID:,Started:*false,}
}

Apr 18 15:54:07.844: INFO: OAuth server pod is not ready: 
Container statuses: ([]v1.ContainerStatus) (len=1 cap=1) {
 (v1.ContainerStatus) &ContainerStatus{Name:oauth-server,State:ContainerState{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,},Running:nil,Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ImageID:,ContainerID:,Started:*false,}
}

Apr 18 15:54:08.849: INFO: OAuth server pod is not ready: 
Container statuses: ([]v1.ContainerStatus) (len=1 cap=1) {
 (v1.ContainerStatus) &ContainerStatus{Name:oauth-server,State:ContainerState{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,},Running:nil,Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ImageID:,ContainerID:,Started:*false,}
}

Apr 18 15:54:09.841: INFO: OAuth server pod is not ready: 
Container statuses: ([]v1.ContainerStatus) (len=1 cap=1) {
 (v1.ContainerStatus) &ContainerStatus{Name:oauth-server,State:ContainerState{Waiting:nil,Running:&ContainerStateRunning{StartedAt:2022-04-18 15:54:08 +0800 CST,},Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ImageID:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ContainerID:cri-o://71f41ac0b4072de72128c36923f95adfc0bb05a2647067cc3f97e1245b188b3d,Started:*true,}
}

Apr 18 15:54:10.863: INFO: OAuth server pod is not ready: 
Container statuses: ([]v1.ContainerStatus) (len=1 cap=1) {
 (v1.ContainerStatus) &ContainerStatus{Name:oauth-server,State:ContainerState{Waiting:nil,Running:&ContainerStateRunning{StartedAt:2022-04-18 15:54:08 +0800 CST,},Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ImageID:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ContainerID:cri-o://71f41ac0b4072de72128c36923f95adfc0bb05a2647067cc3f97e1245b188b3d,Started:*true,}
}

Apr 18 15:54:11.895: INFO: OAuth server pod is not ready: 
Container statuses: ([]v1.ContainerStatus) (len=1 cap=1) {
 (v1.ContainerStatus) &ContainerStatus{Name:oauth-server,State:ContainerState{Waiting:nil,Running:&ContainerStateRunning{StartedAt:2022-04-18 15:54:08 +0800 CST,},Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ImageID:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ContainerID:cri-o://71f41ac0b4072de72128c36923f95adfc0bb05a2647067cc3f97e1245b188b3d,Started:*true,}
}

Apr 18 15:54:12.833: INFO: OAuth server pod is not ready: 
Container statuses: ([]v1.ContainerStatus) (len=1 cap=1) {
 (v1.ContainerStatus) &ContainerStatus{Name:oauth-server,State:ContainerState{Waiting:nil,Running:&ContainerStateRunning{StartedAt:2022-04-18 15:54:08 +0800 CST,},Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ImageID:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ContainerID:cri-o://71f41ac0b4072de72128c36923f95adfc0bb05a2647067cc3f97e1245b188b3d,Started:*true,}
}

Apr 18 15:54:13.845: INFO: OAuth server pod is not ready: 
Container statuses: ([]v1.ContainerStatus) (len=1 cap=1) {
 (v1.ContainerStatus) &ContainerStatus{Name:oauth-server,State:ContainerState{Waiting:nil,Running:&ContainerStateRunning{StartedAt:2022-04-18 15:54:08 +0800 CST,},Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ImageID:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ContainerID:cri-o://71f41ac0b4072de72128c36923f95adfc0bb05a2647067cc3f97e1245b188b3d,Started:*true,}
}

Apr 18 15:54:14.852: INFO: OAuth server pod is not ready: 
Container statuses: ([]v1.ContainerStatus) (len=1 cap=1) {
 (v1.ContainerStatus) &ContainerStatus{Name:oauth-server,State:ContainerState{Waiting:nil,Running:&ContainerStateRunning{StartedAt:2022-04-18 15:54:08 +0800 CST,},Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ImageID:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ContainerID:cri-o://71f41ac0b4072de72128c36923f95adfc0bb05a2647067cc3f97e1245b188b3d,Started:*true,}
}

Apr 18 15:54:15.875: INFO: OAuth server pod is not ready: 
Container statuses: ([]v1.ContainerStatus) (len=1 cap=1) {
 (v1.ContainerStatus) &ContainerStatus{Name:oauth-server,State:ContainerState{Waiting:nil,Running:&ContainerStateRunning{StartedAt:2022-04-18 15:54:08 +0800 CST,},Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ImageID:image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596,ContainerID:cri-o://71f41ac0b4072de72128c36923f95adfc0bb05a2647067cc3f97e1245b188b3d,Started:*true,}
}

Apr 18 15:54:16.875: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:16.878: INFO: Waiting for the OAuth server route to be ready: EOF
Apr 18 15:54:17.879: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:17.881: INFO: Waiting for the OAuth server route to be ready: EOF
Apr 18 15:54:18.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:18.882: INFO: Waiting for the OAuth server route to be ready: EOF
Apr 18 15:54:19.880: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:19.884: INFO: Waiting for the OAuth server route to be ready: EOF
Apr 18 15:54:20.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:20.901: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:21.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:21.884: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:22.925: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:22.935: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:23.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:23.900: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:24.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:24.892: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:25.886: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:25.902: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:26.887: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:26.901: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:27.879: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:27.910: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:28.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:28.890: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:29.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:29.893: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:30.879: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:30.886: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:31.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:31.891: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:32.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:32.884: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:33.883: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:33.906: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:34.881: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:34.894: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:35.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:35.884: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:36.880: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:36.891: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:37.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:37.887: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:38.879: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:38.924: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:39.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:39.888: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:40.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:40.884: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:41.880: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:41.887: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:42.896: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:42.907: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:43.880: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:43.889: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:44.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:44.890: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:45.884: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:45.898: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:46.885: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:46.928: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:47.883: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:47.910: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:48.880: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:48.887: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:49.879: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:49.885: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:50.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:50.905: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:51.880: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:51.899: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:52.889: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:52.905: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:53.881: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:53.891: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:54.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:54.888: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:55.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:55.883: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:56.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:56.900: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:57.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:57.898: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:58.879: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:58.888: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:54:59.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:54:59.902: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:00.879: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:00.889: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:01.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:01.888: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:02.879: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:02.890: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:03.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:03.883: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:04.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:04.884: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:05.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:05.886: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:06.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:06.886: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:07.881: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:07.888: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:08.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:08.888: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:09.879: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:09.887: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:10.880: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:10.891: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:11.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:11.893: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:12.879: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:12.886: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:13.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:13.886: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:14.878: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:14.883: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:15.879: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:15.896: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:16.882: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:16.901: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
Apr 18 15:55:16.901: INFO: Waiting for the OAuth server route to be ready
Apr 18 15:55:16.917: INFO: Waiting for the OAuth server route to be ready: x509: certificate signed by unknown authority
[AfterEach] [sig-auth][Feature:OAuthServer] [Token Expiration]
  github.com/openshift/origin/test/extended/util/client.go:140
STEP: Collecting events from namespace "e2e-test-oauth-expiration-m5ws9".
STEP: Found 5 events.
Apr 18 15:55:16.936: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for test-oauth-server: { } Scheduled: Successfully assigned e2e-test-oauth-expiration-m5ws9/test-oauth-server to master2
Apr 18 15:55:16.936: INFO: At 2022-04-18 15:54:08 +0800 CST - event for test-oauth-server: {multus } AddedInterface: Add eth0 [21.100.3.47/23] from ovn-kubernetes
Apr 18 15:55:16.936: INFO: At 2022-04-18 15:54:08 +0800 CST - event for test-oauth-server: {kubelet master2} Pulled: Container image "image.cestc.cn/ccos-ceake/oauth-server@sha256:fca7bab88904f8309e75248f84d07a71769a81bcd9d79cf1b61096086a4c8596" already present on machine
Apr 18 15:55:16.937: INFO: At 2022-04-18 15:54:08 +0800 CST - event for test-oauth-server: {kubelet master2} Created: Created container oauth-server
Apr 18 15:55:16.937: INFO: At 2022-04-18 15:54:08 +0800 CST - event for test-oauth-server: {kubelet master2} Started: Started container oauth-server
Apr 18 15:55:16.968: INFO: POD                NODE     PHASE    GRACE  CONDITIONS
Apr 18 15:55:16.968: INFO: test-oauth-server  master2  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-04-18 15:54:05 +0800 CST  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2022-04-18 15:54:15 +0800 CST  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2022-04-18 15:54:15 +0800 CST  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-04-18 15:54:05 +0800 CST  }]
Apr 18 15:55:16.968: INFO: 
Apr 18 15:55:16.985: INFO: skipping dumping cluster info - cluster too large
Apr 18 15:55:17.005: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-oauth-expiration-m5ws9-user}, err: <nil>
Apr 18 15:55:17.041: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-oauth-expiration-m5ws9}, err: <nil>
Apr 18 15:55:17.064: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  sha256~JlfYVET_sRJmauqiEQnJ9Yh9UxhOg9P4zI_7ffYAka8}, err: <nil>
[AfterEach] [sig-auth][Feature:OAuthServer] [Token Expiration]
  github.com/openshift/origin/test/extended/util/client.go:141
STEP: Destroying namespace "e2e-test-oauth-expiration-m5ws9" for this suite.
[AfterEach] [sig-auth][Feature:OAuthServer] [Token Expiration]
  github.com/openshift/origin/test/extended/oauth/expiration.go:36
Apr 18 15:55:17.082: INFO: Running 'oc --namespace= --kubeconfig=/root/.kube/config delete clusterrolebindings.rbac.authorization.k8s.io e2e-test-oauth-expiration-m5ws9'
clusterrolebinding.rbac.authorization.k8s.io "e2e-test-oauth-expiration-m5ws9" deleted
fail [github.com/openshift/origin/test/extended/oauth/expiration.go:33]: Unexpected error:
    <*errors.errorString | 0xc0002feb10>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

failed: (1m16s) 2022-04-18T07:55:17 "[sig-auth][Feature:OAuthServer] [Token Expiration] Using a OAuth client with a non-default token max age to generate tokens that expire shortly works as expected when using a token authorization flow [Suite:openshift/conformance/parallel]"

passed: (2m14s) 2022-04-18T07:55:17 "[sig-builds][Feature:Builds][timing] capture build stages and durations  should record build stages and durations for docker [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"

passed: (2.1s) 2022-04-18T07:55:18 "[sig-auth][Feature:OpenShiftAuthorization] RBAC proxy for openshift authz  RunLegacyLocalRoleEndpoint should succeed [Suite:openshift/conformance/parallel]"

passed: (2.1s) 2022-04-18T07:55:19 "[sig-arch] Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel]"

passed: (1m52s) 2022-04-18T07:55:24 "[sig-builds][Feature:Builds] prune builds based on settings in the buildconfig  should prune failed builds based on the failedBuildsHistoryLimit setting [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"

passed: (2m34s) 2022-04-18T07:55:24 "[sig-network-edge][Conformance][Area:Networking][Feature:Router] The HAProxy router should be able to connect to a service that is idled because a GET on the route will unidle it [Skipped:Disconnected] [Suite:openshift/conformance/parallel/minimal]"

passed: (1m53s) 2022-04-18T07:55:25 "[sig-network][Feature:Router] The HAProxy router should serve the correct routes when scoped to a single namespace and label set [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"

passed: (2m4s) 2022-04-18T07:55:26 "[sig-apps][Feature:DeploymentConfig] deploymentconfigs keep the deployer pod invariant valid should deal with cancellation of running deployment [Suite:openshift/conformance/parallel]"

passed: (1m45s) 2022-04-18T07:55:27 "[sig-network] services when using OpenshiftSDN in a mode that does not isolate namespaces by default should allow connections to pods in different namespaces on different nodes via service IPs [Suite:openshift/conformance/parallel]"

[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1453
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1453
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/test.go:61
[BeforeEach] [sig-cli] oc observe
  github.com/openshift/origin/test/extended/util/client.go:142
STEP: Creating a kubernetes client
[It] works as expected [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/cli/observe.go:17
STEP: basic scenarios
Apr 18 15:55:17.116: INFO: Running 'oc --kubeconfig=/root/.kube/config observe'
Apr 18 15:55:17.219: INFO: Error running /usr/bin/oc --kubeconfig=/root/.kube/config observe:
StdOut>
error: you must specify at least one argument containing the resource to observe
StdErr>
error: you must specify at least one argument containing the resource to observe

Apr 18 15:55:17.219: INFO: Running 'oc --kubeconfig=/root/.kube/config observe serviceaccounts --once'
Apr 18 15:55:17.446: INFO: Running 'oc --kubeconfig=/root/.kube/config observe daemonsets --once'
Apr 18 15:55:17.749: INFO: Running 'oc --kubeconfig=/root/.kube/config observe clusteroperators --once'
Apr 18 15:55:17.992: INFO: Running 'oc --kubeconfig=/root/.kube/config observe services --once --all-namespaces'
Apr 18 15:55:18.198: INFO: Running 'oc --kubeconfig=/root/.kube/config observe services --once --all-namespaces --print-metrics-on-exit'
Apr 18 15:55:18.465: INFO: Running 'oc --kubeconfig=/root/.kube/config observe services --once --names echo'
Apr 18 15:55:18.852: INFO: Error running /usr/bin/oc --kubeconfig=/root/.kube/config observe services --once --names echo:
StdOut>
error: --delete and --names must both be specified
StdErr>
error: --delete and --names must both be specified

Apr 18 15:55:18.852: INFO: Running 'oc --kubeconfig=/root/.kube/config observe services --exit-after=1s'
Apr 18 15:55:20.116: INFO: Running 'oc --kubeconfig=/root/.kube/config observe services --exit-after=3s --all-namespaces --print-metrics-on-exit'
Apr 18 15:55:23.258: INFO: Running 'oc --kubeconfig=/root/.kube/config observe services --exit-after=3s --all-namespaces --names echo --names default/notfound --delete echo --delete remove'
STEP: error counting
Apr 18 15:55:26.597: INFO: Running 'oc --kubeconfig=/root/.kube/config observe services --exit-after=1m --all-namespaces --maximum-errors=1 -- /bin/sh -c exit 1'
Apr 18 15:55:26.780: INFO: Error running /usr/bin/oc --kubeconfig=/root/.kube/config observe services --exit-after=1m --all-namespaces --maximum-errors=1 -- /bin/sh -c exit 1:
StdOut>
# 2022-04-18T15:55:26+08:00 Sync started
# 2022-04-18T15:55:26+08:00 Sync 5965	/bin/sh -c "exit 1" openshift-kube-controller-manager kube-controller-manager ""
error: command "/bin/sh" exited with status code 1
error: reached maximum error limit of 1, exiting
StdErr>
# 2022-04-18T15:55:26+08:00 Sync started
# 2022-04-18T15:55:26+08:00 Sync 5965	/bin/sh -c "exit 1" openshift-kube-controller-manager kube-controller-manager ""
error: command "/bin/sh" exited with status code 1
error: reached maximum error limit of 1, exiting

Apr 18 15:55:26.780: INFO: Running 'oc --kubeconfig=/root/.kube/config observe services --exit-after=1m --all-namespaces --retry-on-exit-code=2 --maximum-errors=1 --loglevel=4 -- /bin/sh -c exit 2'
Apr 18 15:55:27.063: INFO: Error running /usr/bin/oc --kubeconfig=/root/.kube/config observe services --exit-after=1m --all-namespaces --retry-on-exit-code=2 --maximum-errors=1 --loglevel=4 -- /bin/sh -c exit 2:
StdOut>
I0418 15:55:26.970635 2404130 observe.go:438] Listening on :11251 at /metrics and /healthz
I0418 15:55:26.970715 2404130 reflector.go:255] Listing and watching <unspecified> from observer
# 2022-04-18T15:55:27+08:00 Sync started
I0418 15:55:27.030943 2404130 observe.go:648] Processing Sync []: &unstructured.Unstructured{Object:map[string]interface {}{"apiVersion":"v1", "kind":"Service", "metadata":map[string]interface {}{"annotations":map[string]interface {}{"service.alpha.openshift.io/serving-cert-signed-by":"openshift-service-serving-signer@1650006354", "service.beta.openshift.io/serving-cert-secret-name":"cluster-monitoring-operator-tls", "service.beta.openshift.io/serving-cert-signed-by":"openshift-service-serving-signer@1650006354"}, "creationTimestamp":"2022-04-15T07:06:18Z", "labels":map[string]interface {}{"app":"cluster-monitoring-operator"}, "managedFields":[]interface {}{map[string]interface {}{"apiVersion":"v1", "fieldsType":"FieldsV1", "fieldsV1":map[string]interface {}{"f:metadata":map[string]interface {}{"f:annotations":map[string]interface {}{".":map[string]interface {}{}, "f:service.beta.openshift.io/serving-cert-secret-name":map[string]interface {}{}}, "f:labels":map[string]interface {}{".":map[string]interface {}{}, "f:app":map[string]interface {}{}}}, "f:spec":map[string]interface {}{"f:clusterIP":map[string]interface {}{}, "f:internalTrafficPolicy":map[string]interface {}{}, "f:ports":map[string]interface {}{".":map[string]interface {}{}, "k:{\"port\":8443,\"protocol\":\"TCP\"}":map[string]interface {}{".":map[string]interface {}{}, "f:name":map[string]interface {}{}, "f:port":map[string]interface {}{}, "f:protocol":map[string]interface {}{}, "f:targetPort":map[string]interface {}{}}}, "f:selector":map[string]interface {}{}, "f:sessionAffinity":map[string]interface {}{}, "f:type":map[string]interface {}{}}}, "manager":"operator", "operation":"Update", "time":"2022-04-15T07:06:18Z"}, map[string]interface {}{"apiVersion":"v1", "fieldsType":"FieldsV1", "fieldsV1":map[string]interface {}{"f:metadata":map[string]interface {}{"f:annotations":map[string]interface {}{"f:service.alpha.openshift.io/serving-cert-signed-by":map[string]interface {}{}, "f:service.beta.openshift.io/serving-cert-signed-by":map[string]interface {}{}}}}, "manager":"service-ca-operator", "operation":"Update", "time":"2022-04-15T07:06:18Z"}}, "name":"cluster-monitoring-operator", "namespace":"openshift-monitoring", "resourceVersion":"6618", "uid":"f65e024e-baf4-4eb6-acce-13b913bcc13a"}, "spec":map[string]interface {}{"clusterIP":"None", "clusterIPs":[]interface {}{"None"}, "internalTrafficPolicy":"Cluster", "ipFamilies":[]interface {}{"IPv4"}, "ipFamilyPolicy":"SingleStack", "ports":[]interface {}{map[string]interface {}{"name":"https", "port":8443, "protocol":"TCP", "targetPort":"https"}}, "selector":map[string]interface {}{"app":"cluster-monitoring-operator"}, "sessionAffinity":"None", "type":"ClusterIP"}, "status":map[string]interface {}{"loadBalancer":map[string]interface {}{}}}}
# 2022-04-18T15:55:27+08:00 Sync 6618	/bin/sh -c "exit 2" openshift-monitorer'ring cluster-monitoring-operator ""
I0418 15:55:27.046210 2404130 metric.go:86] retrying command: exit status 2
I0418 15:55:27.048102 2404130 metric.go:86] retrying command: exit status 2
error: command "/bin/sh" exited with status code 2
error: reached maximum error limit of 1, exiting
StdErr>
I0418 15:55:26.970635 2404130 observe.go:438] Listening on :11251 at /metrics and /healthz
I0418 15:55:26.970715 2404130 reflector.go:255] Listing and watching <unspecified> from observer
# 2022-04-18T15:55:27+08:00 Sync started
I0418 15:55:27.030943 2404130 observe.go:648] Processing Sync []: &unstructured.Unstructured{Object:map[string]interface {}{"apiVersion":"v1", "kind":"Service", "metadata":map[string]interface {}{"annotations":map[string]interface {}{"service.alpha.openshift.io/serving-cert-signed-by":"openshift-service-serving-signer@1650006354", "service.beta.openshift.io/serving-cert-secret-name":"cluster-monitoring-operator-tls", "service.beta.openshift.io/serving-cert-signed-by":"openshift-service-serving-signer@1650006354"}, "creationTimestamp":"2022-04-15T07:06:18Z", "labels":map[string]interface {}{"app":"cluster-monitoring-operator"}, "managedFields":[]interface {}{map[string]interface {}{"apiVersion":"v1", "fieldsType":"FieldsV1", "fieldsV1":map[string]interface {}{"f:metadata":map[string]interface {}{"f:annotations":map[string]interface {}{".":map[string]interface {}{}, "f:service.beta.openshift.io/serving-cert-secret-name":map[string]interface {}{}}, "f:labels":map[string]interface {}{".":map[string]interface {}{}, "f:app":map[string]interface {}{}}}, "f:spec":map[string]interface {}{"f:clusterIP":map[string]interface {}{}, "f:internalTrafficPolicy":map[string]interface {}{}, "f:ports":map[string]interface {}{".":map[string]interface {}{}, "k:{\"port\":8443,\"protocol\":\"TCP\"}":map[string]interface {}{".":map[string]interface {}{}, "f:name":map[string]interface {}{}, "f:port":map[string]interface {}{}, "f:protocol":map[string]interface {}{}, "f:targetPort":map[string]interface {}{}}}, "f:selector":map[string]interface {}{}, "f:sessionAffinity":map[string]interface {}{}, "f:type":map[string]interface {}{}}}, "manager":"operator", "operation":"Update", "time":"2022-04-15T07:06:18Z"}, map[string]interface {}{"apiVersion":"v1", "fieldsType":"FieldsV1", "fieldsV1":map[string]interface {}{"f:metadata":map[string]interface {}{"f:annotations":map[string]interface {}{"f:service.alpha.openshift.io/serving-cert-signed-by":map[string]interface {}{}, "f:service.beta.openshift.io/serving-cert-signed-by":map[string]interface {}{}}}}, "manager":"service-ca-operator", "operation":"Update", "time":"2022-04-15T07:06:18Z"}}, "name":"cluster-monitoring-operator", "namespace":"openshift-monitoring", "resourceVersion":"6618", "uid":"f65e024e-baf4-4eb6-acce-13b913bcc13a"}, "spec":map[string]interface {}{"clusterIP":"None", "clusterIPs":[]interface {}{"None"}, "internalTrafficPolicy":"Cluster", "ipFamilies":[]interface {}{"IPv4"}, "ipFamilyPolicy":"SingleStack", "ports":[]interface {}{map[string]interface {}{"name":"https", "port":8443, "protocol":"TCP", "targetPort":"https"}}, "selector":map[string]interface {}{"app":"cluster-monitoring-operator"}, "sessionAffinity":"None", "type":"ClusterIP"}, "status":map[string]interface {}{"loadBalancer":map[string]interface {}{}}}}
# 2022-04-18T15:55:27+08:00 Sync 6618	/bin/sh -c "exit 2" openshift-monitoring cluster-monitoring-operator ""
I0418 15:55:27.046210 2404130 metric.go:86] retrying command: exit status 2
I0418 15:55:27.048102 2404130 metric.go:86] retrying command: exit status 2
error: command "/bin/sh" exited with status code 2
error: reached maximum error limit of 1, exiting

STEP: argument templates
Apr 18 15:55:27.063: INFO: Running 'oc --kubeconfig=/root/.kube/config observe services --once --all-namespaces --template='{ .spec.clusterIP }''
[AfterEach] [sig-cli] oc observe
  github.com/openshift/origin/test/extended/util/client.go:140
[AfterEach] [sig-cli] oc observe
  github.com/openshift/origin/test/extended/util/client.go:141
fail [github.com/openshift/origin/test/extended/cli/observe.go:71]: Expected
    <string>: # 2022-04-18T15:55:27+08:00 Sync started
    # 2022-04-18T15:55:27+08:00 Sync 5552	"" openshift-cloud-credential-operator cco-metrics '21.101.210.238'
    # 2022-04-18T15:55:27+08:00 Sync 5990	"" openshift-cluster-node-tuning-operator node-tuning-operator 'None'
    # 2022-04-18T15:55:27+08:00 Sync 6689	"" openshift-monitoring telemeter-client 'None'
    # 2022-04-18T15:55:27+08:00 Sync 6335	"" openshift-controller-manager controller-manager '21.101.165.77'
    # 2022-04-18T15:55:27+08:00 Sync 5805	"" openshift-kube-storage-version-migrator-operator metrics '21.101.130.80'
    # 2022-04-18T15:55:27+08:00 Sync 8367	"" openshift-operator-lifecycle-manager packageserver-service '21.101.111.75'
    # 2022-04-18T15:55:27+08:00 Sync 1492699	"" e2e-test-router-metrics-qmgwc weightedendpoints1 '21.101.59.216'
    # 2022-04-18T15:55:27+08:00 Sync 17109	"" openshift-monitoring grafana '21.101.251.250'
    # 2022-04-18T15:55:27+08:00 Sync 2850	"" openshift-network-diagnostics network-check-target '21.101.31.159'
    # 2022-04-18T15:55:27+08:00 Sync 1494394	"" e2e-test-oauth-ldap-idp-d7xm7 openldap-server '21.101.254.6'
    # 2022-04-18T15:55:27+08:00 Sync 6898	"" openshift-dns dns-default '21.101.0.10'
    # 2022-04-18T15:55:27+08:00 Sync 5646	"" openshift-kube-apiserver apiserver '21.101.39.158'
    # 2022-04-18T15:55:27+08:00 Sync 7143	"" openshift-marketplace redhat-marketplace '21.101.43.120'
    # 2022-04-18T15:55:27+08:00 Sync 17111	"" openshift-monitoring prometheus-k8s '21.101.250.115'
    # 2022-04-18T15:55:27+08:00 Sync 10831	"" openshift-cluster-samples-operator metrics 'None'
    # 2022-04-18T15:55:27+08:00 Sync 5770	"" openshift-cluster-storage-operator csi-snapshot-webhook '21.101.160.86'
    # 2022-04-18T15:55:27+08:00 Sync 7410	"" openshift-ingress router-internal-default '21.101.79.225'
    # 2022-04-18T15:55:27+08:00 Sync 5896	"" openshift-machine-api machine-api-operator-webhook '21.101.7.142'
    # 2022-04-18T15:55:27+08:00 Sync 6583	"" openshift-monitoring node-exporter 'None'
    # 2022-04-18T15:55:27+08:00 Sync 1489298	"" e2e-test-weighted-router-m87kt weightedendpoints2 '21.101.32.147'
    # 2022-04-18T15:55:27+08:00 Sync 6101	"" openshift-kube-scheduler scheduler '21.101.80.179'
    # 2022-04-18T15:55:27+08:00 Sync 6353	"" openshift-apiserver api '21.101.180.217'
    # 2022-04-18T15:55:27+08:00 Sync 5921	"" openshift-ovn-kubernetes ovn-kubernetes-master 'None'
    # 2022-04-18T15:55:27+08:00 Sync 7179	"" openshift-marketplace certified-operators '21.101.85.127'
    # 2022-04-18T15:55:27+08:00 Sync 17107	"" openshift-monitoring alertmanager-main '21.101.130.77'
    # 2022-04-18T15:55:27+08:00 Sync 6369	"" openshift-authentication oauth-openshift '21.101.168.81'
    # 2022-04-18T15:55:27+08:00 Sync 5912	"" openshift-machine-api cluster-baremetal-webhook-service '21.101.183.123'
    # 2022-04-18T15:55:27+08:00 Sync 5832	"" openshift-etcd-operator metrics '21.101.215.223'
    # 2022-04-18T15:55:27+08:00 Sync 1492941	"" e2e-test-router-idling-4nfzg idle-test '21.101.53.198'
    # 2022-04-18T15:55:27+08:00 Sync 2827	"" openshift-network-diagnostics network-check-source 'None'
    # 2022-04-18T15:55:27+08:00 Sync 6872	"" default openshift ''
    # 2022-04-18T15:55:27+08:00 Sync 5111	"" openshift-apiserver check-endpoints '21.101.247.68'
    # 2022-04-18T15:55:27+08:00 Sync 7149	"" openshift-marketplace redhat-operators '21.101.88.107'
    # 2022-04-18T15:55:27+08:00 Sync 6223	"" openshift-config-operator metrics '21.101.67.58'
    # 2022-04-18T15:55:27+08:00 Sync 6609	"" openshift-monitoring openshift-state-metrics 'None'
    # 2022-04-18T15:55:27+08:00 Sync 1487733	"" e2e-test-build-service-xdfn4 hello-nodejs '21.101.86.134'
    # 2022-04-18T15:55:27+08:00 Sync 19815	"" openshift-console console '21.101.75.225'
    # 2022-04-18T15:55:27+08:00 Sync 7206	"" openshift-marketplace community-operators '21.101.74.27'
    # 2022-04-18T15:55:27+08:00 Sync 5517	"" openshift-multus network-metrics-service 'None'
    # 2022-04-18T15:55:27+08:00 Sync 57687	"" openshift-image-registry image-registry '21.101.207.123'
    # 2022-04-18T15:55:27+08:00 Sync 5904	"" openshift-image-registry image-registry-operator 'None'
    # 2022-04-18T15:55:27+08:00 Sync 1489292	"" e2e-test-weighted-router-m87kt weightedendpoints1 '21.101.157.148'
    # 2022-04-18T15:55:27+08:00 Sync 6324	"" openshift-multus multus-admission-controller '21.101.213.142'
    # 2022-04-18T15:55:27+08:00 Sync 1487005	"" e2e-test-router-scoped-drhjv endpoints '21.101.29.41'
    # 2022-04-18T15:55:27+08:00 Sync 6311	"" openshift-operator-lifecycle-manager olm-operator-metrics '21.101.89.234'
    # 2022-04-18T15:55:27+08:00 Sync 5705	"" openshift-apiserver-operator metrics '21.101.194.87'
    # 2022-04-18T15:55:27+08:00 Sync 6269	"" openshift-authentication-operator metrics '21.101.3.139'
    # 2022-04-18T15:55:27+08:00 Sync 6606	"" openshift-monitoring thanos-querier '21.101.253.65'
    # 2022-04-18T15:55:27+08:00 Sync 6242	"" openshift-cluster-storage-operator cluster-storage-operator-metrics '21.101.98.33'
    # 2022-04-18T15:55:27+08:00 Sync 5513	"" openshift-insights metrics '21.101.221.77'
    # 2022-04-18T15:55:27+08:00 Sync 6374	"" openshift-oauth-apiserver api '21.101.172.209'
    # 2022-04-18T15:55:27+08:00 Sync 5823	"" openshift-service-ca-operator metrics '21.101.210.7'
    # 2022-04-18T15:55:27+08:00 Sync 1493134	"" e2e-test-oauth-server-headers-psnwf test-oauth-svc '21.101.194.196'
    # 2022-04-18T15:55:27+08:00 Sync 5797	"" openshift-machine-api machine-api-operator '21.101.123.100'
    # 2022-04-18T15:55:27+08:00 Sync 18058	"" openshift-monitoring alertmanager-operated 'None'
    # 2022-04-18T15:55:27+08:00 Sync 5525	"" openshift-marketplace marketplace-operator-metrics '21.101.223.158'
    # 2022-04-18T15:55:27+08:00 Sync 5690	"" openshift-monitoring prometheus-operator 'None'
    # 2022-04-18T15:55:27+08:00 Sync 6086	"" openshift-operator-lifecycle-manager catalog-operator-metrics '21.101.175.1'
    # 2022-04-18T15:55:27+08:00 Sync 6111	"" openshift-cluster-version cluster-version-operator '21.101.214.70'
    # 2022-04-18T15:55:27+08:00 Sync 6304	"" openshift-controller-manager-operator metrics '21.101.130.66'
    # 2022-04-18T15:55:27+08:00 Sync 5983	"" openshift-etcd etcd '21.101.131.158'
    # 2022-04-18T15:55:27+08:00 Sync 6260	"" openshift-machine-api cluster-autoscaler-operator '21.101.90.14'
    # 2022-04-18T15:55:27+08:00 Sync 5569	"" openshift-kube-apiserver-operator metrics '21.101.69.58'
    # 2022-04-18T15:55:27+08:00 Sync 19724	"" openshift-console-operator metrics '21.101.236.94'
    # 2022-04-18T15:55:27+08:00 Sync 17104	"" openshift-monitoring prometheus-k8s-thanos-sidecar 'None'
    # 2022-04-18T15:55:27+08:00 Sync 5878	"" openshift-ingress-operator metrics '21.101.54.233'
    # 2022-04-18T15:55:27+08:00 Sync 6206	"" openshift-kube-controller-manager-operator metrics '21.101.95.175'
    # 2022-04-18T15:55:27+08:00 Sync 6566	"" openshift-monitoring kube-state-metrics 'None'
    # 2022-04-18T15:55:27+08:00 Sync 231	"" default kubernetes '21.101.0.1'
    # 2022-04-18T15:55:27+08:00 Sync 5522	"" openshift-machine-config-operator machine-config-daemon '21.101.114.149'
    # 2022-04-18T15:55:27+08:00 Sync 1492703	"" e2e-test-router-metrics-qmgwc weightedendpoints2 '21.101.119.199'
    # 2022-04-18T15:55:27+08:00 Sync 6016	"" openshift-dns-operator metrics '21.101.105.34'
    # 2022-04-18T15:55:27+08:00 Sync 6618	"" openshift-monitoring cluster-monitoring-operator 'None'
    # 2022-04-18T15:55:27+08:00 Sync 1487683	"" e2e-test-unprivileged-router-w5lst endpoints '21.101.132.106'
    # 2022-04-18T15:55:27+08:00 Sync 6076	"" openshift-machine-api cluster-baremetal-operator-service '21.101.167.114'
    # 2022-04-18T15:55:27+08:00 Sync 6001	"" openshift-kube-scheduler-operator metrics '21.101.154.246'
    # 2022-04-18T15:55:27+08:00 Sync 5965	"" openshift-kube-controller-manager kube-controller-manager '21.101.205.32'
    # 2022-04-18T15:55:27+08:00 Sync 5715	"" openshift-cluster-machine-approver machine-approver 'None'
    # 2022-04-18T15:55:27+08:00 Sync 6680	"" openshift-monitoring prometheus-adapter '21.101.129.31'
    # 2022-04-18T15:55:27+08:00 Sync 6047	"" kube-system kubelet 'None'
    # 2022-04-18T15:55:27+08:00 Sync 18011	"" openshift-monitoring prometheus-operated 'None'
    # 2022-04-18T15:55:27+08:00 Sync 2696	"" openshift-ovn-kubernetes ovnkube-db 'None'
    # 2022-04-18T15:55:27+08:00 Sync 165547	"" e2e-statefulset-5018 test 'None'
    # 2022-04-18T15:55:27+08:00 Sync 15564	"" openshift-ingress-canary ingress-canary '21.101.240.32'
    # 2022-04-18T15:55:27+08:00 Sync 6095	"" openshift-ovn-kubernetes ovn-kubernetes-node 'None'
    # 2022-04-18T15:55:27+08:00 Sync 5681	"" openshift-cluster-storage-operator csi-snapshot-controller-operator-metrics '21.101.55.254'
    # 2022-04-18T15:55:27+08:00 Sync 5670	"" openshift-machine-api machine-api-controllers '21.101.33.96'
    # 2022-04-18T15:55:27+08:00 Sync 19835	"" openshift-console downloads '21.101.153.77'
    # 2022-04-18T15:55:27+08:00 Sync ended
To satisfy at least one of these matchers: [%!s(*matchers.ContainSubstringMatcher=&{172.30.0.1 []}) %!s(*matchers.ContainSubstringMatcher=&{fd02::1 []})]

failed: (11.3s) 2022-04-18T07:55:27 "[sig-cli] oc observe works as expected [Suite:openshift/conformance/parallel]"
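
The matcher list at the end of the failure is the tell: the test expects the kubernetes service ClusterIP to be 172.30.0.1 (OpenShift's default service network) or fd02::1, while on this cluster the default/kubernetes service resolves to 21.101.0.1 (see the Sync line above), so with a non-default service CIDR the assertion can never pass. A minimal sketch to confirm the configured service network:

oc get network.config/cluster -o jsonpath='{.spec.serviceNetwork}{"\n"}'
oc -n default get svc kubernetes -o jsonpath='{.spec.clusterIP}{"\n"}'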

I0418 15:55:27.048102 2404130 metric.go:86] retrying command: exit status 2
error: command "/bin/sh" exited with status code 2
error: reached maximum error limit of 1, exiting

[It] works as expected [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/cli/observe.go:17
STEP: basic scenarios
Apr 18 15:55:17.116: INFO: Running 'oc --kubeconfig=/root/.kube/config observe'
Apr 18 15:55:17.219: INFO: Error running /usr/bin/oc --kubeconfig=/root/.kube/config observe:
StdOut>
error: you must specify at least one argument containing the resource to observe
StdErr>
error: you must specify at least one argument containing the resource to observe

Problem analysis:

github.com/openshift/origin/test/extended/cli/observe.go:17

var _ = g.Describe("[sig-cli] oc observe", func() {
	defer g.GinkgoRecover()

	oc := exutil.NewCLIWithoutNamespace("oc-observe").AsAdmin()

	g.It("works as expected", func() {
		g.By("Find out the clusterIP of the kubernetes.default service")
		kubernetesSVC, err := oc.AdminKubeClient().CoreV1().Services("default").Get(context.Background(), "kubernetes", metav1.GetOptions{})
		o.Expect(err).NotTo(o.HaveOccurred())
		g.By("basic scenarios")
		out, err := oc.Run("observe").Output()
		o.Expect(err).To(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring("you must specify at least one argument containing the resource to observe"))

		out, err = oc.Run("observe").Args("serviceaccounts", "--once").Output()
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(out).To(o.Or(o.ContainSubstring("Sync ended"), o.ContainSubstring("Nothing to sync")))

		out, err = oc.Run("observe").Args("daemonsets", "--once").Output()
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(out).To(o.Or(o.ContainSubstring("Sync ended"), o.ContainSubstring("Nothing to sync, exiting immediately")))

		out, err = oc.Run("observe").Args("clusteroperators", "--once").Output()
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring("kube-apiserver"))

		out, err = oc.Run("observe").Args("services", "--once", "--all-namespaces").Output()
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring("default kubernetes"))

		out, err = oc.Run("observe").Args("services", "--once", "--all-namespaces", "--print-metrics-on-exit").Output()
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring(`observe_counts{type="Sync"}`))

		out, err = oc.Run("observe").Args("services", "--once", "--names", "echo").Output()
		o.Expect(err).To(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring("--delete and --names must both be specified"))

		out, err = oc.Run("observe").Args("services", "--exit-after=1s").Output()
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring("Shutting down after 1s ..."))

		out, err = oc.Run("observe").Args("services", "--exit-after=3s", "--all-namespaces", "--print-metrics-on-exit").Output()
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring(`observe_counts{type="Sync"}`))

		out, err = oc.Run("observe").Args("services", "--exit-after=3s", "--all-namespaces", "--names", "echo", "--names", "default/notfound", "--delete", "echo", "--delete", "remove").Output()
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring("remove default notfound"))

		g.By("error counting")
		out, err = oc.Run("observe").Args("services", "--exit-after=1m", "--all-namespaces", "--maximum-errors=1", "--", "/bin/sh", "-c", "exit 1").Output()
		o.Expect(err).To(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring("reached maximum error limit of 1, exiting"))

		out, err = oc.Run("observe").Args("services", "--exit-after=1m", "--all-namespaces", "--retry-on-exit-code=2", "--maximum-errors=1", "--loglevel=4", "--", "/bin/sh", "-c", "exit 2").Output()
		o.Expect(err).To(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring("retrying command: exit status 2"))

		g.By("argument templates")
		out, err = oc.Run("observe").Args("services", "--once", "--all-namespaces", "--template='{ .spec.clusterIP }'").Output()
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(out).To(o.Or(o.ContainSubstring(kubernetesSVC.Spec.ClusterIP), o.ContainSubstring("fd02::1")))

		out, err = oc.Run("observe").Args("services", "--once", "--all-namespaces", "--template='{{ .spec.clusterIP }}'", "--output=go-template").Output()
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(out).To(o.Or(o.ContainSubstring(kubernetesSVC.Spec.ClusterIP), o.ContainSubstring("fd02::1")))

		out, err = oc.Run("observe").Args("services", "--once", "--all-namespaces", "--template='bad{ .missingkey }key'").Output()
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring("badkey"))

		out, err = oc.Run("observe").Args("services", "--once", "--all-namespaces", "--template='bad{ .missingkey }key'", "--allow-missing-template-keys=false").Output()
		o.Expect(err).To(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring("missingkey is not found"))

		out, err = oc.Run("observe").Args("services", "--once", "--all-namespaces", "--template='{{ .unknown }}'", "--output=go-template").Output()
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring("default kubernetes"))

		out, err = oc.Run("observe").Args("services", "--once", "--all-namespaces", `--template='bad{{ or (.unknown) "" }}key'`, "--output=go-template").Output()
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring("badkey"))

		out, err = oc.Run("observe").Args("services", "--once", "--all-namespaces", "--template='bad{{ .unknown }}key'", "--output=go-template", "--allow-missing-template-keys=false").Output()
		o.Expect(err).To(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring("map has no entry for key"))

		g.By("event environment variables")
		o.Expect(os.Setenv("MYENV", "should_be_passed")).NotTo(o.HaveOccurred())
		out, err = oc.Run("observe").Args("services", "--once", "--all-namespaces", "--type-env-var=EVENT", "--", "/bin/sh", "-c", "echo $EVENT $MYENV").Output()
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(out).To(o.ContainSubstring("Sync should_be_passed"))
		o.Expect(os.Unsetenv("MYENV")).NotTo(o.HaveOccurred())
	})
})

// NewCLIWithoutNamespace initializes the CLI and Kube framework helpers
// without a namespace. Should be called outside of a Ginkgo .It()
// function. Use SetupProject() to create a project for this namespace.
func NewCLIWithoutNamespace(project string) *CLI {
	cli := &CLI{
		kubeFramework: &framework.Framework{
			SkipNamespaceCreation:    true,
			BaseName:                 project,
			AddonResourceConstraints: make(map[string]framework.ResourceConstraint),
			Options: framework.Options{
				ClientQPS:   20,
				ClientBurst: 50,
			},
			Timeouts: framework.NewTimeoutContextWithDefaults(),
		},
		username:         "admin",
		execPath:         "oc",
		adminConfigPath:  KubeConfigPath(),
		withoutNamespace: true,
	}
	g.AfterEach(cli.TeardownProject)
	g.AfterEach(cli.kubeFramework.AfterEach)
	g.BeforeEach(cli.kubeFramework.BeforeEach)
	return cli
}

The specific command that fails:

/usr/bin/oc --kubeconfig=/root/.kube/config observe services --exit-after=1m --all-namespaces --maximum-errors=1 -- /bin/sh -c exit 1

/bin/sh -c "exit 1" openshift-kube-controller-manager kube-controller-manager ""
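
What the test expects from this command: each synced object runs /bin/sh -c "exit 1", every non-zero exit counts as an error, and with --maximum-errors=1 the process should abort after the first failure with "reached maximum error limit of 1, exiting" (the assertion under the "error counting" step above) instead of running for the full minute. The sketch below is a simplified, assumption-based model of that counting logic, inferred only from the flags and the log messages shown above; it is not the real observe.go implementation.

package main

import "fmt"

// runCommand stands in for one invocation of "/bin/sh -c exit 1" against a
// synced object; in this scenario it always fails with exit status 1.
func runCommand() int { return 1 }

// observe models the error handling implied by --retry-on-exit-code and
// --maximum-errors: an exit code equal to retryOnExitCode is retried once
// before it counts, and the loop aborts once the error count reaches
// maximumErrors.
func observe(objects, retryOnExitCode, maximumErrors int) error {
	errorCount := 0
	for i := 0; i < objects; i++ {
		code := runCommand()
		if code == retryOnExitCode {
			fmt.Printf("retrying command: exit status %d\n", code)
			code = runCommand()
		}
		if code == 0 {
			continue
		}
		errorCount++
		fmt.Printf("error: command \"/bin/sh\" exited with status code %d\n", code)
		if errorCount >= maximumErrors {
			return fmt.Errorf("reached maximum error limit of %d, exiting", maximumErrors)
		}
	}
	return nil
}

func main() {
	// The reproduction below syncs on the order of 90 services; with
	// --maximum-errors=1 the loop should stop at the very first failure.
	if err := observe(90, 2, 1); err != nil {
		fmt.Println(err)
	}
}

Under this model the run ends after a single Sync; the timed reproduction below shows the cluster behaving differently.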

The following command takes the full --exit-after minute to complete; the last line logged before shutdown is an ordinary Sync event:

# 2022-04-18T19:49:23+08:00 Sync 10831	/bin/sh -c exit 1 openshift-cluster-samples-operator metrics ""

[root@master0 zsl]# time   /usr/bin/oc --kubeconfig=/root/.kube/config observe services --exit-after=1m --all-namespaces --maximum-errors=1 -- /bin/sh -c exit 1
# 2022-04-18T19:49:23+08:00 Sync started
# 2022-04-18T19:49:23+08:00 Sync 6016	/bin/sh -c exit 1 openshift-dns-operator metrics ""
# 2022-04-18T19:49:23+08:00 Sync 5522	/bin/sh -c exit 1 openshift-machine-config-operator machine-config-daemon ""
# 2022-04-18T19:49:23+08:00 Sync 6374	/bin/sh -c exit 1 openshift-oauth-apiserver api ""
# 2022-04-18T19:49:23+08:00 Sync 5111	/bin/sh -c exit 1 openshift-apiserver check-endpoints ""
# 2022-04-18T19:49:23+08:00 Sync 1676541	/bin/sh -c exit 1 e2e-test-htpasswd-idp-fnfqp test-oauth-svc ""
# 2022-04-18T19:49:23+08:00 Sync 15564	/bin/sh -c exit 1 openshift-ingress-canary ingress-canary ""
# 2022-04-18T19:49:23+08:00 Sync 5805	/bin/sh -c exit 1 openshift-kube-storage-version-migrator-operator metrics ""
# 2022-04-18T19:49:23+08:00 Sync 7206	/bin/sh -c exit 1 openshift-marketplace community-operators ""
# 2022-04-18T19:49:23+08:00 Sync 2850	/bin/sh -c exit 1 openshift-network-diagnostics network-check-target ""
# 2022-04-18T19:49:23+08:00 Sync 5670	/bin/sh -c exit 1 openshift-machine-api machine-api-controllers ""
# 2022-04-18T19:49:23+08:00 Sync 5797	/bin/sh -c exit 1 openshift-machine-api machine-api-operator ""
# 2022-04-18T19:49:23+08:00 Sync 6618	/bin/sh -c exit 1 openshift-monitoring cluster-monitoring-operator ""
# 2022-04-18T19:49:23+08:00 Sync 5770	/bin/sh -c exit 1 openshift-cluster-storage-operator csi-snapshot-webhook ""
# 2022-04-18T19:49:23+08:00 Sync 6680	/bin/sh -c exit 1 openshift-monitoring prometheus-adapter ""
# 2022-04-18T19:49:23+08:00 Sync 6242	/bin/sh -c exit 1 openshift-cluster-storage-operator cluster-storage-operator-metrics ""
# 2022-04-18T19:49:23+08:00 Sync 6206	/bin/sh -c exit 1 openshift-kube-controller-manager-operator metrics ""
# 2022-04-18T19:49:23+08:00 Sync 6606	/bin/sh -c exit 1 openshift-monitoring thanos-querier ""
# 2022-04-18T19:49:23+08:00 Sync 1668667	/bin/sh -c exit 1 e2e-test-build-service-6sk4h hello-nodejs ""
# 2022-04-18T19:49:23+08:00 Sync 5705	/bin/sh -c exit 1 openshift-apiserver-operator metrics ""
# 2022-04-18T19:49:23+08:00 Sync 6369	/bin/sh -c exit 1 openshift-authentication oauth-openshift ""
# 2022-04-18T19:49:23+08:00 Sync 5904	/bin/sh -c exit 1 openshift-image-registry image-registry-operator ""
# 2022-04-18T19:49:23+08:00 Sync 8367	/bin/sh -c exit 1 openshift-operator-lifecycle-manager packageserver-service ""
# 2022-04-18T19:49:23+08:00 Sync 6311	/bin/sh -c exit 1 openshift-operator-lifecycle-manager olm-operator-metrics ""
# 2022-04-18T19:49:23+08:00 Sync 6304	/bin/sh -c exit 1 openshift-controller-manager-operator metrics ""
# 2022-04-18T19:49:23+08:00 Sync 6101	/bin/sh -c exit 1 openshift-kube-scheduler scheduler ""
# 2022-04-18T19:49:23+08:00 Sync 6689	/bin/sh -c exit 1 openshift-monitoring telemeter-client ""
# 2022-04-18T19:49:23+08:00 Sync 6324	/bin/sh -c exit 1 openshift-multus multus-admission-controller ""
# 2022-04-18T19:49:23+08:00 Sync 6086	/bin/sh -c exit 1 openshift-operator-lifecycle-manager catalog-operator-metrics ""
# 2022-04-18T19:49:23+08:00 Sync 165547	/bin/sh -c exit 1 e2e-statefulset-5018 test ""
# 2022-04-18T19:49:23+08:00 Sync 6095	/bin/sh -c exit 1 openshift-ovn-kubernetes ovn-kubernetes-node ""
# 2022-04-18T19:49:23+08:00 Sync 1672696	/bin/sh -c exit 1 e2e-test-oauth-server-headers-xvj6k test-oauth-svc ""
# 2022-04-18T19:49:23+08:00 Sync 6223	/bin/sh -c exit 1 openshift-config-operator metrics ""
# 2022-04-18T19:49:23+08:00 Sync 1672264	/bin/sh -c exit 1 e2e-test-router-scoped-b67q2 endpoints ""
# 2022-04-18T19:49:23+08:00 Sync 1673608	/bin/sh -c exit 1 e2e-test-router-scoped-ns2x9 endpoints ""
# 2022-04-18T19:49:23+08:00 Sync 7149	/bin/sh -c exit 1 openshift-marketplace redhat-operators ""
# 2022-04-18T19:49:23+08:00 Sync 17111	/bin/sh -c exit 1 openshift-monitoring prometheus-k8s ""
# 2022-04-18T19:49:23+08:00 Sync 231	/bin/sh -c exit 1 default kubernetes ""
# 2022-04-18T19:49:23+08:00 Sync 5990	/bin/sh -c exit 1 openshift-cluster-node-tuning-operator node-tuning-operator ""
# 2022-04-18T19:49:23+08:00 Sync 18058	/bin/sh -c exit 1 openshift-monitoring alertmanager-operated ""
# 2022-04-18T19:49:23+08:00 Sync 6111	/bin/sh -c exit 1 openshift-cluster-version cluster-version-operator ""
# 2022-04-18T19:49:23+08:00 Sync 17104	/bin/sh -c exit 1 openshift-monitoring prometheus-k8s-thanos-sidecar ""
# 2022-04-18T19:49:23+08:00 Sync 5681	/bin/sh -c exit 1 openshift-cluster-storage-operator csi-snapshot-controller-operator-metrics ""
# 2022-04-18T19:49:23+08:00 Sync 6609	/bin/sh -c exit 1 openshift-monitoring openshift-state-metrics ""
# 2022-04-18T19:49:23+08:00 Sync 18011	/bin/sh -c exit 1 openshift-monitoring prometheus-operated ""
# 2022-04-18T19:49:23+08:00 Sync 1679606	/bin/sh -c exit 1 e2e-test-cli-idling-rwmgr idling-echo ""
# 2022-04-18T19:49:23+08:00 Sync 5983	/bin/sh -c exit 1 openshift-etcd etcd ""
# 2022-04-18T19:49:23+08:00 Sync 5690	/bin/sh -c exit 1 openshift-monitoring prometheus-operator ""
# 2022-04-18T19:49:23+08:00 Sync 1675290	/bin/sh -c exit 1 e2e-net-services1-3818 service-m62z8 ""
# 2022-04-18T19:49:23+08:00 Sync 1677803	/bin/sh -c exit 1 e2e-test-router-idling-97smp idle-test ""
# 2022-04-18T19:49:23+08:00 Sync 5878	/bin/sh -c exit 1 openshift-ingress-operator metrics ""
# 2022-04-18T19:49:23+08:00 Sync 5513	/bin/sh -c exit 1 openshift-insights metrics ""
# 2022-04-18T19:49:23+08:00 Sync 1675143	/bin/sh -c exit 1 e2e-test-router-scoped-2rq7v endpoints ""
# 2022-04-18T19:49:23+08:00 Sync 1678652	/bin/sh -c exit 1 e2e-test-oauth-server-headers-4lr8f test-oauth-svc ""
# 2022-04-18T19:49:23+08:00 Sync 5896	/bin/sh -c exit 1 openshift-machine-api machine-api-operator-webhook ""
# 2022-04-18T19:49:23+08:00 Sync 19724	/bin/sh -c exit 1 openshift-console-operator metrics ""
# 2022-04-18T19:49:23+08:00 Sync 57687	/bin/sh -c exit 1 openshift-image-registry image-registry ""
# 2022-04-18T19:49:23+08:00 Sync 5715	/bin/sh -c exit 1 openshift-cluster-machine-approver machine-approver ""
# 2022-04-18T19:49:23+08:00 Sync 6898	/bin/sh -c exit 1 openshift-dns dns-default ""
# 2022-04-18T19:49:23+08:00 Sync 5912	/bin/sh -c exit 1 openshift-machine-api cluster-baremetal-webhook-service ""
# 2022-04-18T19:49:23+08:00 Sync 6566	/bin/sh -c exit 1 openshift-monitoring kube-state-metrics ""
# 2022-04-18T19:49:23+08:00 Sync 5569	/bin/sh -c exit 1 openshift-kube-apiserver-operator metrics ""
# 2022-04-18T19:49:23+08:00 Sync 6260	/bin/sh -c exit 1 openshift-machine-api cluster-autoscaler-operator ""
# 2022-04-18T19:49:23+08:00 Sync 7143	/bin/sh -c exit 1 openshift-marketplace redhat-marketplace ""
# 2022-04-18T19:49:23+08:00 Sync 6583	/bin/sh -c exit 1 openshift-monitoring node-exporter ""
# 2022-04-18T19:49:23+08:00 Sync 19815	/bin/sh -c exit 1 openshift-console console ""
# 2022-04-18T19:49:23+08:00 Sync 6335	/bin/sh -c exit 1 openshift-controller-manager controller-manager ""
# 2022-04-18T19:49:23+08:00 Sync 5832	/bin/sh -c exit 1 openshift-etcd-operator metrics ""
# 2022-04-18T19:49:23+08:00 Sync 6001	/bin/sh -c exit 1 openshift-kube-scheduler-operator metrics ""
# 2022-04-18T19:49:23+08:00 Sync 6353	/bin/sh -c exit 1 openshift-apiserver api ""
# 2022-04-18T19:49:23+08:00 Sync 2696	/bin/sh -c exit 1 openshift-ovn-kubernetes ovnkube-db ""
# 2022-04-18T19:49:23+08:00 Sync 5552	/bin/sh -c exit 1 openshift-cloud-credential-operator cco-metrics ""
# 2022-04-18T19:49:23+08:00 Sync 5525	/bin/sh -c exit 1 openshift-marketplace marketplace-operator-metrics ""
# 2022-04-18T19:49:23+08:00 Sync 5646	/bin/sh -c exit 1 openshift-kube-apiserver apiserver ""
# 2022-04-18T19:49:23+08:00 Sync 6076	/bin/sh -c exit 1 openshift-machine-api cluster-baremetal-operator-service ""
# 2022-04-18T19:49:23+08:00 Sync 5921	/bin/sh -c exit 1 openshift-ovn-kubernetes ovn-kubernetes-master ""
# 2022-04-18T19:49:23+08:00 Sync 17107	/bin/sh -c exit 1 openshift-monitoring alertmanager-main ""
# 2022-04-18T19:49:23+08:00 Sync 1679455	/bin/sh -c exit 1 e2e-test-oauth-server-headers-nvhcd test-oauth-svc ""
# 2022-04-18T19:49:23+08:00 Sync 5823	/bin/sh -c exit 1 openshift-service-ca-operator metrics ""
# 2022-04-18T19:49:23+08:00 Sync 1679526	/bin/sh -c exit 1 e2e-test-new-app-bqgn5 a234567890123456789012345678901234567890123456789012345678 ""
# 2022-04-18T19:49:23+08:00 Sync 6047	/bin/sh -c exit 1 kube-system kubelet ""
# 2022-04-18T19:49:23+08:00 Sync 19835	/bin/sh -c exit 1 openshift-console downloads ""
# 2022-04-18T19:49:23+08:00 Sync 7410	/bin/sh -c exit 1 openshift-ingress router-internal-default ""
# 2022-04-18T19:49:23+08:00 Sync 5965	/bin/sh -c exit 1 openshift-kube-controller-manager kube-controller-manager ""
# 2022-04-18T19:49:23+08:00 Sync 1676035	/bin/sh -c exit 1 e2e-test-oauth-ldap-idp-pvm4z test-oauth-svc ""
# 2022-04-18T19:49:23+08:00 Sync 1668918	/bin/sh -c exit 1 e2e-test-oauth-ldap-idp-pvm4z openldap-server ""
# 2022-04-18T19:49:23+08:00 Sync 1678095	/bin/sh -c exit 1 e2e-test-oauth-expiration-vlms9 test-oauth-svc ""
# 2022-04-18T19:49:23+08:00 Sync 6872	/bin/sh -c exit 1 default openshift ""
# 2022-04-18T19:49:23+08:00 Sync 6269	/bin/sh -c exit 1 openshift-authentication-operator metrics ""
# 2022-04-18T19:49:23+08:00 Sync 7179	/bin/sh -c exit 1 openshift-marketplace certified-operators ""
# 2022-04-18T19:49:23+08:00 Sync 17109	/bin/sh -c exit 1 openshift-monitoring grafana ""
# 2022-04-18T19:49:23+08:00 Sync 5517	/bin/sh -c exit 1 openshift-multus network-metrics-service ""
# 2022-04-18T19:49:23+08:00 Sync 1678699	/bin/sh -c exit 1 e2e-test-oauth-server-headers-cb6ws test-oauth-svc ""
# 2022-04-18T19:49:23+08:00 Sync 2827	/bin/sh -c exit 1 openshift-network-diagnostics network-check-source ""
# 2022-04-18T19:49:23+08:00 Sync 10831	/bin/sh -c exit 1 openshift-cluster-samples-operator metrics ""
Shutting down after 1m0s ...

real	1m0.067s
user	0m0.174s
sys	0m0.099s
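
In this run the error limit is never reached: every service is logged as a plain Sync, no "error: command exited with status code 1" line appears, and the process only stops when --exit-after expires ("Shutting down after 1m0s ..."). That suggests the non-zero exit status of the child command is not being treated as a failure here. A first sanity check, a standalone sketch that is not part of the test suite, is to confirm that /bin/sh -c "exit 1" really reports exit status 1 on the node through Go's os/exec, the same kind of call a Go binary such as oc would make:

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Run the same child command the test passes to oc observe and inspect
	// how its exit status is reported.
	err := exec.Command("/bin/sh", "-c", "exit 1").Run()
	if exitErr, ok := err.(*exec.ExitError); ok {
		fmt.Printf("/bin/sh -c \"exit 1\" exited with status %d\n", exitErr.ExitCode())
		return
	}
	fmt.Printf("unexpected result (no ExitError): %v\n", err)
}

If this prints exit status 1 as expected, the shell is not at fault and the next step is to look at how oc observe itself evaluates the command result in this environment.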


Problem resolution:
