
Building an Enterprise-Grade Monitoring and Alerting System with Kubernetes + Prometheus + Alertmanager + Grafana

Deploying Prometheus as a highly available cluster



1.1 A deep dive into Prometheus

1.1.1 What is Prometheus?

Prometheus is an open-source systems monitoring and alerting system. It has joined the CNCF, becoming the second project (after Kubernetes) hosted by the foundation. Kubernetes container-management systems are usually paired with Prometheus for monitoring. Prometheus supports many kinds of exporters for collecting metrics, supports pushing data through the Pushgateway, and performs well enough to monitor clusters of over ten thousand machines.

 

1.1.2 Prometheus features

1. A multi-dimensional data model

Time series are identified by a metric name plus key/value label pairs.

Data can be aggregated and sliced.

Every metric can carry arbitrary multi-dimensional labels.

2. A flexible query language (PromQL) that supports addition, multiplication, joins, and other operations on the collected metrics (see the PromQL sketch after this list).

3. Runs standalone on local storage, with no dependency on external distributed storage.

4. Collects time-series data via an HTTP pull model.

5. Time series can also be pushed to the Prometheus server through an intermediary gateway, the Pushgateway.

6. Targets are discovered via service discovery or static configuration.

7. Works with multiple visualization front ends, such as Grafana.

8. Efficient storage: each sample takes roughly 3.5 bytes; 3 million time series at a 30 s scrape interval, retained for 60 days, consume about 200 GB of disk.

9. High availability: back up data off-site, build federated clusters, run multiple Prometheus servers, and report data through the Pushgateway.
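As a taste of PromQL (point 2 above), here are a few queries against metrics that appear later in this article; the job and metric names are the ones this lab configures:

# all scraped targets that are currently healthy
up == 1
# per-second CPU usage of one scrape job, summed across its instances
sum(rate(process_cpu_seconds_total{job="kubernetes-kube-proxy"}[5m]))
# arithmetic across metrics: percent of memory in use on each node
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes * 100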

 

1.1.3 Prometheus components

As the architecture diagram shows, the Prometheus ecosystem consists mainly of the Prometheus server, exporters, the Pushgateway, Alertmanager, Grafana, and the web UI. The Prometheus server itself is made of three parts: Retrieval, Storage, and PromQL.

1. Retrieval scrapes metrics data from active target hosts.

2. Storage writes the scraped data to disk.

3. PromQL is the query-language module Prometheus provides.

 

1. Prometheus Server:

collects and stores time-series data.

2. Client Library:

client libraries instrument application code; when Prometheus scrapes an instance's HTTP endpoint, the client library reports the current state of all tracked metrics to the Prometheus server.

3. Exporters:

Prometheus supports many exporters. An exporter collects metrics data and exposes it to the Prometheus server; in fact, any program that provides monitoring data to the Prometheus server can be called an exporter.

4. Alertmanager:

after receiving alerts from the Prometheus server, it deduplicates and groups them, routes them to the matching receiver, and sends the notification. Common receivers include email, WeChat, DingTalk, and Slack.

5. Grafana:

monitoring dashboards that visualize the collected data.

6. Pushgateway:

individual target hosts can push their data to the Pushgateway, and the Prometheus server then pulls everything from the Pushgateway in one place.
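For hosts or batch jobs that cannot be scraped directly, pushing works like this; a minimal sketch, assuming a Pushgateway reachable at the hypothetical address pushgateway.example.com:9091:

# push one sample under job "nightly_backup", instance "node1"
echo "backup_duration_seconds 42" | curl --data-binary @- http://pushgateway.example.com:9091/metrics/job/nightly_backup/instance/node1
# Prometheus then scrapes the Pushgateway like any other target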

1.1.4 Prometheus deployment patterns

Basic HA

The basic HA pattern only ensures the availability of the Prometheus service. It solves neither data consistency between the Prometheus servers nor persistence (lost data cannot be recovered), and it cannot scale dynamically. This pattern therefore suits small monitoring footprints where the Prometheus servers rarely migrate and only short-term monitoring data needs to be retained.

Basic HA + remote storage

On top of service availability, this pattern ensures data durability: if a Prometheus server crashes or loses data, it can recover quickly, and servers can be migrated easily. It suits deployments that are still small but need durable monitoring data and portable Prometheus servers.

Basic HA + remote storage + federation

Prometheus's performance bottleneck is mainly the volume of scrape work, so federation is used to split scrape jobs by function across Prometheus sub-servers. For example, one Prometheus server scrapes infrastructure metrics while another scrapes application metrics, and an upper-level Prometheus server aggregates the data from both.
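A sketch of what the aggregating job on the top-level Prometheus server might look like; the sub-server host names are placeholders:

scrape_configs:
- job_name: 'federate'
  honor_labels: true        # keep the labels assigned by the sub-servers
  metrics_path: '/federate'
  params:
    'match[]':              # only pull the series needed upstream
    - '{job="kubernetes-node"}'
    - '{__name__=~"job:.*"}'
  static_configs:
  - targets: ['prometheus-infra:9090', 'prometheus-apps:9090']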
1.1.5 Prometheus workflow

1. The Prometheus server periodically pulls metrics from active (up) targets. Targets become visible to the server either through statically configured jobs or through service discovery; this is the default pull model. Metrics can also be pushed to the Pushgateway for the server to collect, or exposed by the exporters that ship with various components.
2. The Prometheus server saves the scraped metrics to local disk or to a database.
3. The metrics are stored as time series. Alerting rules configured against them fire alerts, which are sent to Alertmanager.
4. Alertmanager routes alerts to the configured receivers: email, WeChat, DingTalk, and so on.
5. Prometheus's built-in web UI provides the PromQL query language for querying the monitoring data.
6. Grafana can use Prometheus as a data source and render the data graphically.

1.1.6 How can Prometheus monitor Kubernetes well?

For Kubernetes, all resources can be grouped into a few classes:
1. Infrastructure layer (Node): cluster nodes, which provide runtime resources for the whole cluster and its applications.
2. Container infrastructure (Container): provides the runtime environment for applications.
3. User applications (Pod): a Pod contains a group of containers that work together to provide one function (or a set of functions).
4. Internal load balancing (Service): exposes an application inside the cluster and load-balances traffic between applications.
5. External entry point (Ingress): provides the entry point from outside the cluster, so external clients can reach services deployed inside Kubernetes.

Therefore, leaving Kubernetes's own components aside, a complete monitoring system should cover these five areas:
1. Node status: pull each node's basic running state from its kubelet service.
2. Node resource usage: deploy Node Exporter as a DaemonSet so every node reports its resource consumption.
3. Containers on each node: obtain the running state and resource usage of all containers from the cAdvisor built into each node's kubelet.
4. Black-box monitoring: deploy a Blackbox Exporter probe in the cluster to check the availability of Services and Ingresses.
5. Application metrics: if an application deployed in the cluster has built-in Prometheus support, locate its Pods and scrape their internal metrics.

1.2 Install node-exporter, the node metrics collector

What is node-exporter?
It collects monitoring metrics from machines (physical, virtual, or cloud hosts), including CPU, memory, disk, network, open file counts, and more.

Install the node-exporter component. On the k8s control node:
[root@master1 ~]# kubectl create ns monitor-sa
namespace/monitor-sa created
Upload the node-exporter.tar.gz image archive from the courseware to every k8s node and load it manually:
docker load -i node-exporter.tar.gz
The node-export.yaml file is in the courseware; upload it to your k8s control node:
[root@master1 ~]# cat node-export.yaml
#Apply node-exporter with kubectl apply
[root@master1 ~]# kubectl apply -f node-export.yaml
daemonset.apps/node-exporter created
#Check whether node-exporter deployed successfully
[root@master1 ~]# kubectl get pods -n monitor-sa
NAME READY STATUS RESTARTS AGE
node-exporter-7cjhw 1/1 Running 0 22s
node-exporter-8m2fp 1/1 Running 0 22s
node-exporter-c6sdq 1/1 Running 0 22s

Pulling data from node-exporter:
curl http://<host-ip>:9100/metrics
#node-exporter listens on port 9100 by default; this returns every metric the host currently exposes
curl http://192.168.40.130:9100/metrics | grep node_cpu_seconds
shows the CPU usage of host 192.168.40.130:
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 72963.37
node_cpu_seconds_total{cpu="0",mode="iowait"} 9.35
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0
node_cpu_seconds_total{cpu="0",mode="softirq"} 151.4
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 656.12
node_cpu_seconds_total{cpu="0",mode="user"} 267.1
#HELP explains what the metric means: here, the seconds each CPU on the node spent in each mode.
#TYPE states the metric's data type: here, counter.
node_cpu_seconds_total{cpu="0",mode="idle"}:
total CPU time consumed by the idle mode on cpu0. CPU time spent is a metric that only ever increases, which matches the declared type: node_cpu is a counter.
counter: a metric that only ever increases
curl http://192.168.40.130:9100/metrics | grep node_load
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.1
node_load1 reflects the host's load over the last minute. System load rises and falls as resources are used, so node_load1 is a snapshot of the current state: its value can increase or decrease. The comment shows the metric type is gauge.
gauge: a metric whose value can increase or decrease
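Because a counter only ever rises, you almost always query its rate of change rather than its raw value. For example, per-node CPU utilisation can be derived from the idle counter shown above; the node alert rules in section 1.6 use this same expression:

# percent of CPU busy per instance over the last 5 minutes
100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100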
1.3 Install the Prometheus server in the k8s cluster

1.3.1 Create a service account
#On the k8s control node, create a service account
[root@master1 ~]# kubectl create serviceaccount monitor -n monitor-sa
serviceaccount/monitor created
#Bind the monitor service account to a clusterrole with a clusterrolebinding
[root@master1 ~]# kubectl create clusterrolebinding monitor-clusterrolebinding -n monitor-sa --clusterrole=cluster-admin --serviceaccount=monitor-sa:monitor
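Note that cluster-admin keeps the lab simple but grants far more than Prometheus needs. In production, a dedicated least-privilege ClusterRole is safer; a sketch, where the resource list is an assumption modelled on the scrape jobs used in this article:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-scraper   #hypothetical name
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/proxy", "nodes/metrics", "services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]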
1.3.2 Create the data directory
#Create the directory that will store the data, on the node1 worker node:
[root@node1 ~]# mkdir /data
[root@node1 ~]# chmod 777 /data/
1.3.3 Install the Prometheus service
All of the following steps are performed on the k8s control node.
Create a configmap volume to hold the Prometheus configuration.
The prometheus-cfg.yaml file is in the courseware; upload it to your k8s control node:
[root@master1 ~]# cat prometheus-cfg.yaml
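The full file ships with the courseware. As a sketch of its shape (assuming only the node scrape job is configured at this stage; the complete configuration appears in section 1.6), it looks like:

#Sketch only; the courseware file contains the full scrape configuration.
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitor-sa
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'   #rewrite the kubelet port to node-exporter's 9100
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)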
#Apply the configmap with kubectl apply
[root@master1 ~]# kubectl apply -f prometheus-cfg.yaml
configmap/prometheus-config created
Deploy Prometheus with a Deployment.
The prometheus-2-2-1.tar.gz image needed by the Prometheus server is in the courseware; upload it to the k8s worker node node1 and load it manually:
docker load -i prometheus-2-2-1.tar.gz
The prometheus-deploy.yaml file is in the courseware; upload it to your k8s control node:
[root@master1 ~]# cat prometheus-deploy.yaml
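The full manifest is in the courseware. A trimmed sketch of its essential fields (they match the fuller Prometheus + Alertmanager variant shown in section 1.6):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
  template:
    metadata:
      labels:
        app: prometheus
        component: server
    spec:
      nodeName: node1               #pin to the node that owns /data
      serviceAccountName: monitor
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.1
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        ports:
        - containerPort: 9090
        volumeMounts:
        - mountPath: /etc/prometheus
          name: prometheus-config
        - mountPath: /prometheus/
          name: prometheus-storage-volume
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      - name: prometheus-storage-volume
        hostPath:
          path: /data
          type: Directory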
Note: prometheus-deploy.yaml contains a nodeName field that decides which node the Prometheus pod is scheduled to. We set nodeName: node1, i.e. the pod is scheduled to node1, because node1 is where we created the /data directory. Remember: whichever node of the k8s cluster you created /data on is the node the pod must be scheduled to.
#Apply Prometheus with kubectl apply
[root@master1 ~]# kubectl apply -f prometheus-deploy.yaml
deployment.apps/prometheus-server created
#Check whether Prometheus deployed successfully
[root@master1 ~]# kubectl get pods -n monitor-sa
NAME READY STATUS RESTARTS AGE
node-exporter-7cjhw 1/1 Running 0 6m33s
node-exporter-8m2fp 1/1 Running 0 6m33s
node-exporter-c6sdq 1/1 Running 0 6m33s
prometheus-server-6fffccc6c9-bhbpz 1/1 Running 0 26s
Create a service for the Prometheus pod.
The prometheus-svc.yaml file is in the courseware; upload it to the k8s control node. Its content:
cat prometheus-svc.yaml
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  type: NodePort
  ports:
    - port: 9090
      targetPort: 9090
      protocol: TCP
  selector:
    app: prometheus
    component: server
#Apply the service with kubectl apply
[root@master1 ~]# kubectl apply -f prometheus-svc.yaml
service/prometheus created
#Check which host port the service is mapped to
[root@master1 ~]# kubectl get svc -n monitor-sa
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus NodePort 10.103.98.225 <none> 9090:30009/TCP 27s
The service is mapped to host port 30009, so browsing to <control-node-ip>:30009 reaches the Prometheus web UI.
#Access the Prometheus web UI
Open the following address in a browser (Firefox in this lab):
http://192.168.40.130:30009/graph
You should see the Prometheus graph page.
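Besides the browser, you can confirm the server answers queries through its HTTP API, using the NodePort above:

curl 'http://192.168.40.130:30009/api/v1/query?query=up'
#every healthy target should report a value of 1 in the JSON result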

1.4 Install and configure the Grafana visualization UI
The heapster-grafana-amd64_v5_0_4.tar.gz image needed by Grafana is in the courseware; upload it to every k8s control node and every worker node, then load it manually on each:
docker load -i heapster-grafana-amd64_v5_0_4.tar.gz
The grafana.yaml file is in the courseware; upload it to the k8s control node. Its content:
[root@master1 ~]# cat grafana.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monitoring-grafana
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      task: monitoring
      k8s-app: grafana
  template:
    metadata:
      labels:
        task: monitoring
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: k8s.gcr.io/heapster-grafana-amd64:v5.0.4
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/ssl/certs
          name: ca-certificates
          readOnly: true
        - mountPath: /var
          name: grafana-storage
        env:
        - name: INFLUXDB_HOST
          value: monitoring-influxdb
        - name: GF_SERVER_HTTP_PORT
          value: "3000"
          # The following env variables are required to make Grafana accessible via
          # the kubernetes api-server proxy. On production clusters, we recommend
          # removing these env variables, setup auth for grafana, and expose the grafana
          # service using a LoadBalancer or a public IP.
        - name: GF_AUTH_BASIC_ENABLED
          value: "false"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          value: Admin
        - name: GF_SERVER_ROOT_URL
          # If you're only using the API Server proxy, set this value instead:
          # value: /api/v1/namespaces/kube-system/services/monitoring-grafana/proxy
          value: /
      volumes:
      - name: ca-certificates
        hostPath:
          path: /etc/ssl/certs
      - name: grafana-storage
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    # For use as a Cluster add-on (https://github.com/kubernetes/kubernetes/tree/master/cluster/addons)
    # If you are NOT using this as an addon, you should comment out this line.
    kubernetes.io/cluster-service: 'true'
    kubernetes.io/name: monitoring-grafana
  name: monitoring-grafana
  namespace: kube-system
spec:
  # In a production setup, we recommend accessing Grafana through an external Loadbalancer
  # or through a public IP.
  # type: LoadBalancer
  # You could also use NodePort to expose the service at a randomly-generated port
  # type: NodePort
  ports:
  - port: 80
    targetPort: 3000
  selector:
    k8s-app: grafana
  type: NodePort
#Apply the yaml file
[root@master1 ~]# kubectl apply -f grafana.yaml
deployment.apps/monitoring-grafana created
service/monitoring-grafana created
#Verify the installation
[root@master1 ~]# kubectl get pods -n kube-system| grep monitor
monitoring-grafana-675798bf47-4rp2b 1/1 Running 0
#Check Grafana's front-end service
[root@master1 ~]# kubectl get svc -n kube-system | grep grafana
monitoring-grafana NodePort 10.100.56.76 <none> 80:30989/TCP
#Log in to Grafana by opening the following address in a browser:
192.168.40.130:30989
You should see the Grafana start page.

#Configure the Grafana web UI
Start configuring Grafana in the browser:
Select "Create your first data source". On the form that appears, enter:

Name: Prometheus

Type: Prometheus

URL (under HTTP):

http://prometheus.monitor-sa.svc:9090

With the page filled in as above,

click Save & Test at the bottom left. If "Data source is working" appears, Grafana has successfully connected to the Prometheus data source.
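If you prefer configuration-as-code over clicking through the UI, Grafana (5.0 and later) can also provision the same data source from a file dropped into its provisioning directory; a sketch, assuming the standard path inside the Grafana container:

# /etc/grafana/provisioning/datasources/prometheus.yaml (sketch)
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  url: http://prometheus.monitor-sa.svc:9090
  isDefault: true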

To import monitoring dashboards, search at the following link:
https://grafana.com/dashboards?dataSource=prometheus&search=kubernetes

You can directly import node_exporter.json, a dashboard that displays node metrics, or docker_rev1.json, which displays container resource metrics. Both node_exporter.json and docker_rev1.json are in the courseware.

To import a dashboard, follow these steps:

Once the Save & Test check above passes, return to the Grafana home page.

Click Import under the + icon on the left.

On the import screen, choose "Upload json file".

Select a local JSON file; here we pick the node_exporter.json file mentioned above.

Note: the Name shown on the import form (marked by the arrow in the screenshot) is defined inside node_exporter.json.

In the Prometheus drop-down, select the Prometheus data source, then click Import to open the dashboard.

Import the docker_rev1.json dashboard with the same steps as node_exporter.json; once imported, it is displayed the same way.

1.5 The kube-state-metrics component explained
1.5.1 What is kube-state-metrics?
kube-state-metrics listens to the API server and generates state metrics for resource objects such as Deployments, Nodes, and Pods. Note that it only exposes the metrics; it does not store them, so we use Prometheus to scrape and store the data. It focuses on business-level metadata such as Deployment and Pod state and replica counts, answering questions like: how many replicas were scheduled, and how many are currently available? How many Pods are running/stopped/terminated? How many times has a Pod restarted? How many Jobs are running?
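Once Prometheus scrapes kube-state-metrics (installed below), those questions translate directly into PromQL. A few illustrative queries, using standard kube-state-metrics v1.x metric names:

# desired vs. currently available replicas, per Deployment
kube_deployment_spec_replicas
kube_deployment_status_replicas_available
# number of Pods in each phase (Running, Pending, Failed, ...)
sum by (phase) (kube_pod_status_phase)
# container restart counts (the Pod_restarts alert in section 1.6 builds on this)
kube_pod_container_status_restarts_total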
1.5.2 Install and configure kube-state-metrics
Create a service account and grant it permissions.
On the k8s control node create the kube-state-metrics-rbac.yaml file (it is in the courseware; download it to the control node). Its content:
[root@master1 ~]# cat kube-state-metrics-rbac.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources: ["daemonsets", "deployments", "replicasets"]
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources: ["cronjobs", "jobs"]
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
Apply the yaml file with kubectl apply:
[root@master1 ~]# kubectl apply -f kube-state-metrics-rbac.yaml
serviceaccount/kube-state-metrics created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
Install the kube-state-metrics component.
The image it needs is in the courseware; upload it to every k8s worker node and load it manually:
docker load -i kube-state-metrics_1_9_0.tar.gz
On the k8s master1 node create the kube-state-metrics-deploy.yaml file (it is in the courseware; download it yourself). Its content:
[root@master1 ~]# cat kube-state-metrics-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: quay.io/coreos/kube-state-metrics:v1.9.0
        ports:
        - containerPort: 8080
Apply the yaml file with kubectl apply:
[root@master1 ~]# kubectl apply -f kube-state-metrics-deploy.yaml
deployment.apps/kube-state-metrics created
Check whether kube-state-metrics deployed successfully:
[root@master1 ~]# kubectl get pods -n kube-system -l app=kube-state-metrics
NAME READY STATUS RESTARTS AGE
kube-state-metrics-58d4957bc5-9thsw 1/1 Running 0 30s
Create the service.
On the k8s control node create the kube-state-metrics-svc.yaml file (it is in the courseware; upload it to the control node). Its content:
[root@master1 ~]# cat kube-state-metrics-svc.yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app: kube-state-metrics
spec:
  ports:
  - name: kube-state-metrics
    port: 8080
    protocol: TCP
  selector:
    app: kube-state-metrics
Apply the yaml with kubectl apply:
[root@master1 ~]# kubectl apply -f kube-state-metrics-svc.yaml
service/kube-state-metrics created
Check whether the service was created:
[root@master1 ~]# kubectl get svc -n kube-system | grep kube-state-metrics
kube-state-metrics ClusterIP 10.105.160.224 <none> 8080/TCP
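A quick sanity check from inside the cluster, using the ClusterIP shown above:

curl http://10.105.160.224:8080/metrics | head
#the first lines should be kube_... series generated by kube-state-metrics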
In the Grafana web UI, import Kubernetes Cluster (Prometheus)-1577674936972.json and Kubernetes cluster monitoring (via Prometheus) (k8s 1.16)-1577691996738.json; both files are in the courseware.
After importing Kubernetes Cluster (Prometheus)-1577674936972.json, the cluster dashboard appears.

After importing Kubernetes cluster monitoring (via Prometheus) (k8s 1.16)-1577691996738.json, the monitoring dashboard appears.

1.6 Install and configure Alertmanager to send alerts to a QQ mailbox
On the k8s master1 node create the alertmanager-cm.yaml file (it is in the courseware; upload it directly to the master1 node). Its content:
[root@master1 ~]# cat alertmanager-cm.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: monitor-sa
data:
  alertmanager.yml: |-
    global:
      resolve_timeout: 1m
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: '1501157****@163.com'
      smtp_auth_username: '1501157****'
      smtp_auth_password: 'FLWYKIDBNBAIFFXV'
      smtp_require_tls: false
    route:                        #alert routing policy
      group_by: [alertname]       #the label used to group alerts
      group_wait: 10s             #wait 10s after an alert fires, so alerts of the same group are sent together
      group_interval: 10s         #interval between two groups of alerts
      repeat_interval: 10m        #interval before re-sending an unresolved alert, to throttle duplicate mail
      receiver: default-receiver  #the default receiver
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: '1980570***@qq.com'
        send_resolved: true
Notes on the Alertmanager configuration:
smtp_smarthost: 'smtp.163.com:25'
#the SMTP server address and port of the mailbox used to send mail
smtp_from: '1501157****@163.com'
#the mailbox the alerts are sent from
smtp_auth_username: '1501157****'
#the sending mailbox's authentication user, not the full mailbox address
smtp_auth_password: 'FLWYKIDBNBAIFFXV'
#the sending mailbox's authorization code, not its login password
email_configs:
- to: '1980570***@qq.com'
#the mailbox alerts are delivered to; put your own address here, and it should not be the same mailbox as smtp_from
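Before loading a new Alertmanager configuration into the cluster, it can save a redeploy cycle to validate it locally with amtool, the CLI shipped with Alertmanager. Assuming the alertmanager.yml block above has been saved to a local file:

amtool check-config alertmanager.yml
#prints the receivers and routes it found, or the parse error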
#Apply the file with kubectl apply
[root@master1 ~]# kubectl apply -f alertmanager-cm.yaml
configmap/alertmanager created
On the k8s master1 node create the prometheus-alertmanager-cfg.yaml file (it is in the courseware; upload it to the master1 node). Its content:
[root@master1 ~]# cat prometheus-alertmanager-cfg.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitor-sa
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kubernetes-node-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
    - job_name: 'kubernetes-schedule'
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.40.130:10251']
    - job_name: 'kubernetes-controller-manager'
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.40.130:10252']
    - job_name: 'kubernetes-kube-proxy'
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.40.130:10249','192.168.40.131:10249']
    - job_name: 'kubernetes-etcd'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/ca.crt
        cert_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.crt
        key_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.key
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.40.130:2379']
  rules.yml: |
    groups:
    - name: example
      rules:
      - alert: kube-proxy CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 80%"
      - alert: kube-proxy CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 90%"
      - alert: scheduler CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 80%"
      - alert: scheduler CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 90%"
      - alert: controller-manager CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 80%"
      - alert: controller-manager CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 90%"
      - alert: apiserver CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 80%"
      - alert: apiserver CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 90%"
      - alert: etcd CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 80%"
      - alert: etcd CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 90%"
      - alert: kube-state-metrics CPU usage above 80%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.k8s_app}} component on {{$labels.instance}} is above 80%"
          value: "{{ $value }}%"
          threshold: "80%"
      - alert: kube-state-metrics CPU usage above 90%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.k8s_app}} component on {{$labels.instance}} is above 90%"
          value: "{{ $value }}%"
          threshold: "90%"
      - alert: coredns CPU usage above 80%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.k8s_app}} component on {{$labels.instance}} is above 80%"
          value: "{{ $value }}%"
          threshold: "80%"
      - alert: coredns CPU usage above 90%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.k8s_app}} component on {{$labels.instance}} is above 90%"
          value: "{{ $value }}%"
          threshold: "90%"
      - alert: kube-proxy open file descriptors above 600
        expr: process_open_fds{job=~"kubernetes-kube-proxy"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kube-proxy open file descriptors above 1000
        expr: process_open_fds{job=~"kubernetes-kube-proxy"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-schedule open file descriptors above 600
        expr: process_open_fds{job=~"kubernetes-schedule"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-schedule open file descriptors above 1000
        expr: process_open_fds{job=~"kubernetes-schedule"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager open file descriptors above 600
        expr: process_open_fds{job=~"kubernetes-controller-manager"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager open file descriptors above 1000
        expr: process_open_fds{job=~"kubernetes-controller-manager"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver open file descriptors above 600
        expr: process_open_fds{job=~"kubernetes-apiserver"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver open file descriptors above 1000
        expr: process_open_fds{job=~"kubernetes-apiserver"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-etcd open file descriptors above 600
        expr: process_open_fds{job=~"kubernetes-etcd"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-etcd open file descriptors above 1000
        expr: process_open_fds{job=~"kubernetes-etcd"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: coredns
        expr: process_open_fds{k8s_app=~"kube-dns"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "plugin {{$labels.k8s_app}} ({{$labels.instance}}): more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: coredns
        expr: process_open_fds{k8s_app=~"kube-dns"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "plugin {{$labels.k8s_app}} ({{$labels.instance}}): more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kube-proxy
        expr: process_virtual_memory_bytes{job=~"kubernetes-kube-proxy"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): virtual memory usage above 2G"
          value: "{{ $value }}"
      - alert: scheduler
        expr: process_virtual_memory_bytes{job=~"kubernetes-schedule"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): virtual memory usage above 2G"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager
        expr: process_virtual_memory_bytes{job=~"kubernetes-controller-manager"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): virtual memory usage above 2G"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver
        expr: process_virtual_memory_bytes{job=~"kubernetes-apiserver"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): virtual memory usage above 2G"
          value: "{{ $value }}"
      - alert: kubernetes-etcd
        expr: process_virtual_memory_bytes{job=~"kubernetes-etcd"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): virtual memory usage above 2G"
          value: "{{ $value }}"
      - alert: kube-dns
        expr: process_virtual_memory_bytes{k8s_app=~"kube-dns"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "plugin {{$labels.k8s_app}} ({{$labels.instance}}): virtual memory usage above 2G"
          value: "{{ $value }}"
      - alert: HttpRequestsAvg
        expr: sum(rate(rest_client_requests_total{job=~"kubernetes-kube-proxy|kubernetes-kubelet|kubernetes-schedule|kubernetes-controller-manager|kubernetes-apiserver"}[1m])) > 1000
        for: 2s
        labels:
          team: admin
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): TPS above 1000"
          value: "{{ $value }}"
          threshold: "1000"
      - alert: Pod_restarts
        expr: kube_pod_container_status_restarts_total{namespace=~"kube-system|default|monitor-sa"} > 0
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "container {{$labels.container}} of pod {{$labels.pod}} in namespace {{$labels.namespace}} was restarted; this metric was collected from {{$labels.instance}}"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Pod_waiting
        expr: kube_pod_container_status_waiting_reason{namespace=~"kube-system|default"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "namespace {{$labels.namespace}} ({{$labels.instance}}): container {{$labels.container}} of pod {{$labels.pod}} is stuck waiting at startup"
          value: "{{ $value }}"
          threshold: "1"
      - alert: Pod_terminated
        expr: kube_pod_container_status_terminated_reason{namespace=~"kube-system|default|monitor-sa"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "namespace {{$labels.namespace}} ({{$labels.instance}}): container {{$labels.container}} of pod {{$labels.pod}} has been terminated"
          value: "{{ $value }}"
          threshold: "1"
      - alert: Etcd_leader
        expr: etcd_server_has_leader{job="kubernetes-etcd"} == 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): currently has no leader"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_leader_changes
        expr: rate(etcd_server_leader_changes_seen_total{job="kubernetes-etcd"}[1m]) > 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): the leader has changed"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_failed
        expr: rate(etcd_server_proposals_failed_total{job="kubernetes-etcd"}[1m]) > 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): proposal failures detected"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_db_total_size
        expr: etcd_debugging_mvcc_db_total_size_in_bytes{job="kubernetes-etcd"} > 10000000000
        for: 2s
        labels:
          team: admin
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): db size exceeds 10G"
          value: "{{ $value }}"
          threshold: "10G"
      - alert: Endpoint_ready
        expr: kube_endpoint_address_not_ready{namespace=~"kube-system|default"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "namespace {{$labels.namespace}} ({{$labels.instance}}): endpoint {{$labels.endpoint}} is not ready"
          value: "{{ $value }}"
          threshold: "1"
    - name: physical-node-status-alerts
      rules:
      - alert: physical node CPU usage
        expr: 100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)*100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "CPU usage on {{ $labels.instance }} is too high"
          description: "CPU usage on {{ $labels.instance }} is above 90% (current value: {{ $value }}); please investigate"
      - alert: physical node memory usage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "memory usage on {{ $labels.instance }} is too high"
          description: "memory usage on {{ $labels.instance }} is above 90% (current value: {{ $value }}); please investigate"
      - alert: InstanceDown
        expr: up == 0
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: server is down"
          description: "{{ $labels.instance }}: the server has been unreachable for more than 2 minutes"
      - alert: physical node disk IO performance
        expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) < 60
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "disk IO usage on {{$labels.mountpoint}} is too high!"
          description: "disk IO on {{$labels.mountpoint}} is above 60% (current value: {{$value}})"
      - alert: inbound network bandwidth
        expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "inbound network bandwidth on {{$labels.mountpoint}} is too high!"
          description: "inbound bandwidth on {{$labels.mountpoint}} has stayed above 100M for 5 minutes; RX bandwidth usage: {{$value}}"
      - alert: outbound network bandwidth
        expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "outbound network bandwidth on {{$labels.mountpoint}} is too high!"
          description: "outbound bandwidth on {{$labels.mountpoint}} has stayed above 100M for 5 minutes; TX bandwidth usage: {{$value}}"
      - alert: TCP sessions
        expr: node_netstat_Tcp_CurrEstab > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "too many TCP_ESTABLISHED sessions on {{$labels.mountpoint}}!"
          description: "more than 1000 established TCP sessions on {{$labels.mountpoint}} (current value: {{$value}})"
      - alert: disk capacity
        expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 80
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "partition usage on {{$labels.mountpoint}} is too high!"
          description: "partition usage on {{$labels.mountpoint}} is above 80% (current value: {{$value}}%)"
Notes on the configuration file:
- job_name: 'kubernetes-schedule'
  scrape_interval: 5s
  static_configs:
  - targets: ['192.168.40.130:10251']  #master1 node ip : scheduler port
- job_name: 'kubernetes-controller-manager'
  scrape_interval: 5s
  static_configs:
  - targets: ['192.168.40.130:10252']  #master1 node ip : controller-manager port
- job_name: 'kubernetes-kube-proxy'
  scrape_interval: 5s
  static_configs:
  - targets: ['192.168.40.130:10249','192.168.40.131:10249']  #master1 and node1 node ips : kube-proxy port
- job_name: 'kubernetes-etcd'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/ca.crt
    cert_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.crt
    key_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.key
  scrape_interval: 5s
  static_configs:
  - targets: ['192.168.40.130:2379']  #master1 node ip : etcd port
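Note also how the kubernetes-service-endpoints and kubernetes-pods jobs work: they scrape whatever opts in through annotations, which is why the kube-state-metrics Service in section 1.5 carries prometheus.io/scrape: 'true'. A hypothetical application Service opting in on a custom port and path would look like this (the name and port are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: my-app                     #hypothetical application
  namespace: default
  annotations:
    prometheus.io/scrape: 'true'   #picked up by the keep rule
    prometheus.io/path: '/metrics' #rewrites __metrics_path__
    prometheus.io/port: '8080'     #rewrites __address__
spec:
  ports:
  - port: 8080
  selector:
    app: my-app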
#Update the resource manifests
[root@master1 ~]# kubectl delete -f prometheus-cfg.yaml
configmap "prometheus-config" deleted
[root@master1 ~]# kubectl apply -f prometheus-alertmanager-cfg.yaml
configmap/prometheus-config created
Install Prometheus and Alertmanager.
Upload the alertmanager.tar.gz image archive to every k8s node and load it manually:
docker load -i alertmanager.tar.gz
On the k8s master1 node create the prometheus-alertmanager-deploy.yaml file (it is in the courseware; upload it to the master1 node). Its content:
[root@master1 ~]# cat prometheus-alertmanager-deploy.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
    #matchExpressions:
    #- {key: app, operator: In, values: [prometheus]}
    #- {key: component, operator: In, values: [server]}
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      nodeName: node1
      serviceAccountName: monitor
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.1
        imagePullPolicy: IfNotPresent
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=24h"
        - "--web.enable-lifecycle"
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus
          name: prometheus-config
        - mountPath: /prometheus/
          name: prometheus-storage-volume
        - name: k8s-certs
          mountPath: /var/run/secrets/kubernetes.io/k8s-certs/etcd/
      - name: alertmanager
        image: prom/alertmanager:v0.14.0
        imagePullPolicy: IfNotPresent
        args:
        - "--config.file=/etc/alertmanager/alertmanager.yml"
        - "--log.level=debug"
        ports:
        - containerPort: 9093
          protocol: TCP
          name: alertmanager
        volumeMounts:
        - name: alertmanager-config
          mountPath: /etc/alertmanager
        - name: alertmanager-storage
          mountPath: /alertmanager
        - name: localtime
          mountPath: /etc/localtime
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      - name: prometheus-storage-volume
        hostPath:
          path: /data
          type: Directory
      - name: k8s-certs
        secret:
          secretName: etcd-certs
      - name: alertmanager-config
        configMap:
          name: alertmanager
      - name: alertmanager-storage
        hostPath:
          path: /data/alertmanager
          type: DirectoryOrCreate
      - name: localtime
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
Notes:
The manifest pins the pod with nodeName: node1; replace this with a node name from your own environment.
Create the etcd-certs secret, which the Prometheus deployment needs:
[root@master1 ~]# kubectl -n monitor-sa create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/server.key --from-file=/etc/kubernetes/pki/etcd/server.crt --from-file=/etc/kubernetes/pki/etcd/ca.crt
Apply the yaml file with kubectl apply:
[root@master1 ~]# kubectl delete -f prometheus-deploy.yaml
[root@master1 ~]# kubectl apply -f prometheus-alertmanager-deploy.yaml
deployment.apps/prometheus-server created
#Check whether Prometheus deployed successfully
kubectl get pods -n monitor-sa | grep prometheus
The output below shows the pod in Running state, so Prometheus deployed successfully:
prometheus-server-6c46df5b6-4l9b4 2/2 Running 0 38s
On the k8s master1 node create the alertmanager-svc.yaml file (it is in the courseware; upload it manually to the master1 node). Its content:
[root@master1 ~]# cat alertmanager-svc.yaml
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: prometheus
    kubernetes.io/cluster-service: 'true'
  name: alertmanager
  namespace: monitor-sa
spec:
  ports:
  - name: alertmanager
    nodePort: 30066
    port: 9093
    protocol: TCP
    targetPort: 9093
  selector:
    app: prometheus
  sessionAffinity: None
  type: NodePort
#Apply the yaml file with kubectl apply
[root@master1 ~]# kubectl apply -f alertmanager-svc.yaml
service/alertmanager created
#Check which host ports the services are mapped to
[root@master1 ~]# kubectl get svc -n monitor-sa
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
alertmanager NodePort 10.98.142.161 <none> 9093:30066/TCP 56s
prometheus NodePort 10.103.98.225 <none> 9090:30009/TCP 56m
Note: the prometheus service is exposed on port 30009, and the alertmanager service on port 30066.
Access the Prometheus web UI and
click Status -> Targets to see the scrape targets.

On the targets page you will find that kubernetes-controller-manager and kubernetes-schedule both fail to connect on their configured ports.

Fix it as follows:
vim /etc/kubernetes/manifests/kube-scheduler.yaml
Make these changes:
change --bind-address=127.0.0.1 to --bind-address=192.168.40.130
change the hosts under the httpGet: fields from 127.0.0.1 to 192.168.40.130
remove --port=0
#Note:
192.168.40.130 is the ip of the k8s control node master1
vim /etc/kubernetes/manifests/kube-controller-manager.yaml
change --bind-address=127.0.0.1 to --bind-address=192.168.40.130
change the hosts under the httpGet: fields from 127.0.0.1 to 192.168.40.130
remove --port=0
After the changes, run the following on every k8s node:
systemctl restart kubelet
kubectl get cs
The output shows:
NAME STATUS MESSAGE ERROR
controller-manager Healthy ok
scheduler Healthy ok
etcd-0 Healthy {"health":"true"}
ss -antulp | grep :10251
ss -antulp | grep :10252
You can see that the corresponding ports are now listening on the host.
Click Status -> Targets again; both targets are now up.

The kubernetes-kube-proxy target, however, still shows as unreachable:

This is because kube-proxy's metrics port 10249 listens on 127.0.0.1 by default; it must listen on the host address instead. Change it as follows (in production it is safer to make this change when installing k8s in the first place):
kubectl edit configmap kube-proxy -n kube-system
change the metricsBindAddress line to metricsBindAddress: 0.0.0.0:10249
Then restart the kube-proxy pods:
kubectl get pods -n kube-system | grep kube-proxy |awk '{print $1}' | xargs kubectl delete pods -n kube-system
ss -antulp |grep :10249
The output shows:
[root@k8s-master ~]# ss -antulp | grep :10249
tcp LISTEN 0 128 [::]:10249
Click Alerts to see the alert list.

Expand kubernetes-etcd to see its alert details.

FIRING means Prometheus has already sent the alert to Alertmanager; in Alertmanager you can see one alert.

Log in to the Alertmanager web UI by opening 192.168.40.130:30066 in a browser.

The alert then arrives in the QQ mailbox, 1980570***@qq.com.

After modifying any of the Prometheus configuration files, re-apply them with kubectl in this order:

kubectl delete -f alertmanager-cm.yaml

kubectl apply -f alertmanager-cm.yaml

kubectl delete -f prometheus-alertmanager-cfg.yaml

kubectl apply -f prometheus-alertmanager-cfg.yaml

kubectl delete -f prometheus-alertmanager-deploy.yaml
kubectl apply -f prometheus-alertmanager-deploy.yaml
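Because the Deployment starts Prometheus with --web.enable-lifecycle, a configuration-only change can also be hot-reloaded without recreating the pod: re-apply the ConfigMap, wait for the kubelet to sync it into the container, then POST to the lifecycle endpoint (the NodePort below is this lab's):

curl -X POST http://192.168.40.130:30009/-/reload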
