
Monitor and Alert on Cloud-Native Applications in 6 Steps (Worth Bookmarking)

Once a cloud-native system is up and running, you need observability and alerting to understand how the whole system behaves. Monitoring and alerting built on Prometheus is the standard industry solution for cloud native, and every cloud-native practitioner should know it.

This article walks through hands-on monitoring and alerting for a cloud-native application, using a Spring Boot app as the example; theory is kept to a minimum. Once the hands-on steps are clear, the underlying theory of the whole stack becomes much easier to absorb.

1. Choosing the monitoring and alerting stack

A Kubernetes cluster is complex to monitor: there are container-level resource metrics, node metrics, and metrics from the business applications running in the cluster. Faced with this volume of metrics, a traditional solution such as Zabbix does not support cloud-native monitoring well.

A better fit is Prometheus. Prometheus is tightly coupled with the cloud-native ecosystem and has become the de facto standard for monitoring in it. The following sections build a Prometheus-based monitoring and alerting stack step by step.

Prometheus works on a pull model: it actively scrapes metrics from the monitored systems, stores them in its own time-series database, and then either renders them in dashboards or evaluates alerting rules against them to fire alerts. Each monitored system must expose an HTTP endpoint for Prometheus to scrape. The flow is shown below:
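As a quick illustration (not from the original article), a scrape target simply serves plain-text metrics over HTTP in the Prometheus exposition format; the metric name and value below are only an example:

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 12.34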

2. Prerequisites

This article assumes Docker and Kubernetes are already installed, and that a Spring Boot application is already deployed in the K8s cluster.

Assume the cluster has 4 nodes: k8s-master (10.20.1.21), k8s-worker-1 (10.20.1.22), k8s-worker-2 (10.20.1.23), and k8s-worker-3 (10.20.1.24).

3. Install Prometheus

3.1. Create the namespace on the k8s-master node

kubectl create ns monitoring 

3.2. Prepare the ConfigMap file

Prepare the ConfigMap file prometheus-config.yaml. For now it only contains a scrape job for Prometheus's own metrics; this file will be extended later:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']

3.3. Create the ConfigMap

kubectl apply -f prometheus-config.yaml

3.4. Prepare the Prometheus Deployment file

Prepare the Prometheus Deployment file prometheus-deploy.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - image: prom/prometheus:v2.31.1
        name: prometheus
        securityContext:
          runAsUser: 0
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus" # TSDB data path
        - "--storage.tsdb.retention.time=24h"
        - "--web.enable-admin-api" # enables the admin HTTP API, including features such as deleting time series
        - "--web.enable-lifecycle" # enables hot reload: POST to localhost:9090/-/reload takes effect immediately
        ports:
        - containerPort: 9090
          name: http
        volumeMounts:
        - mountPath: "/etc/prometheus"
          name: config-volume
        - mountPath: "/prometheus"
          name: data
        resources:
          requests:
            cpu: 200m
            memory: 1024Mi
          limits:
            cpu: 200m
            memory: 1024Mi
      - image: jimmidyson/configmap-reload:v0.4.0 # reloads Prometheus when its ConfigMap changes
        name: prometheus-reload
        securityContext:
          runAsUser: 0
        args:
        - "--volume-dir=/etc/config"
        - "--webhook-url=http://localhost:9090/-/reload"
        volumeMounts:
        - mountPath: "/etc/config"
          name: config-volume
        resources:
          requests:
            cpu: 100m
            memory: 50Mi
          limits:
            cpu: 100m
            memory: 50Mi
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: prometheus-data
      - configMap:
          name: prometheus-config
        name: config-volume
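Because --web.enable-lifecycle is enabled, configuration changes can be applied without restarting the Pod. A minimal sketch, assuming the NodePort address used later in this article (section 3.12); substitute whatever `kubectl get svc -n monitoring` reports for your cluster:

# trigger a live reload of prometheus.yml (no Pod restart needed)
curl -X POST http://10.20.1.21:32459/-/reload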

3.5. Prepare the Prometheus storage file

Prepare the Prometheus storage file prometheus-storage.yaml:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-local
  labels:
    app: prometheus
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 20Gi
  storageClassName: local-storage
  local:
    path: /data/k8s/prometheus # make sure this directory exists on the node
  persistentVolumeReclaimPolicy: Retain
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - k8s-worker-2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: prometheus
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: local-storage

Here k8s-worker-2 is used as the storage node; change it to your own node name and create the directory /data/k8s/prometheus on that node. The time-series data ends up in this directory, as shown below:

The YAML above uses PV, PVC, and StorageClass; a follow-up article will cover them in detail. In short, they are the components that provision storage resources for Pods.
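A quick sketch of the preparation step on the storage node; the node name and path follow the manifest above:

# run on k8s-worker-2: create the local path backing the PersistentVolume
mkdir -p /data/k8s/prometheus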

3.6. Create the storage resources

kubectl apply -f prometheus-storage.yaml

3.7. Prepare the RBAC file

Prepare the ServiceAccount, role, and permission file prometheus-rbac.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "extensions"
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring

3.8. Create the RBAC resources

kubectl apply -f prometheus-rbac.yaml

3.9. Create the Deployment

kubectl apply -f prometheus-deploy.yaml

3.10. Prepare the Service file

Prepare the Service file prometheus-svc.yaml. It uses the NodePort type so that Prometheus can be accessed from outside the cluster:

apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  selector:
    app: prometheus
  type: NodePort
  ports:
  - name: web
    port: 9090
    targetPort: http

3.11. Create the Service

kubectl apply -f prometheus-svc.yaml

3.12. Access Prometheus

Run kubectl get svc -n monitoring to find the exposed NodePort; Prometheus is then reachable on that port via any cluster node, for example http://10.20.1.21:32459/. On the Targets page you can see the scrape target configured in prometheus-config.yaml above:
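For example (output shortened; the ClusterIP and NodePort values are illustrative and will differ in your cluster):

kubectl get svc -n monitoring
# NAME         TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
# prometheus   NodePort   10.96.23.154   <none>        9090:32459/TCP   5m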

Prometheus is now installed; next, install Grafana.

4. Install Grafana

Prometheus's built-in graphing is fairly limited, so Grafana is usually used to visualize Prometheus data. Install it as follows.

4.1. Prepare the Grafana deployment file

Prepare the Grafana deployment file grafana-deploy.yaml. It is an all-in-one file containing the Deployment, Service, PV, and PVC:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: grafana-data
      containers:
      - name: grafana
        image: grafana/grafana:8.3.3
        imagePullPolicy: IfNotPresent
        securityContext:
          runAsUser: 0
        ports:
        - containerPort: 3000
          name: grafana
        env:
        - name: GF_SECURITY_ADMIN_USER
          value: admin
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: admin
        readinessProbe:
          failureThreshold: 10
          httpGet:
            path: /api/health
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/health
            port: 3000
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 400m
            memory: 1024Mi
          requests:
            cpu: 200m
            memory: 512Mi
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: storage
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - port: 3000
  selector:
    app: grafana
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-local
  labels:
    app: grafana
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  storageClassName: local-storage
  local:
    path: /data/k8s/grafana # make sure this directory exists on the node
  persistentVolumeReclaimPolicy: Retain
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - k8s-worker-2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-data
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: grafana
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: local-storage

This again uses PV, PVC, and StorageClass. Node affinity selects the k8s-worker-2 node, so create the directory /data/k8s/grafana on that node first.

4.2. Deploy the Grafana resources

kubectl apply -f grafana-deploy.yaml

4.3. Access Grafana

Check the Service port mapping:

Open Grafana at http://10.20.1.21:31881/, log in with the username and password from the deployment file, and then add Prometheus as a data source:
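A note on the data source URL (an assumption on my part, since the original screenshots are not reproduced here): because Prometheus and Grafana both run in the monitoring namespace, the in-cluster Service DNS name of the Service created in section 3.10 can be used:

# Prometheus data source URL inside the cluster
http://prometheus.monitoring.svc.cluster.local:9090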

5. Configure metric scraping

5.1. Scrape node metrics

Before node metrics can be scraped, node-exporter must run on the nodes so that Prometheus has an endpoint to pull the data from.

5.1.1. Prepare the node-exporter deployment file

Prepare the node-exporter deployment file node-exporter-daemonset.yaml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-system
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      nodeSelector:
        kubernetes.io/os: linux
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.3.1
        args:
        - --web.listen-address=$(HOSTIP):9100
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - --path.rootfs=/host/root
        - --no-collector.hwmon # disable collectors that are not needed
        - --no-collector.nfs
        - --no-collector.nfsd
        - --no-collector.nvme
        - --no-collector.dmi
        - --no-collector.arp
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/containerd/.+|/var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
        ports:
        - containerPort: 9100
        env:
        - name: HOSTIP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        resources:
          requests:
            cpu: 150m
            memory: 200Mi
          limits:
            cpu: 300m
            memory: 400Mi
        securityContext:
          runAsNonRoot: true
          runAsUser: 65534
        volumeMounts:
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: root
          mountPath: /host/root
          mountPropagation: HostToContainer
          readOnly: true
      tolerations: # tolerate all taints so the DaemonSet runs on every node
      - operator: "Exists"
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: dev
        hostPath:
          path: /dev
      - name: sys
        hostPath:
          path: /sys
      - name: root
        hostPath:
          path: /
5.1.2. Deploy node-exporter

kubectl apply -f node-exporter-daemonset.yaml

5.1.3. Add the scrape job to Prometheus

Add another job to the earlier prometheus-config.yaml file, as follows:

- job_name: kubernetes-nodes
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - source_labels: [__address__]
    regex: '(.*):10250'
    replacement: '${1}:9100'
    target_label: __address__
    action: replace
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)

The complete prometheus-config.yaml is included at the end of this article.

Shortly after prometheus-config.yaml is updated, several new targets appear on the Targets page, as shown below; these are the objects now being monitored by Prometheus:
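To verify a single exporter directly, you can also hit its metrics endpoint on any node; a small sketch, using one of the worker-node IPs from section 2 and the port 9100 configured in the DaemonSet:

# node-exporter listens on the host network, port 9100
curl -s http://10.20.1.22:9100/metrics | head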

5.2. Scrape Spring Boot Actuator metrics

5.2.1. Configure the Spring Boot application
  • Add the dependencies to the application's pom.xml:
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
  • Configure the application's properties file:
management.endpoint.health.probes.enabled=true
management.health.probes.enabled=true
management.endpoint.health.enabled=true
management.endpoint.health.show-details=always
management.endpoints.web.exposure.include=*
management.endpoints.web.exposure.exclude=env,beans
management.endpoint.shutdown.enabled=true
management.server.port=9090
  • Check the metrics endpoint

After this, rebuild the image and redeploy the application to the K8s cluster (not shown here). Visiting the application's /actuator/prometheus endpoint returns the exposed metrics, similar to the following:
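A sketch of what that check might look like; the Pod IP is a placeholder and the metric values are illustrative (jvm_memory_used_bytes is one of the JVM metrics Micrometer exports):

# management.server.port is 9090, so the metrics are served on that port
curl -s http://<pod-ip>:9090/actuator/prometheus | head
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
# jvm_memory_used_bytes{area="heap",id="PS Eden Space",} 1.23456789E8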

5.2.2. Add the scrape job to Prometheus

Extend the configuration file prometheus-config.yaml again, as follows:

- job_name: 'spring-actuator-many'
  metrics_path: '/actuator/prometheus'
  scrape_interval: 5s
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: 'test1'
    target_label: namespace
    action: keep
  - source_labels: [__address__]
    regex: '(.*):9090'
    target_label: __address__
    action: keep
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)

Roughly, this job keeps only Pods whose namespace is test1 and whose address ends with port 9090, and scrapes them. For more on the relabeling syntax, see the Prometheus documentation.

After a short while, new targets for the Spring Boot application show up:
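Once the targets are up, a quick query in the Prometheus web UI confirms that data is flowing; a minimal example using the built-in up metric:

# number of Spring Boot targets that are currently being scraped successfully
count(up{job="spring-actuator-many"} == 1)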

6. Configure dashboards

Now that the metrics are available, the next step is to build dashboards. Grafana offers a rich catalog of ready-made dashboards to choose from on its website. Below we configure one dashboard for the nodes and one for the Spring Boot application.

A dashboard can be imported in three ways: uploading a JSON file, entering a dashboard ID, or pasting JSON content. The import screen looks like this:

6.1. Node dashboard

On the import screen shown above, choose the dashboard-ID option and enter ID 8919; the resulting dashboard looks like this:

6.2. Spring Boot application dashboard

On the import screen, choose the paste-JSON option and paste the JSON content from https://img.mangod.top/blog/jvm-micrometer.json; the resulting dashboard looks like this:

Node monitoring and Spring Boot application monitoring are now in place. For anything beyond that, consult further resources as needed.

7. Install Alertmanager

With monitoring in place, the next step is the alerting component, Alertmanager. It can be installed with Docker on any node of the K8s cluster.

7.1. Install Alertmanager

7.1.1. Pull the Docker image

docker pull prom/alertmanager:v0.25.0

7.1.2. Create the alerting configuration file

On the node where Alertmanager will run, create the directory /data/prometheus/alertmanager, and in it create the file alertmanager.yml with the following content:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'mail_163'
global:
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '294931067@qq.com'
  smtp_auth_username: '294931067@qq.com'
  # this is the mailbox's SMTP authorization code, not the login password
  smtp_auth_password: 'your-smtp-authorization-code'
  smtp_require_tls: false
receivers:
- name: 'mail_163'
  email_configs:
  - to: 'yclxiao@163.com'
    send_resolved: true
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
7.1.3. Start Alertmanager

docker run --name alertmanager -d -p 9093:9093 -v /data/prometheus/alertmanager:/etc/alertmanager prom/alertmanager:v0.25.0

7.1.4. Access Alertmanager

Once started, Alertmanager is available at http://10.20.1.21:9093/#/alerts; the UI looks like this:
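As a quick sanity check, Alertmanager also exposes a built-in health endpoint; a minimal sketch using the same address:

# an HTTP 200 response means Alertmanager is up
curl http://10.20.1.21:9093/-/healthy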

7.2. Connect Prometheus to Alertmanager

Add the following alerting configuration to prometheus-config.yaml to point Prometheus at Alertmanager; change the target to your own Alertmanager address.
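The block in question (taken from the complete configuration at the end of this article):

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 10.20.1.21:9093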

7.3. Configure alerting rules

7.3.1. Add the rules directory

Add the following configuration to prometheus-config.yaml to tell Prometheus where to load alerting rules from:
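The entry in question (also visible in the complete configuration below):

rule_files:
- /prometheus/rules/*.rules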

Note that /prometheus/ here is Prometheus's storage directory. In this setup it lives on k8s-worker-2, so create a rules/ folder under the Prometheus storage directory on that node, as shown below:

That completes prometheus-config.yaml. The full file is:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - 10.20.1.21:9093
    rule_files:
    - /prometheus/rules/*.rules
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    - job_name: "cadvisor"
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
        replacement: $1
      - replacement: /metrics/cadvisor # <nodeip>/metrics -> <nodeip>/metrics/cadvisor
        target_label: __metrics_path__
    - job_name: kubernetes-nodes
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'spring-actuator-many'
      metrics_path: '/actuator/prometheus'
      scrape_interval: 5s
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: 'test1'
        target_label: namespace
        action: keep
      - source_labels: [__address__]
        regex: '(.*):9090'
        target_label: __address__
        action: keep
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
7.3.2. Write the alerting rules

The rules directory is configured; now write the actual rules. Create two rule files in that directory, as shown above: one with node-level alerting rules and one with alerting rules for the Spring Boot application. Their contents are as follows:

  • Node alerting rules - hoststats-alert.yaml
groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    # node-exporter >= 0.16 (v1.3.1 is deployed above) exposes these metrics with the *_bytes suffix
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
  • Spring Boot application alerting rules - jvm-metrics-rules.yaml
groups:
- name: jvm-metrics-rules
  rules:
  # GC time takes up more than 10% of the last 5 minutes
  - alert: GcTimeTooMuch
    expr: increase(jvm_gc_collection_seconds_sum[5m]) > 30
    for: 5m
    labels:
      severity: red
    annotations:
      summary: "{{ $labels.app }} GC time ratio above 10%"
      message: "ns:{{ $labels.namespace }} pod:{{ $labels.pod }} GC time ratio above 10%, current value ({{ $value }}%)"
  # too many GCs
  - alert: GcCountTooMuch
    expr: increase(jvm_gc_collection_seconds_count[1m]) > 30
    for: 1m
    labels:
      severity: red
    annotations:
      summary: "{{ $labels.app }} more than 30 GCs in 1 minute"
      message: "ns:{{ $labels.namespace }} pod:{{ $labels.pod }} more than 30 GCs in 1 minute, current value ({{ $value }})"
  # too many full GCs
  - alert: FgcCountTooMuch
    expr: increase(jvm_gc_collection_seconds_count{gc="ConcurrentMarkSweep"}[1h]) > 3
    for: 1m
    labels:
      severity: red
    annotations:
      summary: "{{ $labels.app }} more than 3 full GCs in 1 hour"
      message: "ns:{{ $labels.namespace }} pod:{{ $labels.pod }} more than 3 full GCs in 1 hour, current value ({{ $value }})"
  # non-heap memory usage above 80%
  - alert: NonheapUsageTooMuch
    expr: jvm_memory_bytes_used{job="spring-actuator-many", area="nonheap"} / jvm_memory_bytes_max * 100 > 80
    for: 1m
    labels:
      severity: red
    annotations:
      summary: "{{ $labels.app }} non-heap memory usage above 80%"
      message: "ns:{{ $labels.namespace }} pod:{{ $labels.pod }} non-heap memory usage above 80%, current value ({{ $value }}%)"
  # process RSS memory usage warning
  - alert: HeighMemUsage
    expr: process_resident_memory_bytes{job="spring-actuator-many"} / os_total_physical_memory_bytes * 100 > 15
    for: 1m
    labels:
      severity: red
    annotations:
      summary: "{{ $labels.app }} RSS memory usage above 15% of physical memory"
      message: "ns:{{ $labels.namespace }} pod:{{ $labels.pod }} RSS memory usage above 15% of physical memory, current value ({{ $value }}%)"
  # JVM heap memory usage warning
  - alert: JavaHeighMemUsage
    expr: sum(jvm_memory_used_bytes{area="heap",job="spring-actuator-many"}) by(app,instance) / sum(jvm_memory_max_bytes{area="heap",job="spring-actuator-many"}) by(app,instance) * 100 > 85
    for: 1m
    labels:
      severity: red
    annotations:
      summary: "{{ $labels.app }} heap memory usage above 85%"
      message: "ns:{{ $labels.namespace }} pod:{{ $labels.pod }} heap memory usage above 85%, current value ({{ $value }}%)"
  # CPU usage warning
  - alert: JavaHeighCpuUsage
    expr: system_cpu_usage{job="spring-actuator-many"} * 100 > 85
    for: 1m
    labels:
      severity: red
    annotations:
      summary: "{{ $labels.app }} CPU usage above 85%"
      message: "ns:{{ $labels.namespace }} pod:{{ $labels.pod }} CPU usage above 85%, current value ({{ $value }}%)"
  • Once the rule files are in place, restart Alertmanager first, then restart Prometheus:

kubectl delete -f prometheus-deploy.yaml
kubectl apply -f prometheus-deploy.yaml

  • Check the UI

Alertmanager's Status page now shows:

Prometheus's Rules page now shows:

7.3.3. Notes
  • Changes to Alertmanager's configuration only take effect after Alertmanager is restarted.
  • smtp_auth_password in alertmanager.yml is the mailbox's SMTP authorization code, not the mailbox password. The authorization code is configured as shown below, using a QQ mailbox as the example:

Monitoring and alerting based on Prometheus and Grafana is now fully set up.

8. Test the alerts

With everything installed, do a quick test of the alerting. There are two ways:

  • Method 1: lower an alert threshold; you will receive an email like this:

  • Method 2: run cat /dev/zero > /dev/null to drive up CPU usage on a node or in a container (see the sketch below); you will receive an email like this:
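For the container case, a hedged sketch of driving up CPU inside a Pod; the Pod name is a placeholder and it assumes the image ships a shell:

# burn CPU inside a running container until interrupted with Ctrl+C
kubectl exec -it <springboot-pod-name> -n test1 -- sh -c 'cat /dev/zero > /dev/null'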

9. Summary

This article walked through hands-on monitoring and alerting for cloud-native applications with Prometheus and Grafana, so you can get a working setup quickly. Hope it helps!

That's all for this article. Thanks for reading — likes, follows, and bookmarks are welcome!

Original article: 6个步骤搞定云原生应用监控和告警(建议收藏) — 不焦躁的程序员
