SkyWalking Alarms: Usage, Tutorial, and Examples

Original article: SkyWalking--告警--使用/教程, IT利刃出鞘's blog on CSDN

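Introduction

Overview

This article explains how to use the alarm feature of SkyWalking.

SkyWalking can send alarm notifications through WebHook, gRPC, WeChat, DingTalk, Feishu, and other channels.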

Official documentation

Alarm: https://github.com/apache/skywalking/blob/master/docs/en/setup/backend/backend-alarm.md

OAL rule syntax: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/oal.md

Scopes and fields: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/scope-definitions.md

Events: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/event.md

Configuration example

    # Licensed to the Apache Software Foundation (ASF) under one
    # or more contributor license agreements. See the NOTICE file
    # distributed with this work for additional information
    # regarding copyright ownership. The ASF licenses this file
    # to you under the Apache License, Version 2.0 (the
    # "License"); you may not use this file except in compliance
    # with the License. You may obtain a copy of the License at
    #
    # http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    # Sample alarm rules.
    rules:
      # Rule unique name, must be ended with `_rule`.
      service_resp_time_rule:
        metrics-name: service_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 3
        silence-period: 5
        message: 'Service: {name}\n Metric: response time\n Detail: exceeded 1000 ms at least 3 times in the last 10 minutes'
      service_sla_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_sla
        op: "<"
        threshold: 8000
        # The length of time to evaluate the metrics
        period: 10
        # How many times after the metrics match the condition, will trigger alarm
        count: 2
        # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
        silence-period: 3
        message: 'Service: {name}\n Metric: success rate\n Detail: below 80% at least 2 times in the last 10 minutes'
      service_resp_time_percentile_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_percentile
        op: ">"
        threshold: 1000,1000,1000,1000,1000
        period: 10
        count: 3
        silence-period: 5
        # Matches when at least one condition holds: p50>1000, p75>1000, p90>1000, p95>1000, p99>1000
        message: 'Service: {name}\n Metric: response time\n Detail: a percentile exceeded 1000 ms at least 3 times in the last 10 minutes'
      service_instance_resp_time_rule:
        metrics-name: service_instance_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 2
        silence-period: 5
        message: 'Instance: {name}\n Metric: response time\n Detail: exceeded 1000 ms at least 2 times in the last 10 minutes'
      database_access_resp_time_rule:
        metrics-name: database_access_resp_time
        threshold: 1000
        op: ">"
        period: 10
        count: 2
        message: 'Database access: {name}\n Metric: response time\n Detail: exceeded 1000 ms at least 2 times in the last 10 minutes'
      endpoint_relation_resp_time_rule:
        metrics-name: endpoint_relation_resp_time
        threshold: 1000
        op: ">"
        period: 10
        count: 2
        message: 'Endpoint relation: {name}\n Metric: response time\n Detail: exceeded 1000 ms at least 2 times in the last 10 minutes'
      instance_jvm_old_gc_count_rule:
        metrics-name: instance_jvm_old_gc_count
        threshold: 1
        op: ">"
        period: 1440
        count: 1
        message: 'Instance: {name}\n Metric: Old GC count\n Detail: more than 1 in the last day'
      instance_jvm_young_gc_count_rule:
        metrics-name: instance_jvm_young_gc_count
        threshold: 1
        op: ">"
        period: 5
        # Note: if the metric is checked once per minute, a 5-minute period yields at most
        # 5 matches, so count: 100 may never be reached; verify this rule before relying on it.
        count: 100
        message: 'Instance: {name}\n Metric: Young GC count\n Detail: more than 100 in the last 5 minutes'
      # Requires adding this line to config/oal/core.oal:
      # endpoint_abnormal = from(Endpoint.*).filter(responseCode in [404, 500, 503]).count();
      endpoint_abnormal_rule:
        metrics-name: endpoint_abnormal
        threshold: 1
        op: ">="
        period: 2
        count: 1
        message: 'Endpoint: {name}\n Metric: abnormal responses\n Detail: at least 1 in the last 2 minutes\n'
      # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
      # Because the number of endpoint is much more than service and instance.
      #
      # endpoint_avg_rule:
      #   metrics-name: endpoint_avg
      #   op: ">"
      #   threshold: 1000
      #   period: 10
      #   count: 2
      #   silence-period: 5
      #   message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

    webhooks:
    #  - http://127.0.0.1/notify/
    #  - http://127.0.0.1/go-wechat/

    dingtalkHooks:
      textTemplate: |-
        {
          "msgtype": "text",
          "text": {
            "content": "Apache SkyWalking alarm: \n %s"
          }
        }
      webhooks:
        - url: https://oapi.dingtalk.com/robot/send?access_token=<DingTalk robot access_token>
          secret: <DingTalk robot secret>
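The `secret` entry above belongs to a DingTalk robot that uses signed requests. To test such a robot outside SkyWalking, the signing step can be sketched as below. This follows the scheme described in DingTalk's custom-robot documentation as far as I know it (HMAC-SHA256 over `timestamp + "\n" + secret`, then Base64 and URL-encoding); the function name `sign_dingtalk_url` and the sample values are hypothetical, and this is not SkyWalking's own code.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse
from typing import Optional


def sign_dingtalk_url(base_url: str, secret: str, timestamp_ms: Optional[int] = None) -> str:
    """Return base_url with `timestamp` and `sign` query parameters appended.

    Assumes base_url already carries `?access_token=...`, as in the sample config.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    # The string to sign is the millisecond timestamp, a newline, then the secret.
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    # Base64-encode the raw digest, then URL-encode it for use in a query string.
    sign = urllib.parse.quote_plus(base64.b64encode(digest))
    return f"{base_url}&timestamp={timestamp_ms}&sign={sign}"
```

Passing an explicit `timestamp_ms` makes the result reproducible, which is convenient when comparing against another implementation.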

Alarm overview

Overview

Apache SkyWalking alarms are driven by a set of rules.
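To make the rule fields from the sample configuration (`op`, `threshold`, `period`, `count`, `silence-period`) more concrete, here is a deliberately simplified model of how they interact. It assumes one metric value per minute and is only a sketch for reasoning about rules; the class `RuleModel` is hypothetical and does not reproduce SkyWalking's actual alarm engine.

```python
import operator
from collections import deque

# Comparison operators as they appear in the `op` field of a rule.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}


class RuleModel:
    """Toy model: alarm when at least `count` of the last `period` values match."""

    def __init__(self, op: str, threshold: float, period: int,
                 count: int, silence_period: int) -> None:
        self.compare = OPS[op]
        self.threshold = threshold
        self.count = count
        self.silence_period = silence_period
        self.window = deque(maxlen=period)  # keeps only the last `period` results
        self.silence_left = 0

    def check(self, value: float) -> bool:
        """Feed one per-minute value; return True if an alarm would fire."""
        self.window.append(self.compare(value, self.threshold))
        if self.silence_left > 0:
            # Still silenced after a previous alarm: skip this check.
            self.silence_left -= 1
            return False
        if sum(self.window) >= self.count:
            self.silence_left = self.silence_period
            return True
        return False
```

For example, `RuleModel(">", 1000, 10, 3, 5)` mirrors `service_resp_time_rule`: the third value above 1000 within the window fires an alarm, and the next five checks stay silent.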

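The alarm rules are configured in config/alarm-settings.yml under the SkyWalking server installation directory.

Each rules.xxx_rule.metrics-name in alarm-settings.yml refers to a metric defined in the OAL files under config/oal: core.oal, event.oal, java-agent.oal, and browser.oal.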
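Because a rule only works if its `metrics-name` exists, a quick consistency check can be useful. The sketch below extracts metric names from OAL text with a regular expression and reports rules whose metric is not defined. It assumes OAL definitions have the form `name = from(...)` on one line, as in the `endpoint_abnormal` example; the helper names are hypothetical, and metrics defined outside OAL files would be reported as missing.

```python
import re
from typing import Dict, Set

# Matches lines such as: endpoint_abnormal = from(Endpoint.*)...
# Lines starting with `//` do not match because `/` is not an identifier character.
METRIC_DEF = re.compile(r"^\s*([A-Za-z_]\w*)\s*=\s*from\(", re.MULTILINE)


def defined_metrics(oal_text: str) -> Set[str]:
    """Collect the metric names defined in the given OAL source text."""
    return set(METRIC_DEF.findall(oal_text))


def missing_metrics(rule_metrics: Dict[str, str], oal_text: str) -> Dict[str, str]:
    """Return {rule name: metrics-name} for rules whose metric is not in oal_text."""
    defined = defined_metrics(oal_text)
    return {rule: metric for rule, metric in rule_metrics.items()
            if metric not in defined}
```

In practice `oal_text` would be the concatenated contents of the files under config/oal, and `rule_metrics` would be read from alarm-settings.yml with a YAML parser.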

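Components of the alarm configuration

The alarm configuration has three parts:

  1. Alarm rules: define the conditions under which an alarm is triggered.
  2. WebHooks: the list of service endpoints called when an alarm is triggered.
  3. gRPCHook: the host and port of the remote gRPC method called when an alarm is triggered.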
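As a rough illustration of the WebHooks part, a receiving endpoint could look like the sketch below. It assumes SkyWalking POSTs a JSON array of alarm objects with fields such as `scope`, `name`, `ruleName`, and `alarmMessage`; these field names are from my recollection of the official alarm documentation linked above and should be checked against your version. `summarize_alarms`, `AlarmHandler`, and `make_server` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List


def summarize_alarms(body: bytes) -> List[str]:
    """Turn a webhook request body (JSON array of alarms) into readable lines."""
    alarms = json.loads(body.decode("utf-8"))
    return [
        "[{}] {} ({}): {}".format(
            alarm.get("scope", "?"),
            alarm.get("name", "?"),
            alarm.get("ruleName", "?"),
            alarm.get("alarmMessage", ""),
        )
        for alarm in alarms
    ]


class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        for line in summarize_alarms(self.rfile.read(length)):
            print(line)
        self.send_response(200)
        self.end_headers()


def make_server(port: int = 8080) -> HTTPServer:
    """Create (but do not start) a local server for the `webhooks` list."""
    return HTTPServer(("127.0.0.1", port), AlarmHandler)
```

Calling `make_server(8080).serve_forever()` and adding `http://127.0.0.1:8080/` under `webhooks:` would print each alarm as it arrives.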

Terminology

The entity name (the {name} placeholder in a rule message) depends on the scope:

  • Service: Service name
  • Instance: {Instance name} of {Service name}
  • Endpoint: {Endpoint name} in {Service name}
    • An endpoint is an API entry point, that is, a URL.
    • Endpoint rules consume more memory and resources than service and instance rules.
  • Database: Database service name
  • Service Relation: {Source service name} to {Dest service name}
  • Instance Relation: {Source instance name} of {Source service name} to {Dest instance name} of {Dest service name}
  • Endpoint Relation: {Source endpoint name} in {Source Service name} to {Dest endpoint name} in {Dest service name}
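The scope-to-name mapping above can be written down as a small lookup table, which helps when parsing the `name` field of a received alarm or when building test data. This is only an illustration of the list; the function `entity_name` and its keyword names are hypothetical.

```python
# One format template per scope, following the list above.
ENTITY_NAME_TEMPLATES = {
    "Service": "{service}",
    "Instance": "{instance} of {service}",
    "Endpoint": "{endpoint} in {service}",
    "Database": "{service}",
    "Service Relation": "{source_service} to {dest_service}",
    "Instance Relation": ("{source_instance} of {source_service} to "
                          "{dest_instance} of {dest_service}"),
    "Endpoint Relation": ("{source_endpoint} in {source_service} to "
                          "{dest_endpoint} in {dest_service}"),
}


def entity_name(scope: str, **parts: str) -> str:
    """Build the entity name for a scope from its named parts."""
    return ENTITY_NAME_TEMPLATES[scope].format(**parts)
```

For instance, `entity_name("Endpoint", endpoint="/orders", service="order-service")` gives `/orders in order-service`.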

Rules

The above is only part of the article. For easier maintenance, the full text has moved to: SkyWalking-告警-使用教程 - 自学精灵
