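Original source: SkyWalking -- Alarms -- Usage/Tutorial, from the blog of IT利刃出鞘 on CSDN
Overview
This article explains how to use the alarm feature of SkyWalking.
SkyWalking can send alarm notifications through WebHook, gRPC, WeChat, DingTalk, Feishu, and other channels.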
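As a rough illustration of the WebHook channel, the sketch below parses the JSON array that the SkyWalking backend POSTs to a configured webhook URL. The field names (`scope`, `name`, `ruleName`, `alarmMessage`, `startTime`) follow the official alarm documentation as far as I know and should be verified against your SkyWalking version; the function name and the sample payload are made up for this example.

```python
import json
from datetime import datetime, timezone


def summarize_alarms(body):
    """Turn a webhook request body (a JSON array of alarm messages) into text lines."""
    lines = []
    for alarm in json.loads(body):
        # Assumption: startTime is an epoch timestamp in milliseconds.
        started = datetime.fromtimestamp(alarm["startTime"] / 1000, tz=timezone.utc)
        lines.append(
            f"[{started:%Y-%m-%d %H:%M:%S} UTC] {alarm.get('scope', '?')} "
            f"{alarm.get('name', '?')} ({alarm.get('ruleName', '?')}): "
            f"{alarm.get('alarmMessage', '')}"
        )
    return lines


# Hypothetical payload, for illustration only.
sample = json.dumps([{
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "order-service",
    "id0": "b3JkZXI=.1",
    "id1": "",
    "ruleName": "service_resp_time_rule",
    "alarmMessage": "Response time exceeded 1000 ms",
    "startTime": 1560524171000,
}])

for line in summarize_alarms(sample):
    print(line)
```

A real receiver would wrap this function in an HTTP handler and forward the lines to whatever chat or paging tool you use.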
Official documentation
Alarm: https://github.com/apache/skywalking/blob/master/docs/en/setup/backend/backend-alarm.md
OAL rule syntax: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/oal.md
Scopes and fields: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/scope-definitions.md
Events: https://github.com/apache/skywalking/blob/master/docs/en/concepts-and-designs/event.md
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sample alarm rules.
rules:
  # Unique rule name; must end with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: 'Service: {name}\n Metric: response time\n Detail: exceeded 1000 ms at least 3 times (within the last 10 minutes)'
  service_sla_rule:
    # The metrics value must be a long, double, or int.
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time over which the metrics are evaluated.
    period: 10
    # How many times the metrics must match the condition before the alarm is triggered.
    count: 2
    # How many checks the alarm stays silent after it has been triggered; defaults to the same value as period.
    silence-period: 3
    message: 'Service: {name}\n Metric: success rate\n Detail: below 80% at least 2 times (within the last 10 minutes)'
  service_resp_time_percentile_rule:
    # The metrics value must be a long, double, or int.
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    # Matches when at least one condition holds: p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
    message: 'Service: {name}\n Metric: response time\n Detail: a percentile exceeded 1000 ms at least 3 times (within the last 10 minutes)'
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: 'Instance: {name}\n Metric: response time\n Detail: exceeded 1000 ms at least 2 times (within the last 10 minutes)'
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: 'Database access: {name}\n Metric: response time\n Detail: exceeded 1000 ms at least 2 times (within the last 10 minutes)'
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: 'Endpoint relation: {name}\n Metric: response time\n Detail: exceeded 1000 ms at least 2 times (within the last 10 minutes)'
  instance_jvm_old_gc_count_rule:
    metrics-name: instance_jvm_old_gc_count
    threshold: 1
    op: ">"
    period: 1440
    count: 1
    message: 'Instance: {name}\n Metric: Old GC count\n Detail: more than 1 within the last day'
  instance_jvm_young_gc_count_rule:
    metrics-name: instance_jvm_young_gc_count
    threshold: 1
    op: ">"
    period: 5
    count: 100
    message: 'Instance: {name}\n Metric: Young GC count\n Detail: more than 100 within the last 5 minutes'
  # Requires adding this line to config/oal/core.oal: endpoint_abnormal = from(Endpoint.*).filter(responseCode in [404, 500, 503]).count();
  endpoint_abnormal_rule:
    metrics-name: endpoint_abnormal
    threshold: 1
    op: ">="
    period: 2
    count: 1
    message: 'Endpoint: {name}\n Metric: endpoint errors\n Detail: at least 1 occurrence within the last 2 minutes\n'
  # Enabling alarms on endpoint-related metrics costs more memory than service and service instance metrics alarms,
  # because there are far more endpoints than services and instances.
  #
  # endpoint_avg_rule:
  #   metrics-name: endpoint_avg
  #   op: ">"
  #   threshold: 1000
  #   period: 10
  #   count: 2
  #   silence-period: 5
  #   message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/

dingtalkHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking alarm: \n %s"
      }
    }
  webhooks:
    - url: https://oapi.dingtalk.com/robot/send?access_token=<access_token of your DingTalk robot>
      secret: <secret of your DingTalk robot>
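To make the interplay of `threshold`, `period`, `count`, and `silence-period` more concrete, here is a simplified Python model based only on the comments in the sample file above. It assumes one metric value arrives per minute, the last `period` values form the window, an alarm fires when at least `count` of them match the condition, and the rule then stays quiet for `silence-period` checks. This is a sketch for intuition, not SkyWalking's actual implementation; the class and method names are invented.

```python
import operator
from collections import deque

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}


class RuleModel:
    """Toy model of one alarm rule evaluated once per minute."""

    def __init__(self, op, threshold, period, count, silence_period=None):
        self.matches = OPS[op]
        self.threshold = threshold
        self.count = count
        # Per the sample file comment, silence-period defaults to period.
        self.silence_period = period if silence_period is None else silence_period
        self.window = deque(maxlen=period)  # keeps only the last `period` values
        self.silence_left = 0

    def check(self, value):
        """Feed one per-minute value; return True if an alarm fires on this check."""
        self.window.append(value)
        hits = sum(1 for v in self.window if self.matches(v, self.threshold))
        if hits >= self.count and self.silence_left == 0:
            self.silence_left = self.silence_period
            return True
        # Every non-firing check counts down the remaining silence.
        self.silence_left = max(0, self.silence_left - 1)
        return False


# Same numbers as service_resp_time_rule: > 1000, period 10, count 3, silence 5.
rule = RuleModel(op=">", threshold=1000, period=10, count=3, silence_period=5)
values = [1200, 1500, 800, 1300, 1400, 1100, 1250, 1600, 1700, 1800, 1900]
fired = [rule.check(v) for v in values]
print(fired)  # True on the 4th and 10th check, False otherwise
```

In this model a rule whose `count` is larger than its `period` can never fire, because the window never holds more than `period` values. If the real evaluator behaves the same way, a rule such as `instance_jvm_young_gc_count_rule` (`period: 5`, `count: 100`) would need its `threshold` and `count` revisited.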
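Notes
Apache SkyWalking alarms are driven by a set of rules.
The alarm rules are configured in <SkyWalking server installation path>/config/alarm-settings.yml.
Each rules.xxx_rule.metrics-name value in alarm-settings.yml refers to a metric defined in the OAL scripts under config/oal: core.oal, event.oal, java-agent.oal, and browser.oal.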
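One way to sanity-check this mapping is to collect the metric names defined in the OAL scripts and compare them with the `metrics-name` values used by the rules. The sketch below does this on inline sample text with simple regular expressions. The helper names are hypothetical, the regex only covers the plain `name = from(...)` form, and in practice you would read the real files under config/oal and parse alarm-settings.yml with a YAML library.

```python
import re

# Matches the start of an OAL statement such as `service_sla = from(...)`.
OAL_METRIC = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*from\(", re.MULTILINE)
# Matches `metrics-name: xxx` lines; lines commented out with `#` do not match.
RULE_METRIC = re.compile(r"^\s*metrics-name:\s*(\S+)\s*$", re.MULTILINE)


def defined_metrics(oal_text):
    """Return the set of metric names declared in the given OAL text."""
    return set(OAL_METRIC.findall(oal_text))


def undefined_rule_metrics(alarm_yaml_text, oal_text):
    """Return metrics-name values that have no matching OAL definition."""
    known = defined_metrics(oal_text)
    return [name for name in RULE_METRIC.findall(alarm_yaml_text) if name not in known]


# Illustrative snippets, not the full shipped files.
oal_sample = """
service_resp_time = from(Service.latency).longAvg();
service_sla = from(Service.*).percent(status == true);
// endpoint_abnormal is not defined here yet
"""
alarm_sample = """
rules:
  service_resp_time_rule:
    metrics-name: service_resp_time
  endpoint_abnormal_rule:
    metrics-name: endpoint_abnormal
  # commented_rule:
  #   metrics-name: endpoint_avg
"""

print(undefined_rule_metrics(alarm_sample, oal_sample))  # ['endpoint_abnormal']
```

A non-empty result points at rules such as `endpoint_abnormal_rule`, whose metric has to be added to an OAL script before the rule can work.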
Components of an alarm rule
An alarm rule definition consists of three parts:
Terminology
Defines the relation between scope and entity name.
The above is only part of the article. For ease of maintenance, the full text has been moved to: SkyWalking-告警-使用教程 - 自学精灵