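SkyWalking periodically evaluates the collected trace data against the configured alarm rules (for example service response time or service response-time percentile) and sends an alarm when a threshold is reached. Alarms are delivered by calling a webhook endpoint. Users define that endpoint themselves, so developers can implement any notification channel in it, such as email or SMS. Alarm messages can also be viewed in RocketBot.
The default alarm rules are shown below. They are defined in the alarm-settings.yml file in the config folder under the SkyWalking installation directory: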
    rules:
      # Rule unique name, must be ended with `_rule`.
      service_resp_time_rule:
        metrics-name: service_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 3
        silence-period: 5
        message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
      service_sla_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_sla
        op: "<"
        threshold: 8000
        # The length of time to evaluate the metrics
        period: 10
        # How many times after the metrics match the condition, will trigger alarm
        count: 2
        # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
        silence-period: 3
        message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
      service_p90_sla_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_p90
        op: ">"
        threshold: 1000
        period: 10
        count: 3
        silence-period: 5
        message: 90% response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes
      service_instance_resp_time_rule:
        metrics-name: service_instance_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 2
        silence-period: 5
        message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
      # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
      # Because the number of endpoint is much more than service and instance.
      #
      # endpoint_avg_rule:
      #   metrics-name: endpoint_avg
      #   op: ">"
      #   threshold: 1000
      #   period: 10
      #   count: 2
      #   silence-period: 5
      #   message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

    webhooks:
    #  - http://127.0.0.1/notify/
    #  - http://127.0.0.1/go-wechat/
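The file above defines four default rules: service_resp_time_rule, service_sla_rule, service_p90_sla_rule and service_instance_resp_time_rule. A fifth rule, endpoint_avg_rule, is commented out.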
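Property reference
Property | Meaning |
---|---|
metrics-name | Name of the metric as defined in the OAL script |
threshold | Threshold value, interpreted together with metrics-name and the comparison operator below |
op | Comparison operator; one of >, <, = |
period | Length of the time window, in minutes, over which the metric is checked against the rule |
count | Number of times the condition must match within the period before an alarm is sent |
silence-period | How long identical alarms are suppressed after an alarm has been sent |
message | Content of the alarm message |
include-names | List of services this rule applies to |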
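To make the interplay of op, threshold, period, count and silence-period concrete, here is a minimal sketch of how one rule could be evaluated once per minute. It is a simplified model written for this article, not SkyWalking's actual implementation; the class name `AlarmRuleSketch` and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified model of a single alarm rule (assumption: one check per minute). */
class AlarmRuleSketch {
    private final String op;          // ">", "<" or "="
    private final double threshold;   // value the metric is compared against
    private final int period;         // window length, in checks (minutes)
    private final int count;          // matches needed inside the window
    private final int silencePeriod;  // checks to stay silent after firing
    private final Deque<Boolean> window = new ArrayDeque<>();
    private int silenceLeft = 0;

    AlarmRuleSketch(String op, double threshold, int period, int count, int silencePeriod) {
        this.op = op;
        this.threshold = threshold;
        this.period = period;
        this.count = count;
        this.silencePeriod = silencePeriod;
    }

    /** Feeds the metric value for one minute; returns true if an alarm fires on this check. */
    boolean check(double value) {
        // Keep only the last `period` match results.
        window.addLast(matches(value));
        if (window.size() > period) {
            window.removeFirst();
        }
        // While silenced, swallow the check without firing.
        if (silenceLeft > 0) {
            silenceLeft--;
            return false;
        }
        long matched = window.stream().filter(m -> m).count();
        if (matched >= count) {
            silenceLeft = silencePeriod;
            return true;
        }
        return false;
    }

    private boolean matches(double value) {
        switch (op) {
            case ">": return value > threshold;
            case "<": return value < threshold;
            case "=": return value == threshold;
            default: throw new IllegalArgumentException("unsupported op: " + op);
        }
    }

    public static void main(String[] args) {
        // Same numbers as service_resp_time_rule in the default configuration.
        AlarmRuleSketch rule = new AlarmRuleSketch(">", 1000, 10, 3, 5);
        for (int minute = 1; minute <= 10; minute++) {
            System.out.println("minute " + minute + " fired=" + rule.check(1500));
        }
    }
}
```

In this model, a service that constantly responds in 1500 ms triggers the alarm on the third check, stays silent for the next five checks, and fires again on the ninth.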
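The webhooks section configures the URLs that are called when an alarm is raised.
To test this, write an alarm endpoint in a new project named skywalking_alarm.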
AlarmController
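    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class AlarmController {

        // Sleep 1.5 seconds on every call to simulate a timeout that triggers the alarm
        @GetMapping("/timeout")
        public String timeout() {
            try {
                Thread.sleep(1500);
            } catch (InterruptedException e) {
                // Restore the interrupt flag so the container can handle it
                Thread.currentThread().interrupt();
            }
            return "timeout";
        }
    }

This endpoint simulates a timeout; calling it repeatedly produces alarm messages.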
WebHooks
    import com.sf.saas.skywalking_alarm.pojo.AlarmMessage;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RequestBody;
    import org.springframework.web.bind.annotation.RestController;

    import java.util.ArrayList;
    import java.util.List;

    @RestController
    public class WebHooks {

        // Most recent batch of alarms pushed by SkyWalking
        private List<AlarmMessage> lastList = new ArrayList<>();

        // Called by SkyWalking when alarms are raised
        @PostMapping("/webhook")
        public void webhook(@RequestBody List<AlarmMessage> alarmMessageList) {
            lastList = alarmMessageList;
        }

        // Returns the last received batch so it can be inspected in a browser
        @GetMapping("/show")
        public List<AlarmMessage> show() {
            return lastList;
        }
    }

When an alarm is raised, SkyWalking calls the webhook endpoint. The endpoint must accept POST requests, and its parameter must be bound with @RequestBody. The request body has the following format:
    [{
        "scopeId": 1,
        "scope": "SERVICE",
        "name": "serviceA",
        "id0": 12,
        "id1": 0,
        "ruleName": "service_resp_time_rule",
        "alarmMessage": "alarmMessage xxxx",
        "startTime": 1560524171000
    }, {
        "scopeId": 1,
        "scope": "SERVICE",
        "name": "serviceB",
        "id0": 23,
        "id1": 0,
        "ruleName": "service_resp_time_rule",
        "alarmMessage": "alarmMessage yyy",
        "startTime": 1560524171000
    }]
AlarmMessage
    public class AlarmMessage {

        private int scopeId;
        private String name;
        private int id0;
        private int id1;
        // Text of the alarm message
        private String alarmMessage;
        // Time the alarm was raised (epoch milliseconds)
        private long startTime;

        public int getScopeId() {
            return scopeId;
        }

        public void setScopeId(int scopeId) {
            this.scopeId = scopeId;
        }

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public int getId0() {
            return id0;
        }

        public void setId0(int id0) {
            this.id0 = id0;
        }

        public int getId1() {
            return id1;
        }

        public void setId1(int id1) {
            this.id1 = id1;
        }

        public String getAlarmMessage() {
            return alarmMessage;
        }

        public void setAlarmMessage(String alarmMessage) {
            this.alarmMessage = alarmMessage;
        }

        public long getStartTime() {
            return startTime;
        }

        public void setStartTime(long startTime) {
            this.startTime = startTime;
        }

        @Override
        public String toString() {
            return "AlarmMessage{" +
                    "scopeId=" + scopeId +
                    ", name='" + name + '\'' +
                    ", id0=" + id0 +
                    ", id1=" + id1 +
                    ", alarmMessage='" + alarmMessage + '\'' +
                    ", startTime=" + startTime +
                    '}';
        }
    }

This entity class is used to receive the alarm information.
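The sample payload also carries `scope` and `ruleName`, which the entity class does not map. If the webhook handler needs them, for example to route alarms by rule, the two properties could be added in the same getter/setter style. The sketch below shows them in a standalone class so that it compiles on its own; the class name `AlarmMessageExtras` and the `routingKey()` helper are hypothetical.

```java
/** Hypothetical holder for the two payload fields the entity class omits. */
class AlarmMessageExtras {

    // Scope of the alarmed entity, e.g. "SERVICE" in the sample payload
    private String scope;
    // Name of the rule that fired, e.g. "service_resp_time_rule"
    private String ruleName;

    public String getScope() {
        return scope;
    }

    public void setScope(String scope) {
        this.scope = scope;
    }

    public String getRuleName() {
        return ruleName;
    }

    public void setRuleName(String ruleName) {
        this.ruleName = ruleName;
    }

    /** Short label for logging or routing, built from scope and rule name. */
    public String routingKey() {
        return scope + "/" + ruleName;
    }
}
```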
First, edit the alarm rule configuration file and change the webhook address to:
webhooks:
- http://127.0.0.1:8089/webhook
Then restart SkyWalking.
1. Upload skywalking_alarm.jar to the /usr/local/skywalking directory.
2. Start the skywalking_alarm application and wait until it has started successfully.
    java -javaagent:/usr/local/skywalking/apache-skywalking-apm-bin/agent/skywalking-agent.jar -Dskywalking.agent.service_name=skywalking_alarm -jar skywalking_alarm.jar
3. Call the endpoint repeatedly. Its address is: http://<VM IP>:8089/timeout
4. Keep calling it until an alarm appears.
5. View the alarm information at: http://<VM IP>:8089/show
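Step 3 can be automated instead of refreshing a browser. The following is a minimal sketch using the JDK 11+ `java.net.http.HttpClient`; the class name `TimeoutCaller`, the default host `127.0.0.1` and the four-minute duration are assumptions chosen for illustration. Because the default rule messages count matching minutes ("3 minutes of last 10 minutes"), the sketch keeps calling for several minutes rather than a fixed number of times.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical helper that keeps calling the /timeout endpoint for a while. */
class TimeoutCaller {

    /** Builds the endpoint address used in step 3 for the given host. */
    static URI timeoutUri(String host) {
        return URI.create("http://" + host + ":8089/timeout");
    }

    /** Calls the endpoint back to back until the duration has elapsed; returns the number of HTTP 200 responses. */
    static int callFor(HttpClient client, URI uri, Duration duration) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        long deadline = System.nanoTime() + duration.toNanos();
        int ok = 0;
        while (System.nanoTime() < deadline) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 200) {
                ok++;
            }
        }
        return ok;
    }

    public static void main(String[] args) throws Exception {
        // Pass the virtual machine's IP as the first argument; defaults to localhost.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int ok = callFor(HttpClient.newHttpClient(), timeoutUri(host), Duration.ofMinutes(4));
        System.out.println(ok + " calls returned HTTP 200");
    }
}
```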
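As the response of the /show endpoint shows, the alarm information has been received. In production, the webhook endpoint can be connected to SMS, email or similar platforms, so that the responsible staff are notified as soon as an alarm is raised and faults are handled faster.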
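As a rough illustration of connecting the webhook to notification channels, the sketch below formats each alarm into one line of text and hands it to every registered channel. The `Notifier` interface, the `Alarm` stand-in class and the text format are all assumptions made for this example; a real email or SMS integration would implement `Notifier` with the respective client library.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical dispatch layer that a webhook handler could delegate to. */
class AlarmDispatchSketch {

    /** Hypothetical abstraction over a delivery channel such as email or SMS. */
    interface Notifier {
        void send(String text);
    }

    /** Minimal stand-in for the AlarmMessage fields used here. */
    static class Alarm {
        final String name;
        final String alarmMessage;
        final long startTime; // epoch milliseconds, as in the webhook payload

        Alarm(String name, String alarmMessage, long startTime) {
            this.name = name;
            this.alarmMessage = alarmMessage;
            this.startTime = startTime;
        }
    }

    /** Renders one alarm as a single line with a UTC timestamp. */
    static String format(Alarm alarm) {
        return "[" + Instant.ofEpochMilli(alarm.startTime) + "] "
                + alarm.name + ": " + alarm.alarmMessage;
    }

    /** Sends every alarm to every notifier. */
    static void dispatch(List<Alarm> alarms, List<Notifier> notifiers) {
        for (Alarm alarm : alarms) {
            String text = format(alarm);
            for (Notifier notifier : notifiers) {
                notifier.send(text);
            }
        }
    }

    public static void main(String[] args) {
        // In-memory notifier standing in for a real channel.
        List<String> outbox = new ArrayList<>();
        Notifier inMemory = outbox::add;
        dispatch(
                Arrays.asList(new Alarm("serviceA", "alarmMessage xxxx", 1560524171000L)),
                Arrays.asList(inMemory));
        outbox.forEach(System.out::println);
    }
}
```

Keeping the channels behind one small interface means the webhook controller does not change when a channel is added or replaced.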