Ambari 2.7.3 cluster alert information: HDFS Storage Capacity Usage (Weekly)

Author: weixin_40725706 | 2024-06-07 02:49:58


Contents

1. Alert levels
2. Alert types
3. Ambari alert descriptions

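1. Alert levels

Alert level	Meaning	Description
`OK`	Normal	The cluster is running well
`WARNING`	Warning	A cluster metric has exceeded its configured threshold and needs attention
`CRITICAL`	Critical	The cluster has a problem that requires further action
`UNKNOWN`	Unknown	The cluster state is unknown
`NONE`	None	None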

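2. Alert types

Type	Purpose	Alert levels	Thresholds configurable	Unit
`PORT`	Checks whether a port on a node is reachable	OK, WARN, CRIT	Yes	Seconds
`METRIC`	Monitors metric-related configuration properties	OK, WARN, CRIT	Yes	Variable
`AGGREGATE`	Aggregates the status of other alerts	OK, WARN, CRIT	Yes	Percent
`WEB`	Checks whether a web (URL) address is reachable	OK, WARN, CRIT	No	None
`SCRIPT`	The alert logic is executed by a custom Python script	OK, WARN, CRIT	No	None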
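
As a rough illustration of how a `PORT`-type check can map a connection time in seconds onto alert levels, here is a minimal Python sketch. It is not Ambari's implementation. The function name `check_port` and the warning threshold of 1.5 seconds are assumptions made for illustration; the critical threshold of 5 seconds mirrors the value shown for several process alerts in the tables of section 3.

```python
import socket
import time


def check_port(host, port, warning_s=1.5, critical_s=5.0):
    """Return (state, elapsed_seconds) for one TCP connect attempt.

    Hypothetical helper: the thresholds are illustrative assumptions.
    """
    start = time.monotonic()
    try:
        # The critical threshold doubles as the connect timeout.
        with socket.create_connection((host, port), timeout=critical_s):
            elapsed = time.monotonic() - start
    except OSError:
        # Refused, unreachable, or timed out: treat as CRITICAL.
        return "CRITICAL", time.monotonic() - start
    if elapsed >= critical_s:
        return "CRITICAL", elapsed
    if elapsed >= warning_s:
        return "WARNING", elapsed
    return "OK", elapsed


if __name__ == "__main__":
    # Port 1 on localhost is normally closed, so this usually reports CRITICAL.
    print(check_port("127.0.0.1", 1))
```

A real deployment would take the host, port, and thresholds from the alert definition rather than from hard-coded defaults.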

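For example, the fields of an alert entry are read as follows:

Field	Meaning	Notes
`Service`	Service	Select a specific component to view its alert status
`Host`	Host ID	ID of the virtual machine that raised the alert
`Status`	Status	Alert level, one of the five levels listed in section 1
`24-Hour`	Alert duration
`Response`	Response	Alert details; click to show the full alert text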
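
The same fields can also be read programmatically. The sketch below filters a JSON payload down to the entries that need attention. The payload shape (an `items` list of `Alert` objects with `service_name`, `host_name`, `state`, and `text`) is an assumption modeled on Ambari's REST alerts resource and should be checked against a real response from your cluster; the sample data and the `summarize` helper are hypothetical.

```python
import json

# Hypothetical sample payload; the field names are assumptions.
SAMPLE = """
{"items": [
  {"Alert": {"service_name": "HDFS", "host_name": "node1.example.com",
             "state": "CRITICAL",
             "text": "Connection failed to node1.example.com (refused)"}},
  {"Alert": {"service_name": "YARN", "host_name": "node2.example.com",
             "state": "OK", "text": "NodeManager is healthy"}}
]}
"""


def summarize(payload, states=("WARNING", "CRITICAL")):
    """Return (service, host, status, response) for alerts in the given states."""
    rows = []
    for item in json.loads(payload).get("items", []):
        alert = item.get("Alert", {})
        if alert.get("state") in states:
            rows.append((alert.get("service_name"), alert.get("host_name"),
                         alert.get("state"), alert.get("text")))
    return rows


if __name__ == "__main__":
    for row in summarize(SAMPLE):
        print(row)
```

Here `service_name`, `host_name`, `state`, and `text` correspond to the Service, Host, Status, and Response fields in the table above.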

3. Ambari alert descriptions

3.1 HDFS

Alert Definition Name	Description	CRITICAL default
HDFS Storage Capacity Usage (Weekly)	This service-level alert is triggered if the increase in storage capacity usage deviation has grown beyond the specified threshold within a week period.	20%
HDFS Storage Capacity Usage (Daily)	This service-level alert is triggered if the increase in storage capacity usage deviation has grown beyond the specified threshold within a day period.	50%
DataNode Unmounted Data Dir	This host-level alert is triggered if one of the data directories on a host was previously on a mount point and became unmounted. If the mount history file does not exist, then report an error if a host has one or more mounted data directories as well as one or more unmounted data directories on the root partition. This may indicate that a data directory is writing to the root partition, which is undesirable.	2 minutes
JournalNode Web UI	This host-level alert is triggered if the JournalNode Web UI is unreachable.	Connection failed to {1} ({3})
DataNode Process	This host-level alert is triggered if the individual DataNode processes cannot be established to be up and listening on the network.	5
DataNode Web UI	This host-level alert is triggered if the DataNode Web UI is unreachable.	Connection failed to {1} ({3})
DataNode Storage	This host-level alert is triggered if storage capacity is full on the DataNode. It checks the DataNode JMX Servlet for the Capacity and Remaining properties. The threshold values are in percent.	80%
DataNode Heap Usage	This host-level alert is triggered if heap usage goes past thresholds on the DataNode. It checks the DataNode JMX Servlet for the MemHeapUsedM and MemHeapMaxM properties. The threshold values are in percent.	90%
HDFS Pending Deletion Blocks	This service-level alert is triggered if the number of blocks pending deletion in HDFS exceeds the configured warning and critical thresholds. It checks the NameNode JMX Servlet for the PendingDeletionBlock property.	100000
NameNode Client RPC Queue Latency (Daily)	This service-level alert is triggered if the deviation of RPC latency on client port has grown beyond the specified threshold within a day period.	200%
NameNode Client RPC Queue Latency (Hourly)	This service-level alert is triggered if the deviation of RPC queue latency on client port has grown beyond the specified threshold within an hour period.	200%
NameNode Client RPC Processing Latency (Daily)	This service-level alert is triggered if the deviation of RPC latency on client port has grown beyond the specified threshold within a day period.	200%
NameNode Client RPC Processing Latency (Hourly)	This service-level alert is triggered if the deviation of RPC latency on client port has grown beyond the specified threshold within an hour period.	200%
DataNode Health Summary	This service-level alert is triggered if there are unhealthy DataNodes.	1
HDFS Upgrade Finalized State	This service-level alert is triggered if HDFS is not in the finalized state.	1
NameNode Blocks Health	This service-level alert is triggered if the number of corrupt or missing blocks exceeds the configured critical threshold.	1
NameNode Web UI	This host-level alert is triggered if the NameNode Web UI is unreachable.	Connection failed to {1} ({3})
NameNode Heap Usage (Weekly)	This service-level alert is triggered if the NameNode heap usage deviation has grown beyond the specified threshold within a week period.	50%
NameNode Heap Usage (Daily)	This service-level alert is triggered if the NameNode heap usage deviation has grown beyond the specified threshold within a day period.	50%
NameNode Last Checkpoint	This service-level alert will trigger if the last time that the NameNode performed a checkpoint was too long ago. It will also trigger if the number of uncommitted transactions is beyond a certain threshold.	200%
NameNode RPC Latency	This host-level alert is triggered if the NameNode RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for NameNode operations. The threshold values are in milliseconds.	500
HDFS Capacity Utilization	This service-level alert is triggered if the HDFS capacity utilization exceeds the configured warning and critical thresholds. It checks the NameNode JMX Servlet for the CapacityUsed and CapacityRemaining properties. The threshold values are in percent.	80%
NameNode Directory Status	This host-level alert is triggered if the NameNode NameDirStatuses metric (name=NameNodeInfo/NameDirStatuses) reports a failed directory. The threshold values are in the number of directories that are not healthy.	1
NameNode Host CPU Utilization	This host-level alert is triggered if CPU utilization of the NameNode exceeds certain warning and critical thresholds. It checks the NameNode JMX Servlet for the SystemCPULoad property. The threshold values are in percent.	250%
ZooKeeper Failover Controller Process	This host-level alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening on the network.	6
NameNode High Availability Health	This service-level alert is triggered if either the Active NameNode or Standby NameNode is not running.	1
Percent DataNodes Available	This alert is triggered if the number of down DataNodes in the cluster is greater than the configured critical threshold. It aggregates the results of DataNode process checks.	30%
Percent DataNodes With Available Space	This service-level alert is triggered if the storage on a certain percentage of DataNodes exceeds either the warning or critical threshold values.	30%
Percent JournalNodes Available	This alert is triggered if the number of down JournalNodes in the cluster is greater than the configured critical threshold. It aggregates the results of JournalNode process checks.	50%
NameNode Service RPC Processing Latency (Daily)	This service-level alert is triggered if the deviation of RPC latency on datanode port has grown beyond the specified threshold within a day period.	200%
NameNode Service RPC Processing Latency (Hourly)	This service-level alert is triggered if the deviation of RPC latency on datanode port has grown beyond the specified threshold within an hour period.	200%
NameNode Service RPC Queue Latency (Daily)	This service-level alert is triggered if the deviation of RPC latency on datanode port has grown beyond the specified threshold within a day period.	200%
NameNode Service RPC Queue Latency (Hourly)	This service-level alert is triggered if the deviation of RPC queue latency on datanode port has grown beyond the specified threshold within an hour period.	200%
Secondary NameNode Process	This host-level alert is triggered if the Secondary NameNode process cannot be confirmed to be up and listening on the network.	Connection failed to {1} ({3})
NFS Gateway Process	This host-level alert is triggered if the NFS Gateway process cannot be confirmed to be up and listening on the network.	5
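
The HDFS Capacity Utilization row above says the alert is derived from the CapacityUsed and CapacityRemaining properties and expressed in percent. One plausible way to compute that figure is sketched below. The formula used / (used + remaining) and the warning threshold of 75% are assumptions made for illustration; only the 80% critical value comes from the table.

```python
def capacity_used_percent(capacity_used, capacity_remaining):
    """Percent of capacity in use, assuming total = used + remaining."""
    total = capacity_used + capacity_remaining
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return 100.0 * capacity_used / total


def classify(value, warning, critical):
    """Map a value onto OK / WARNING / CRITICAL (at or above counts as a breach)."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Hypothetical readings in bytes: 850 GB used, 150 GB remaining.
    percent = capacity_used_percent(850 * 10**9, 150 * 10**9)
    # warning=75 is an assumed value; critical=80 is the table default.
    print(percent, classify(percent, warning=75.0, critical=80.0))
```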

3.2 YARN

Alert Definition Name	Description	CRITICAL default
NodeManager Web UI	This host-level alert is triggered if the NodeManager Web UI is unreachable.	Connection failed to {1} ({3})
NodeManager Health	This host-level alert checks the node health property available from the NodeManager component.	1
ResourceManager Web UI	This host-level alert is triggered if the ResourceManager Web UI is unreachable.	Connection failed to {1} ({3})
ResourceManager CPU Utilization	This host-level alert is triggered if CPU utilization of the ResourceManager exceeds certain warning and critical thresholds. It checks the ResourceManager JMX Servlet for the SystemCPULoad property. The threshold values are in percent.	250%
ResourceManager RPC Latency	This host-level alert is triggered if the ResourceManager operations RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for ResourceManager operations. The threshold values are in milliseconds.	5000
NodeManager Health Summary	This service-level alert is triggered if there are unhealthy NodeManagers.	1
Percent NodeManagers Available	This alert is triggered if the number of down NodeManagers in the cluster is greater than the configured critical threshold. It aggregates the results of NodeManager process checks.	30%
App Timeline Web UI	This host-level alert is triggered if the App Timeline Server Web UI is unreachable.	Connection failed to {1} ({3})
Failed Apps Check	This service-level alert is triggered if the number of failed YARN apps is beyond the specified threshold within a given time span.	2
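
Aggregate alerts such as Percent NodeManagers Available roll per-host process checks up into a single percentage. The sketch below shows one way that roll-up could work. Counting only `CRITICAL` host results as "down" and using a warning threshold of 10% are assumptions; the 30% critical value is the table default.

```python
def percent_down(host_states):
    """Percent of hosts whose process check is CRITICAL (assumed to mean down)."""
    if not host_states:
        return 0.0
    down = sum(1 for state in host_states if state == "CRITICAL")
    return 100.0 * down / len(host_states)


def aggregate_state(host_states, warning=10.0, critical=30.0):
    """Classify the aggregate; warning=10.0 is an illustrative assumption."""
    percent = percent_down(host_states)
    if percent >= critical:
        return "CRITICAL"
    if percent >= warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Hypothetical cluster: 10 NodeManagers, 3 of them down.
    states = ["OK"] * 7 + ["CRITICAL"] * 3
    print(percent_down(states), aggregate_state(states))
```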

3.3 MapReduce2

Alert Definition Name	Description	CRITICAL default
History Server Web UI	This host-level alert is triggered if the History Server Web UI is unreachable.	Connection failed to {1} ({3})
History Server RPC Latency	This host-level alert is triggered if the History Server operations RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for operations. The threshold values are in milliseconds.	5000
History Server CPU Utilization	This host-level alert is triggered if the percent of CPU utilization on the History Server exceeds the configured critical threshold. The threshold values are in percent.	250%
History Server Process	This host-level alert is triggered if the History Server process cannot be established to be up and listening on the network.	5

3.4 Hive

Alert Definition Name	Description	CRITICAL default
WebHCat Server Status	This host-level alert is triggered if the templeton server status is not healthy.	5
HiveServer2 Interactive Process	This host-level alert is triggered if the HiveServerInteractive cannot be determined to be up and responding to client requests.	60
Hive Metastore Process	This host-level alert is triggered if the Hive Metastore process cannot be determined to be up and listening on the network.	60
LLAP Application	This alert is triggered if the LLAP Application cannot be determined to be up and responding to requests.	120
HiveServer2 Process	This host-level alert is triggered if the HiveServer cannot be determined to be up and responding to client requests.	60

3.5 HBase

Alert Definition Name	Description	CRITICAL default
HBase RegionServer Process	This host-level alert is triggered if the RegionServer processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.	5
HBase Master Process	This alert is triggered if the HBase master processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.	5
Percent RegionServers Available	This service-level alert is triggered if the configured percentage of RegionServer processes cannot be determined to be up and listening on the network for the configured warning and critical thresholds. It aggregates the results of RegionServer process down checks.	30%
HBase Master CPU Utilization	This host-level alert is triggered if CPU utilization of the HBase Master exceeds certain warning and critical thresholds. It checks the HBase Master JMX Servlet for the SystemCPULoad property. The threshold values are in percent.	250%

3.6 Oozie

Alert Definition Name	Description	CRITICAL default
Oozie Server Web UI	This host-level alert is triggered if the Oozie server Web UI is unreachable.	Connection failed to {1} ({3})
Oozie Server Status	This host-level alert is triggered if the Oozie server cannot be determined to be up and responding to client requests.	1

3.7 ZooKeeper

Alert Definition Name	Description	CRITICAL default
ZooKeeper Server Process	This host-level alert is triggered if the ZooKeeper server process cannot be determined to be up and listening on the network.	5
Percent ZooKeeper Servers Available	This alert is triggered if the number of down ZooKeeper servers in the cluster is greater than the configured critical threshold. It aggregates the results of ZooKeeper process checks.	70%

3.8 Storm

Alert Definition Name	Description	CRITICAL default
Supervisor Process		5
Percent Supervisors Available		30%
Storm Web UI		Connection failed to {1} ({3})
Nimbus Process		5
DRPC Server Process		5

3.9 Kafka

Alert Definition Name	Description	CRITICAL default
Kafka Broker Process	This host-level alert is triggered if the Kafka Broker cannot be determined to be up.	5

3.10 Spark2

Alert Definition Name	Description	CRITICAL default
Spark2 Livy Server	This host-level alert is triggered if the Livy2 Server cannot be determined to be up.	60
Spark2 History Server	This host-level alert is triggered if the Spark2 History Server cannot be determined to be up.	5
Spark2 Thrift Server	This host-level alert is triggered if the Spark2 Thrift Server cannot be determined to be up.	60

3.11 ElasticSearch

Alert Definition Name	Description	CRITICAL default
ElasticSearch Process Check	This host-level alert is triggered if the ElasticSearch Master cannot be determined to be up.	5

3.12 Hue

Alert Definition Name	Description	CRITICAL default
Hue Web UI	This host-level alert is triggered if the Hue Web UI is unreachable.	5

3.13 Ambari

Alert Definition Name	Description	CRITICAL default
Host Disk Usage	This host-level alert is triggered if the amount of disk space used goes above specific thresholds. The default threshold values are 50% for WARNING and 80% for CRITICAL.	80%
Ambari Agent Distro/conf Select Versions	This host-level alert is triggered if the distro selector such as hdp-select cannot calculate versions available on this host. This may indicate that `/usr/$stack/` directory has links/dirs that do not belong inside of it.	5
Host Disk Usage For Dir '/'	This host-level alert is triggered if the amount of disk space used goes above specific thresholds. The default threshold values are 80% for WARNING and 90% for CRITICAL.	5.0E9 bytes
Host Disk Usage For Dir '/mnt'	This host-level alert is triggered if the amount of disk space used goes above specific thresholds. The default threshold values are 80% for WARNING and 90% for CRITICAL.	5.0E9 bytes
Ambari Agent Heartbeat	This alert is triggered if the server has lost contact with an agent.	2
Ambari Server Alerts	This alert is triggered if the server detects that there are alerts which have not run in a timely manner.	2
Ambari Server Performance	This alert is triggered if the Ambari Server detects that there is a potential performance problem with Ambari. This type of issue can arise for many reasons, but is typically attributed to slow database queries and host resource exhaustion.	5000
Component Version	This alert is triggered if the server detects that there is a problem with the expected and reported version of a component. The alert is suppressed automatically during an upgrade.	5
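
To make the Host Disk Usage thresholds concrete, the following sketch measures disk usage for a path and classifies it with the 50% WARNING and 80% CRITICAL defaults quoted in the description. It is an illustration rather than Ambari's script. The helper names are hypothetical, and treating a value at or above a threshold as a breach is an assumption.

```python
import shutil


def classify(percent, warning=50.0, critical=80.0):
    """Map a used-space percentage onto an alert level.

    Treating 'at or above' as a breach is an assumption.
    """
    if percent >= critical:
        return "CRITICAL"
    if percent >= warning:
        return "WARNING"
    return "OK"


def disk_usage_state(path="/"):
    """Return (state, percent_used) for the filesystem containing path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent = 100.0 * usage.used / usage.total
    return classify(percent), percent


if __name__ == "__main__":
    print(disk_usage_state("/"))
```

For the per-directory variants ('/' and '/mnt'), the same function could be called with `warning=80.0` and `critical=90.0`, matching the defaults quoted in those rows.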

3.14 Ambari Metrics

Alert Definition Name	Description	CRITICAL default
Metrics Monitor Status	This alert indicates the status of the Metrics Monitor process as determined by the monitor status script.	1
Metrics Collector Process	This alert is triggered if the Metrics Collector cannot be confirmed to be up and listening on the configured port for a number of seconds equal to the threshold.	5
Metrics Collector - HBase CPU Utilization	This host-level alert is triggered if CPU utilization of the Metrics Collector's HBase Master exceeds certain warning and critical thresholds. It checks the HBase Master JMX Servlet for the SystemCPULoad property. The threshold values are in percent.	250%
Metrics Collector - Auto-Restart Status	This alert is triggered if the Metrics Collector has been restarted automatically too frequently in the last hour. By default, a Warning alert is triggered if restarted twice in one hour and a Critical alert is triggered if restarted 4 or more times in one hour.	Metrics Collector has been auto-started {1} times{0}.
Percent Metrics Monitors Available	This alert is triggered if a percentage of Metrics Monitor processes are not up and listening on the network for the configured warning and critical thresholds.	30%
Metrics Collector - HBase Master Process	This alert is triggered if the Metrics Collector's HBase master processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.	5
Grafana Web UI	This host-level alert is triggered if the Grafana Web UI is unreachable.	5
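
The Auto-Restart Status rule above (Warning at two restarts in one hour, Critical at four or more) can be expressed as a small function. This is a sketch of the stated defaults only. The description does not say how three restarts are classified, so treating that case as WARNING is an assumption, and `restart_state` is a hypothetical name.

```python
def restart_state(restarts_last_hour):
    """Classify the number of automatic restarts seen in the last hour."""
    if restarts_last_hour < 0:
        raise ValueError("restart count cannot be negative")
    if restarts_last_hour >= 4:
        return "CRITICAL"
    if restarts_last_hour >= 2:
        # Three restarts fall here as well; that choice is an assumption.
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for count in range(6):
        print(count, restart_state(count))
```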

Reference: https://docs.ksyun.com/documents/5812
