The previous article covered monitoring a big-data cluster with Prometheus and Grafana. As the cluster keeps growing, however, so does the pressure on Prometheus: because it collects metrics with a pull model, the load lands on the Prometheus server itself. So what does Prometheus offer to solve this, and how do other monitoring systems deal with it?
Readers familiar with Zabbix may know that it has an active mode and a passive mode. In active mode the agent reports to the server on its own, which takes pressure off the server side. Passive mode can also add proxy nodes to build distributed monitoring and reach targets in other data centers. For details, see my earlier Zabbix monitoring articles.
Prometheus's distributed setup resembles Nginx load balancing: the master node's configuration file lists an address pool of slave nodes, and the master simply pulls data from those slaves on a schedule. The master's job is to store the live data and serve it to Grafana, while each slave pulls data from its own set of collection endpoints, split either by machine or by role. This pattern, in which one central Prometheus is responsible for aggregating the data of several other Prometheus servers, is called a Prometheus federation.
For example, with a 200-machine cluster and two slave nodes, each slave can monitor 100 machines; or each can monitor all 200 machines but cover different roles, say the first slave monitors HDFS and the second monitors HBase. How you split the work is entirely up to your own configuration.
Deploying the distributed setup just means installing Prometheus on several machines. The installation is identical everywhere; only the configuration files differ.
The core of federation is that every Prometheus server exposes an endpoint, /federate, for fetching the monitoring samples held by that instance. To the central Prometheus server, pulling data from another Prometheus instance is in fact no different from pulling it from node_exporter or any other collection endpoint.
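As a quick illustration of that interface, you can query a slave's /federate endpoint by hand. A minimal sketch, assuming one slave listens on 192.168.1.1:9090 (the address used in the master config below) and already scrapes a job named zookeeper:

```bash
# Pull all series of the 'zookeeper' job, plus any metric whose name
# starts with "instance", from one slave's /federate endpoint.
# -G turns the --data-urlencode values into URL query parameters.
curl -G 'http://192.168.1.1:9090/federate' \
  --data-urlencode 'match[]={job="zookeeper"}' \
  --data-urlencode 'match[]={__name__=~"instance.*"}'
```

The response is plain text in the usual exposition format, which is exactly why the master can scrape it like any other target.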
Parameter | Purpose
---|---
honor_labels | Prevents conflicts between collected metric labels. Set to true, the labels are kept exactly as scraped when a conflict occurs (conflicting labels are honored rather than rewritten); set to false, conflicting labels are renamed to the exported_ form. Extra labels can also be added here to distinguish monitoring targets.
metrics_path | Set to /federate, the endpoint a federated setup uses to fetch monitoring samples.
match[] | Specifies which time series to fetch; in practice this means listing the slaves' role labels or other matchers, e.g. {job="zookeeper"} or {__name__=~"instance.*"} (regex matchers allow fuzzy matching). Only the roles or series written into the master's match[] list can be pulled from the slaves; anything else is not fetched.
static_configs | The address pool of the slave nodes goes here.
The master node's configuration file is shown below:
```yaml
# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  # - job_name: 'prometheus'
  #   metrics_path defaults to '/metrics'
  #   scheme defaults to 'http'.
  #   static_configs:
  #     - targets: ['localhost:9090']

  - job_name: 'node_workers'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="zookeeper"}'
        - '{job="hbase"}'
        - '{job="hdfs"}'
        - '{job="hive"}'
        - '{job="kafka"}'
        - '{job="linux_server"}'
        - '{job="spark"}'
        - '{job="yarn"}'
        - '{__name__=~"instance.*"}'
    static_configs:
      - targets:
          - '192.168.1.1:9090'
          - '192.168.1.2:9090'
```
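Before reloading the master, it can be worth validating the file first. A minimal sketch using promtool, which ships alongside the Prometheus binary; the /-/reload endpoint is an assumption that only applies if the server was started with --web.enable-lifecycle (sending SIGHUP to the process works otherwise):

```bash
# Check the configuration file for syntax errors.
promtool check config prometheus.yml

# Ask the running server to reload its configuration
# (requires the --web.enable-lifecycle startup flag; SIGHUP also works).
curl -X POST 'http://localhost:9090/-/reload'
```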
Slave node configuration files:
Slave node 1:
```yaml
# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'linux_server'
    file_sd_configs:
      - files:
          - configs/linux.json

  - job_name: 'hdfs'
    file_sd_configs:
      - files:
          - configs/hdfs.json

  - job_name: 'hbase'
    file_sd_configs:
      - files:
          - configs/hbase.json

  - job_name: 'yarn'
    file_sd_configs:
      - files:
          - configs/yarn.json
```
Slave node 2:
```yaml
# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'zookeeper'
    file_sd_configs:
      - files:
          - configs/zookeeper.json

  - job_name: 'hive'
    file_sd_configs:
      - files:
          - configs/hive.json

  - job_name: 'kafka'
    file_sd_configs:
      - files:
          - configs/kafka.json

  - job_name: 'spark'
    file_sd_configs:
      - files:
          - configs/spark.json
```
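The configs/*.json files referenced by file_sd_configs above follow Prometheus's file-based service-discovery format: a JSON array of target groups. A minimal sketch of what configs/linux.json might look like; the addresses and the env label are illustrative assumptions, not values from the original setup:

```json
[
  {
    "targets": ["192.168.1.10:9100", "192.168.1.11:9100"],
    "labels": {
      "env": "bigdata-cluster"
    }
  }
]
```

Prometheus watches these files and picks up target changes without a restart, which is what makes them convenient on the slaves.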
Prometheus stores data locally by default, which keeps management simple and avoids the network-bandwidth cost of remote storage. Local storage has downsides, though. First is data persistence: in a dynamic cluster environment such as Kubernetes, if a Prometheus instance gets rescheduled, all of its historical monitoring data is lost. Second, local storage means Prometheus is not suited to keeping large amounts of history (the general recommendation is to retain only a few weeks or months of data). Finally, local storage keeps Prometheus from scaling elastically. To meet these needs, Prometheus provides the remote_write and remote_read features, which store data in and read data from a remote endpoint. By separating sample collection from data storage, they solve Prometheus's persistence problem.
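A minimal sketch of how those two features appear in prometheus.yml; the URL here is a hypothetical placeholder for whatever remote storage adapter you actually run:

```yaml
# Ship every collected sample to a remote store, and let queries
# read historical data back from it.
remote_write:
  - url: 'http://remote-storage.example:8086/api/v1/prom/write?db=prometheus'

remote_read:
  - url: 'http://remote-storage.example:8086/api/v1/prom/read?db=prometheus'
```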