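Kafka itself does not ship a good way to monitor consumer lag. The broker distribution does include scripts such as kafka-topics.sh, kafka-consumer-groups.sh, and kafka-console-consumer.sh, but collecting lag by running scripts is very unreliable on a large production cluster.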
LinkedIn's Data Infrastructure Streaming SRE team actively develops Burrow. It is written in Go, released under the Apache license, and hosted on GitHub at linkedin/Burrow.
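The lag information Burrow collects is broken down into a status for each partition and then distilled into a single status for the consumer. That status is OK, WARNING (the consumer is working but falling behind), or ERROR (the consumer has stopped consuming or is offline). You can fetch the status with a simple HTTP request to Burrow, or let Burrow check it periodically and push it out through a notifier, either by email or to a separate HTTP endpoint such as a monitoring or alerting system.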
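To make the roll-up from partition statuses to one consumer status concrete, here is a minimal sketch. It assumes a simple "worst status wins" rule over the three states named above; the names `SEVERITY` and `rollup` are hypothetical, and Burrow's real evaluation rules are more detailed than this.

```python
# Illustrative only: a "worst status wins" roll-up, not Burrow's actual rules.
SEVERITY = {"OK": 0, "WARNING": 1, "ERROR": 2}


def rollup(partition_statuses):
    """Return the most severe status among the partitions (OK if there are none)."""
    return max(partition_statuses, key=SEVERITY.__getitem__, default="OK")


if __name__ == "__main__":
    # One lagging partition is enough to mark the whole consumer as WARNING.
    print(rollup(["OK", "WARNING", "OK"]))
```

A single consumer-level status like this is what makes alerting simple: the monitoring system only needs to look at one value per group.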
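Burrow monitors how far consumers lag behind, which in turn reflects the health of the consuming application, and it can watch several Kafka clusters at once. The HTTP service that reports information about Kafka clusters and consumers is separate from the lag status evaluation, which makes it useful for applications that help manage Kafka clusters in environments where a Java Kafka client cannot be run.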
Burrow is developed in Go, and v1.1 has been released; the downloads listed below also include v1.2.2 builds.
Burrow is also available as a Docker image.
Burrow_1.2.2_checksums.txt 297 Bytes
Burrow_1.2.2_darwin_amd64.tar.gz 4.25 MB
Burrow_1.1.0_linux_amd64.tar.gz 3.22 MB (CentOS 6)
Burrow_1.2.2_linux_amd64.tar.gz 4.31 MB (CentOS 7, requires GLIBC >= 2.14)
Burrow_1.2.2_windows_amd64.tar.gz 4 MB
Burrow keeps no state on local disk, and it is both CPU-intensive and network-I/O-intensive.
- # wget https://github.com/linkedin/Burrow/releases/download/v1.1.0/Burrow_1.1.0_linux_amd64.tar.gz
- # mkdir burrow
- # tar -xf Burrow_1.1.0_linux_amd64.tar.gz -C burrow
- # cp burrow/burrow /usr/bin/
- # mkdir /etc/burrow
- # cp burrow/config/* /etc/burrow/
- # chkconfig --add burrow
- # /etc/init.d/burrow start
- [general]
- pidfile="/var/run/burrow.pid"
- stdout-logfile="/var/log/burrow.log"
- access-control-allow-origin="mysite.example.com"
- [logging]
- filename="/var/log/burrow.log"
- level="info"
- maxsize=512
- maxbackups=30
- maxage=10
- use-localtime=true
- use-compression=true
- [zookeeper]
- servers=[ "test1.localhost:2181","test2.localhost:2181" ]
- timeout=6
- root-path="/burrow"
- [client-profile.prod]
- client-id="burrow-lagchecker"
- kafka-version="0.10.0"
- [cluster.production]
- class-name="kafka"
- servers=[ "test1.localhost:9092","test2.localhost:9092" ]
- client-profile="prod"
- topic-refresh=180
- offset-refresh=30
- [consumer.production_kafka]
- class-name="kafka"
- cluster="production"
- servers=[ "test1.localhost:9092","test2.localhost:9092" ]
- client-profile="prod"
- start-latest=false
- group-blacklist="^(console-consumer-|python-kafka-consumer-|quick-|test).*$"
- group-whitelist=""
- [consumer.production_consumer_zk]
- class-name="kafka_zk"
- cluster="production"
- servers=[ "test1.localhost:2181","test2.localhost:2181" ]
- #zookeeper-path="/"
- # If specified, this is the root of the Kafka cluster metadata in the Zookeeper ensemble. If not specified, the root path is used.
- zookeeper-timeout=30
- group-blacklist="^(console-consumer-|python-kafka-consumer-|quick-|test).*$"
- group-whitelist=""
- [httpserver.default]
- address=":8000"
- [storage.default]
- class-name="inmemory"
- workers=20
- intervals=15
- expire-group=604800
- min-distance=1
- #[notifier.default]
- #class-name="http"
- #url-open=""
- #interval=60
- #timeout=5
- #keepalive=30
- #extras={ api_key="REDACTED", app="burrow", tier="STG", fabric="mydc" }
- #template-open="/etc/burrow/default-http-post.tmpl"
- #template-close="/etc/burrow/default-http-delete.tmpl"
- #method-close="DELETE"
- #send-close=false
- ##send-close=true
- #threshold=1

- #!/bin/bash
- #
- # Comments to support chkconfig
- # chkconfig: - 98 02
- # description: Burrow is kafka lag check_program by LinkedIn, Inc.
- #
- # Source function library.
- . /etc/init.d/functions
- ### Default variables
- prog_name="burrow"
- prog_path="/usr/bin/${prog_name}"
- pidfile="/var/run/${prog_name}.pid"
- options="-config-dir /etc/burrow/"
- # Check if requirements are met
- [ -x "${prog_path}" ] || exit 1
- start(){
- echo -n $"Starting $prog_name: "
- PID=$(pidofproc -p "$pidfile" "$prog_name")
- if [ -z "$PID" ]; then
- # Not running yet: launch in the background and give it a moment to write its pidfile
- $prog_path $options > /dev/null 2>&1 &
- [ ! -e "$pidfile" ] && sleep 1
- fi
- [ -z "$PID" ] && PID=$(pidof "${prog_path}")
- # Success means we have a PID, the pidfile exists, and the process is alive
- if [ -n "$PID" ] && [ -f "$pidfile" ] && [ -d "/proc/$PID" ]; then
- RETVAL=0
- echo_success
- touch /var/lock/subsys/$prog_name
- else
- RETVAL=1
- echo_failure
- fi
- echo
- return $RETVAL
- }
- stop(){
- echo -n $"Shutting down $prog_name: "
- # Keep killproc's exit status so the lock file is removed only on success
- killproc -p ${pidfile} $prog_name
- RETVAL=$?
- echo
- [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/$prog_name
- return $RETVAL
- }
- restart() {
- stop
- start
- }
- case "$1" in
- start)
- start
- ;;
- stop)
- stop
- ;;
- restart)
- restart
- ;;
- status)
- status $prog_path
- ;;
- *)
- echo $"Usage: $0 {start|stop|restart|status}"
- esac
- exit $RETVAL

The default configuration file is burrow.toml.
GET /v3/kafka/(cluster)/consumer
Every Burrow endpoint returns a JSON object, which makes the output easy to collect and process downstream.
- GET /v3/kafka/(cluster)/consumer/(group)/status
- GET /v3/kafka/(cluster)/consumer/(group)/lag
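As a sketch of how the status endpoint might be consumed, the snippet below fetches the JSON and reduces it to the few values a monitoring job usually needs. The field names (`error`, `message`, `status`, `totallag`, `partitions`) follow the v3 response layout as I understand it and should be verified against your Burrow version; `fetch_json`, `summarize_status`, and `SAMPLE` are hypothetical names.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=5.0):
    """GET a Burrow endpoint and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_status(payload):
    """Reduce a consumer status response to group, status, total lag, and non-OK partitions."""
    if payload.get("error"):
        raise RuntimeError(payload.get("message", "Burrow reported an error"))
    status = payload["status"]
    not_ok = [
        (p["topic"], p["partition"], p["status"])
        for p in status.get("partitions") or []
        if p.get("status") != "OK"
    ]
    return {
        "group": status.get("group"),
        "status": status.get("status"),
        "total_lag": status.get("totallag"),
        "not_ok": not_ok,
    }


# Hand-written sample in the assumed response shape (not captured from a real server).
SAMPLE = {
    "error": False,
    "message": "consumer status returned",
    "status": {
        "cluster": "production",
        "group": "billing",
        "status": "WARNING",
        "totallag": 1200,
        "partitions": [
            {"topic": "orders", "partition": 0, "status": "OK"},
            {"topic": "orders", "partition": 1, "status": "WARNING"},
        ],
    },
}

if __name__ == "__main__":
    # Against a live server this would be something like:
    # summarize_status(fetch_json("http://localhost:8000/v3/kafka/production/consumer/billing/status"))
    print(summarize_status(SAMPLE))
```

The port 8000 and the cluster name `production` in the commented URL come from the configuration shown earlier; the group name `billing` is made up.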
GET /v3/kafka/(cluster)/topic
GET /v3/kafka/(cluster)/topic/(topic)
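The endpoints above all share the `/v3/kafka/(cluster)/...` shape, so a small helper can build them safely. The sketch below also maps a topic detail response to per-partition offsets; it assumes that response carries an `offsets` array indexed by partition number, which should be confirmed against your Burrow version. `burrow_path` and `partition_offsets` are hypothetical names.

```python
from urllib.parse import quote


def burrow_path(cluster, *parts):
    """Build a /v3/kafka/... path, percent-encoding each segment."""
    segments = ["v3", "kafka", cluster, *parts]
    return "/" + "/".join(quote(segment, safe="") for segment in segments)


def partition_offsets(payload):
    """Map partition number to offset, assuming an 'offsets' array indexed by partition."""
    if payload.get("error"):
        raise RuntimeError(payload.get("message", "Burrow reported an error"))
    return dict(enumerate(payload.get("offsets", [])))


if __name__ == "__main__":
    print(burrow_path("production", "topic"))
    print(burrow_path("production", "consumer", "billing", "lag"))
    print(partition_offsets({"error": False, "offsets": [100, 250, 75]}))
```

Encoding each segment separately keeps group or topic names that contain slashes or spaces from changing the path structure.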
