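As Kafka has evolved, it has itself become a complex distributed system that, like ZooKeeper, provides consistency services to the outside world. Maintaining a second distributed system, ZooKeeper, inside Kafka is thankless work in its own right, to say nothing of ZooKeeper's various problems, limitations, and bottlenecks. So the Kafka developers proposed Kafka without ZooKeeper: kick ZooKeeper out of the Kafka system and let Kafka manage all of its information and data itself. The article even carries a title that pays homage to the founding father: Apache Kafka Made Simple.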
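The developers call this new mode Kafka Raft Metadata mode. I guess they also noticed that ZAB looks more like Raft than Paxos :) Early Access for this mode has already been committed to the Kafka branch and is expected to ship in Kafka 2.8.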
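Curator, the well-known ZooKeeper client library, maintains a dedicated set of Tech Notes on using ZooKeeper. I have selected some of the important ones, quoted with my notes below: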
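[…] event, which means Curator is currently not connected to any ZK server), operations such as leader election and distributed locks that encounter SUSPENDED […]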
When Curator receives a KeeperState.Disconnected message it changes its state to SUSPENDED (see TN12, errors, etc.). As always, our recommendation is to treat SUSPENDED as a complete connection loss. Exit all locks, leaders, etc. That said, since 3.x, Curator tries to simulate session expiration by starting an internal timer when KeeperState.Disconnected is received. If the timer expires before the connection is repaired, Curator changes its state to LOST and injects a session end into the managed ZooKeeper client connection. The duration of the timer is set to the value of the “negotiated session timeout” by calling ZooKeeper#getSessionTimeout().
The astute reader will realize that setting the timer to the full value of the session timeout may not be the correct value. This is due to the fact that the server closes the connection when 2/3 of a session have already elapsed. Thus, the server may close a session well before Curator’s timer elapses. This is further complicated by the fact that the client has no way of knowing why the connection was closed. There are at least three possible reasons for a client connection to close:
1. The server has not received a heartbeat within 2/3 of a session
2. The server crashed
3. Some kind of general TCP error which causes a connection to fail
In situation 1, the correct value for Curator’s timer is 1/3 of a session - i.e. Curator should switch to LOST if the connection is not repaired within 1/3 of a session as 2/3 of the session has already lapsed from the server’s point of view. In situations 2 and 3 however, Curator’s timer should be the full value of the session (possibly plus some “slop” value). In truth, there is no way to completely emulate in the client the session timing as managed by the ZooKeeper server. So, again, our recommendation is to treat SUSPENDED as complete connection loss.
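By default, Curator uses 100% of the session timeout as the time to move from SUSPENDED to LOST, but users can configure it as 33% of the session timeout if they need to cover the first situation described above.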
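To make the 100% versus 33% choice concrete, here is a minimal Python sketch of the SUSPENDED-to-LOST timer arithmetic. `LostTimer`, `percent`, and the method names are hypothetical and are not Curator's API; the sketch only models the timing described above, assuming the timer starts at the moment the disconnect is observed.

```python
from dataclasses import dataclass


@dataclass
class LostTimer:
    """Toy model of the client-side timer that turns SUSPENDED into LOST."""

    negotiated_session_timeout_ms: int
    percent: int = 100  # hypothetical knob: share of the session timeout to wait

    def lost_after_ms(self) -> int:
        # Time to wait after the disconnect before declaring the session LOST.
        return self.negotiated_session_timeout_ms * self.percent // 100

    def state_after_disconnect(self, elapsed_ms: int) -> str:
        # Still SUSPENDED until the timer fires, LOST afterwards.
        return "LOST" if elapsed_ms >= self.lost_after_ms() else "SUSPENDED"


default_timer = LostTimer(negotiated_session_timeout_ms=30_000)
strict_timer = LostTimer(negotiated_session_timeout_ms=30_000, percent=33)

# 10 seconds after the disconnect the two settings disagree.
print(default_timer.state_after_disconnect(10_000))
print(strict_timer.state_after_disconnect(10_000))
```

With a 30 s session, the default setting is still SUSPENDED at the 10 s mark, while the 33% setting has already moved to LOST because its timer fires at 9.9 s. Neither value is right in every case, which is why the recommendation remains to treat SUSPENDED itself as a connection loss.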
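Kafka calls the consensus protocol it introduces Event-driven consensus. Each controller node maintains an RSM (replicated state machine) internally, unlike the earlier ZooKeeper-based design, in which a node first had to query ZooKeeper for state information. Kafka's metadata is written to the quorum through the Raft consensus protocol, and the system takes snapshots periodically.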
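The replicated state machine idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not Kafka's implementation: the record types (`create_topic`, `delete_topic`), the `MetadataStateMachine` class, and the snapshot layout are all invented for illustration. The point it shows is that state is derived only from the ordered log, so a snapshot plus the remaining records yields the same state as a full replay.

```python
class MetadataStateMachine:
    """Toy replicated state machine: state is a pure function of the ordered log."""

    def __init__(self):
        self.topics = {}        # topic name -> partition count
        self.last_applied = -1  # offset of the last record applied

    def apply(self, offset, record):
        # Records must be applied in log order, without gaps.
        if offset != self.last_applied + 1:
            raise ValueError(f"expected offset {self.last_applied + 1}, got {offset}")
        if record["type"] == "create_topic":
            self.topics[record["name"]] = record["partitions"]
        elif record["type"] == "delete_topic":
            self.topics.pop(record["name"], None)
        else:
            raise ValueError(f"unknown record type {record['type']!r}")
        self.last_applied = offset

    def snapshot(self):
        # A snapshot captures the state plus the offset it corresponds to.
        return {"last_applied": self.last_applied, "topics": dict(self.topics)}

    @classmethod
    def from_snapshot(cls, snap):
        sm = cls()
        sm.topics = dict(snap["topics"])
        sm.last_applied = snap["last_applied"]
        return sm


log = [
    {"type": "create_topic", "name": "orders", "partitions": 3},
    {"type": "create_topic", "name": "logs", "partitions": 1},
    {"type": "delete_topic", "name": "logs"},
    {"type": "create_topic", "name": "clicks", "partitions": 6},
]

# Replica A replays the whole log.
full = MetadataStateMachine()
for offset, record in enumerate(log):
    full.apply(offset, record)

# Replica B starts from a snapshot taken after two records, then catches up.
early = MetadataStateMachine()
for offset, record in enumerate(log[:2]):
    early.apply(offset, record)
restored = MetadataStateMachine.from_snapshot(early.snapshot())
for offset, record in enumerate(log[2:], start=2):
    restored.apply(offset, record)

print(full.topics == restored.topics)
```

Because both replicas apply the same records in the same order, they converge on the same metadata regardless of whether they started from offset 0 or from a snapshot.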
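In earlier Kafka clusters, a single Controller was elected from among all the brokers and was responsible for watching ZooKeeper, assigning partition replicas across the cluster, and driving leader failover and election. In KRaft, by contrast, an odd number of nodes (typically 3 or 5) can be designated as controllers, and together they form a Raft quorum. One of the controller nodes is active (elected as leader) and the others are hot standbys. This controller cluster manages the Kafka cluster's metadata and reaches consensus through the Raft protocol. As a result, every controller holds almost up-to-date metadata, so recovery is fast when the controller cluster elects a new leader.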
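The preference for an odd number of controllers follows from majority-quorum arithmetic, which this small Python sketch works through. The function names are my own; the only assumption is the standard Raft rule that a decision needs a strict majority of voters.

```python
def majority(voters: int) -> int:
    # Smallest number of voters that forms a strict majority.
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    # How many voters can fail while a majority is still reachable.
    return voters - majority(voters)


for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n))
```

Three controllers tolerate one failure and five tolerate two, while four tolerate only one, the same as three. An even-sized quorum therefore adds a node without adding fault tolerance.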
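[…] obtain the controller. In the previous mode, the controller pushed metadata to the other brokers. Now each broker has to actively pull metadata from the active controller. Once a broker receives metadata, it persists it. Because brokers persist metadata, the active controller generally does not need to send the full metadata and only has to send what follows a given offset. For a broker that has just come online, however, the controller can send a snapshot instead (similar to Raft's InstallSnapshot RPC).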
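The pull model with an offset and a snapshot fallback can be sketched as follows. This is a simplified Python toy, not Kafka's wire protocol: `ToyController`, `ToyBroker`, the `(key, value)` record shape, and the response dictionary are all assumptions made for illustration. It assumes the controller has compacted the start of its log into a snapshot, so a broker whose offset is older than the snapshot must load the snapshot first.

```python
def replay(records):
    # Fold (key, value) records into a metadata dictionary.
    state = {}
    for key, value in records:
        state[key] = value
    return state


class ToyController:
    def __init__(self, records, snapshot_end=0):
        # The snapshot covers offsets [0, snapshot_end); only the tail is kept as a log.
        self.snapshot_end = snapshot_end
        self.snapshot_state = replay(records[:snapshot_end])
        self.tail = list(records[snapshot_end:])

    def fetch(self, next_offset):
        if next_offset < self.snapshot_end:
            # The broker is too far behind: send the snapshot plus the whole tail.
            return {"snapshot": dict(self.snapshot_state),
                    "snapshot_end": self.snapshot_end,
                    "records": list(self.tail)}
        # Otherwise send only the records the broker has not seen yet.
        return {"snapshot": None,
                "records": self.tail[next_offset - self.snapshot_end:]}


class ToyBroker:
    def __init__(self, metadata=None, next_offset=0):
        # A real broker would persist both fields; here they live in memory.
        self.metadata = dict(metadata or {})
        self.next_offset = next_offset

    def poll(self, controller):
        resp = controller.fetch(self.next_offset)
        if resp["snapshot"] is not None:
            self.metadata = dict(resp["snapshot"])
            self.next_offset = resp["snapshot_end"]
        for key, value in resp["records"]:
            self.metadata[key] = value
            self.next_offset += 1
        return len(resp["records"])


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
controller = ToyController(records, snapshot_end=2)

new_broker = ToyBroker()  # just came online, has nothing
caught_up = ToyBroker(metadata={"a": 3, "b": 2}, next_offset=3)

new_broker.poll(controller)            # receives snapshot + 2 records
sent = caught_up.poll(controller)      # receives only the record at offset 3
print(sent, new_broker.metadata == caught_up.metadata)
```

The broker that already persisted metadata up to offset 3 receives a single record, while the new broker is brought up to date through the snapshot. Both end with identical metadata.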
