赞
踩
Kafka源码包含多个模块,每个模块负责不同的功能。以下是一些核心模块及其功能的概述:
服务端源码 :实现Kafka Broker的核心功能,包括日志存储、控制器、协调器、元数据管理及状态机管理、延迟机制、消费者组管理、高并发网络架构模型实现等。
Java客户端源码 :实现了Producer和Consumer与Broker的交互机制,以及通用组件支撑代码。
Connect源码 :用来构建异构数据双向流式同步服务。
Stream源码 :用来实现实时流处理相关功能。
Raft源码 :实现了Raft一致性协议。
Admin模块 :Kafka的管理员模块,操作和管理其topic,partition相关,包含创建,删除topic,或者拓展分区等。
Api模块 :负责数据交互,客户端与服务端交互数据的编码与解码。
Client模块 :包含Producer读取Kafka Broker元数据信息的类,如topic和分区,以及leader。
Cluster模块 :包含Broker、Cluster、Partition、Replica等实体类。
Common模块 :包含各种异常类以及错误验证。
Consumer模块 :消费者处理模块,负责客户端消费者数据和逻辑处理。
Controller模块 :负责中央控制器的选举,分区的Leader选举,Replica的分配或重新分配,分区和副本的扩容等。
Coordinator模块 :负责管理部分consumer group和他们的offset。
Javaapi模块 :提供Java语言的Producer和Consumer的API接口。
Log模块 :负责Kafka文件存储,读写所有Topic消息数据。
Message模块 :封装多条数据组成数据集或压缩数据集。
Metrics模块 :负责内部状态监控。
Network模块 :处理客户端连接,网络事件模块。
Producer模块 :生产者细节实现,包括同步和异步消息发送。
Security模块 :负责Kafka的安全验证和管理。
Serializer模块 :序列化和反序列化消息内容。
Server模块 :涉及Leader和Offset的checkpoint,动态配置,延时创建和删除Topic,Leader选举,Admin和Replica管理等。
Tools模块 :包含多种工具,如导出consumer offset值,LogSegments信息,Topic的log位置信息,Zookeeper上的offset值等。
Utils模块 :包含各种工具类,如Json,ZkUtils,线程池工具类,KafkaScheduler公共调度器类等。
这些模块共同构成了Kafka的整体架构,使其能够提供高吞吐量、高可用性的消息队列服务。
KafkaApis类的handleFetchRequest()方法作为api入口:
/** * Handle a fetch request */ //处理消费者消息消费请求、follwer副本同步请求 def handleFetchRequest(request: RequestChannel.Request) { val fetchRequest = request.body[FetchRequest] val versionId = request.header.apiVersion val clientId = request.header.clientId val unauthorizedTopicResponseData = mutable.ArrayBuffer[(TopicPartition, FetchResponse.PartitionData)]() val nonExistingTopicResponseData = mutable.ArrayBuffer[(TopicPartition, FetchResponse.PartitionData)]() val authorizedRequestInfo = mutable.ArrayBuffer[(TopicPartition, FetchRequest.PartitionData)]() //replicaId >= 0时表示是follwer副本同步请求 if (fetchRequest.isFromFollower() && !authorize(request.session, ClusterAction, Resource.ClusterResource)) for (topicPartition <- fetchRequest.fetchData.asScala.keys) unauthorizedTopicResponseData += topicPartition -> new FetchResponse.PartitionData(Errors.CLUSTER_AUTHORIZATION_FAILED, FetchResponse.INVALID_HIGHWATERMARK, FetchResponse.INVALID_LAST_STABLE_OFFSET, FetchResponse.INVALID_LOG_START_OFFSET, null, MemoryRecords.EMPTY) else for ((topicPartition, partitionData) <- fetchRequest.fetchData.asScala) { if (!authorize(request.session, Read, new Resource(Topic, topicPartition.topic))) unauthorizedTopicResponseData += topicPartition -> new FetchResponse.PartitionData(Errors.TOPIC_AUTHORIZATION_FAILED, FetchResponse.INVALID_HIGHWATERMARK, FetchResponse.INVALID_LAST_STABLE_OFFSET, FetchResponse.INVALID_LOG_START_OFFSET, null, MemoryRecords.EMPTY) else if (!metadataCache.contains(topicPartition.topic)) nonExistingTopicResponseData += topicPartition -> new FetchResponse.PartitionData(Errors.UNKNOWN_TOPIC_OR_PARTITION, FetchResponse.INVALID_HIGHWATERMARK, FetchResponse.INVALID_LAST_STABLE_OFFSET, FetchResponse.INVALID_LOG_START_OFFSET, null, MemoryRecords.EMPTY) else authorizedRequestInfo += (topicPartition -> partitionData) } def convertedPartitionData(tp: TopicPartition, data: FetchResponse.PartitionData) = { // Down-conversion of the fetched records is needed when the stored magic version is // greater than that supported by the client (as indicated by the fetch request version). If the // configured magic version for the topic is less than or equal to that supported by the version of the // fetch request, we skip the iteration through the records in order to check the magic version since we // know it must be supported. However, if the magic version is changed from a higher version back to a // lower version, this check will no longer be valid and we will fail to down-convert the messages // which were written in the new format prior to the version downgrade. replicaManager.getMagic(tp).flatMap { magic => val downConvertMagic = { if (magic > RecordBatch.MAGIC_VALUE_V0 && versionId <= 1 && !data.records.hasCompatibleMagic(RecordBatch.MAGIC_VALUE_V0)) Some(RecordBatch.MAGIC_VALUE_V0) else if (magic > RecordBatch.MAGIC_VALUE_V1 && versionId <= 3 && !data.records.hasCompatibleMagic(RecordBatch.MAGIC_VALUE_V1)) Some(RecordBatch.MAGIC_VALUE_V1) else None } downConvertMagic.map { magic => trace(s"Down converting records from partition $tp to message format version $magic for fetch request from $clientId") val converted = data.records.downConvert(magic, fetchRequest.fetchData.get(tp).fetchOffset, time) updateRecordsProcessingStats(request, tp, converted.recordsProcessingStats) new FetchResponse.PartitionData(data.error, data.highWatermark, FetchResponse.INVALID_LAST_STABLE_OFFSET, data.logStartOffset, data.abortedTransactions, converted.records) } }.getOrElse(data) } // the callback for process a fetch response, invoked before throttling def processResponseCallback(responsePartitionData: Seq[(TopicPartition, FetchPartitionData)]) { val partitionData = { responsePartitionData.map { case (tp, data) => val abortedTransactions = data.abortedTransactions.map(_.asJava).orNull val lastStableOffset = data.lastStableOffset.getOrElse(FetchResponse.INVALID_LAST_STABLE_OFFSET) tp -> new FetchResponse.PartitionData(data.error, data.highWatermark, lastStableOffset, data.logStartOffset, abortedTransactions, data.records) } } val mergedPartitionData = partitionData ++ unauthorizedTopicResponseData ++ nonExistingTopicResponseData val fetchedPartitionData = new util.LinkedHashMap[TopicPartition, FetchResponse.PartitionData]() mergedPartitionData.foreach { case (topicPartition, data) => if (data.error != Errors.NONE) debug(s"Fetch request with correlation id ${request.header.correlationId} from client $clientId " + s"on partition $topicPartition failed due to ${data.error.exceptionName}") fetchedPartitionData.put(topicPartition, data) } // fetch response callback invoked after any throttling def fetchResponseCallback(bandwidthThrottleTimeMs: Int) { def createResponse(requestThrottleTimeMs: Int): FetchResponse = { val convertedData = new util.LinkedHashMap[TopicPartition, FetchResponse.PartitionData] fetchedPartitionData.asScala.foreach { case (tp, partitionData) => convertedData.put(tp, convertedPartitionData(tp, partitionData)) } val response = new FetchResponse(convertedData, bandwidthThrottleTimeMs + requestThrottleTimeMs) response.responseData.asScala.foreach { case (topicPartition, data) => // record the bytes out metrics only when the response is being sent brokerTopicStats.updateBytesOut(topicPartition.topic, fetchRequest.isFromFollower, data.records.sizeInBytes) } response } if (fetchRequest.isFromFollower) sendResponseExemptThrottle(request, createResponse(0)) else sendResponseMaybeThrottle(request, requestThrottleMs => createResponse(requestThrottleMs)) } // When this callback is triggered, the remote API call has completed. // Record time before any byte-rate throttling. request.apiRemoteCompleteTimeNanos = time.nanoseconds if (fetchRequest.isFromFollower) { // We've already evaluated against the quota and are good to go. Just need to record it now. val responseSize = sizeOfThrottledPartitions(versionId, fetchRequest, mergedPartitionData, quotas.leader) quotas.leader.record(responseSize) fetchResponseCallback(bandwidthThrottleTimeMs = 0) } else { // Fetch size used to determine throttle time is calculated before any down conversions. // This may be slightly different from the actual response size. But since down conversions // result in data being loaded into memory, it is better to do this after throttling to avoid OOM. val response = new FetchResponse(fetchedPartitionData, 0) val responseStruct = response.toStruct(versionId) quotas.fetch.maybeRecordAndThrottle(request.session.sanitizedUser, clientId, responseStruct.sizeOf, fetchResponseCallback) } } if (authorizedRequestInfo.isEmpty) processResponseCallback(Seq.empty) else { // call the replica manager to fetch messages from the local replica //调用replica manager从本地日志副本中拉取消息 replicaManager.fetchMessages( fetchRequest.maxWait.toLong, //最长等待时间 fetchRequest.replicaId, //follower副本的brokerid fetchRequest.minBytes, //拉取请求设置的最小拉取字节 fetchRequest.maxBytes, //拉取请求设置的最大拉取字节 versionId <= 2, authorizedRequestInfo, replicationQuota(fetchRequest), processResponseCallback, //回调函数 fetchRequest.isolationLevel) //隔离级别,包含 READ_UNCOMMITTED, READ_COMMITTED; } }
ReplicaManager.fetchMessages()中会继续调用readFromLocalLog()方法:
/** * Fetch messages from the leader replica, and wait until enough data can be fetched and return; * the callback function will be triggered either when timeout or required fetch info is satisfied */ //从leader副本拉取数据,等待直到拉取到足够的数据,在超时或拉取数据满足条件时会触发回调函数 def fetchMessages(timeout: Long, replicaId: Int, fetchMinBytes: Int, fetchMaxBytes: Int, hardMaxBytesLimit: Boolean, fetchInfos: Seq[(TopicPartition, PartitionData)], quota: ReplicaQuota = UnboundedQuota, responseCallback: Seq[(TopicPartition, FetchPartitionData)] => Unit, isolationLevel: IsolationLevel) { //replicaId>=0时,表示是对应brokerId的follower副本的拉取请求。否则就是consumer的拉取请求 val isFromFollower = Request.isValidBrokerId(replicaId) //目前只能从leader拉取消息,因此fetchOnlyFromLeader始终为true val fetchOnlyFromLeader = replicaId != Request.DebuggingConsumerId //fetchOnlyCommitted=true表示拉取请求来自 consumer, 只能拉取 HW 以内的数据;如果请求是来自 Follower Replica 同步,则没有该限制(false)。 val fetchOnlyCommitted = !isFromFollower def readFromLog(): Seq[(TopicPartition, LogReadResult)] = { //获取本地日志 val result = readFromLocalLog( replicaId = replicaId, fetchOnlyFromLeader = fetchOnlyFromLeader, readOnlyCommitted = fetchOnlyCommitted, fetchMaxBytes = fetchMaxBytes, hardMaxBytesLimit = hardMaxBytesLimit, readPartitionInfo = fetchInfos, quota = quota, isolationLevel = isolationLevel) //如果请求来自follower副本,则需要更新follower相关的拉取状态 if (isFromFollower) updateFollowerLogReadResults(replicaId, result) else result } val logReadResults = readFromLog() // check if this fetch request can be satisfied right away val logReadResultValues = logReadResults.map { case (_, v) => v } val bytesReadable = logReadResultValues.map(_.info.records.sizeInBytes).sum val errorReadingData = logReadResultValues.foldLeft(false) ((errorIncurred, readResult) => errorIncurred || (readResult.error != Errors.NONE)) // respond immediately if 1) fetch request does not want to wait // 2) fetch request does not require any data // 3) has enough data to respond // 4) some error happens while reading data //若满足一下条件之一将会直接返回结果:timeout<=0、拉取的消息为空、拉取到了足够的数据、读取数据期间发生异常 if (timeout <= 0 || fetchInfos.isEmpty || bytesReadable >= fetchMinBytes || errorReadingData) { val fetchPartitionData = logReadResults.map { case (tp, result) => tp -> FetchPartitionData(result.error, result.highWatermark, result.leaderLogStartOffset, result.info.records, result.lastStableOffset, result.info.abortedTransactions) } responseCallback(fetchPartitionData) } else { //未满足立即返回结果的情况,需延迟返回结果 // construct the fetch results from the read results val fetchPartitionStatus = logReadResults.map { case (topicPartition, result) => val fetchInfo = fetchInfos.collectFirst { case (tp, v) if tp == topicPartition => v }.getOrElse(sys.error(s"Partition $topicPartition not found in fetchInfos")) (topicPartition, FetchPartitionStatus(result.info.fetchOffsetMetadata, fetchInfo)) } val fetchMetadata = FetchMetadata(fetchMinBytes, fetchMaxBytes, hardMaxBytesLimit, fetchOnlyFromLeader, fetchOnlyCommitted, isFromFollower, replicaId, fetchPartitionStatus) val delayedFetch = new DelayedFetch(timeout, fetchMetadata, this, quota, isolationLevel, responseCallback) // create a list of (topic, partition) pairs to use as keys for this delayed fetch operation val delayedFetchKeys = fetchPartitionStatus.map { case (tp, _) => new TopicPartitionOperationKey(tp) } // try to complete the request immediately, otherwise put it into the purgatory; // this is because while the delayed fetch operation is being created, new requests // may arrive and hence make this operation completable. delayedFetchPurgatory.tryCompleteElseWatch(delayedFetch, delayedFetchKeys) } } /** * Read from multiple topic partitions at the given offset up to maxSize bytes */ //从多个partition给定的offset中获取消息 def readFromLocalLog(replicaId: Int, fetchOnlyFromLeader: Boolean, readOnlyCommitted: Boolean, fetchMaxBytes: Int, hardMaxBytesLimit: Boolean, readPartitionInfo: Seq[(TopicPartition, PartitionData)], quota: ReplicaQuota, isolationLevel: IsolationLevel): Seq[(TopicPartition, LogReadResult)] = { def read(tp: TopicPartition, fetchInfo: PartitionData, limitBytes: Int, minOneMessage: Boolean): LogReadResult = { val offset = fetchInfo.fetchOffset val partitionFetchSize = fetchInfo.maxBytes val followerLogStartOffset = fetchInfo.logStartOffset brokerTopicStats.topicStats(tp.topic).totalFetchRequestRate.mark() brokerTopicStats.allTopicsStats.totalFetchRequestRate.mark() try { trace(s"Fetching log segment for partition $tp, offset $offset, partition fetch size $partitionFetchSize, " + s"remaining response limit $limitBytes" + (if (minOneMessage) s", ignoring response/partition size limits" else "")) // decide whether to only fetch from leader //当前fetchOnlyFromLeader始终为true val localReplica = if (fetchOnlyFromLeader) { //获取该分区对应的leader Replica对象 getLeaderReplicaIfLocal(tp) } else getReplicaOrException(tp) //获取 hw 位置,副本同步不设置这个值 val initialHighWatermark = localReplica.highWatermark.messageOffset val lastStableOffset = if (isolationLevel == IsolationLevel.READ_COMMITTED) Some(localReplica.lastStableOffset.messageOffset) else None // decide whether to only fetch committed data (i.e. messages below high watermark) //readOnlyCommitted=true,表示是consumer只能消费到hw前的消息,为false表示是follower同步消息,没有这个限制 val maxOffsetOpt = if (readOnlyCommitted) Some(lastStableOffset.getOrElse(initialHighWatermark)) else None /* Read the LogOffsetMetadata prior to performing the read from the log. * We use the LogOffsetMetadata to determine if a particular replica is in-sync or not. * Using the log end offset after performing the read can lead to a race condition * where data gets appended to the log immediately after the replica has consumed from it * This can cause a replica to always be out of sync. */ //Log end offset val initialLogEndOffset = localReplica.logEndOffset.messageOffset //log Start Offset val initialLogStartOffset = localReplica.logStartOffset val fetchTimeMs = time.milliseconds val logReadInfo = localReplica.log match { case Some(log) => val adjustedFetchSize = math.min(partitionFetchSize, limitBytes) // Try the read first, this tells us whether we need all of adjustedFetchSize for this partition //从指定的 offset 位置开始读取数据,如果是follower副本同步,则maxOffsetOpt=None val fetch = log.read(offset, adjustedFetchSize, maxOffsetOpt, minOneMessage, isolationLevel) // If the partition is being throttled, simply return an empty set. //如果partition被限速了,那么返回 空 集合 if (shouldLeaderThrottle(quota, tp, replicaId)) FetchDataInfo(fetch.fetchOffsetMetadata, MemoryRecords.EMPTY) // For FetchRequest version 3, we replace incomplete message sets with an empty one as consumers can make // progress in such cases and don't need to report a `RecordTooLargeException` else if (!hardMaxBytesLimit && fetch.firstEntryIncomplete) FetchDataInfo(fetch.fetchOffsetMetadata, MemoryRecords.EMPTY) else fetch case None => error(s"Leader for partition $tp does not have a local log") FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY) } LogReadResult(info = logReadInfo, highWatermark = initialHighWatermark, leaderLogStartOffset = initialLogStartOffset, leaderLogEndOffset = initialLogEndOffset, followerLogStartOffset = followerLogStartOffset, fetchTimeMs = fetchTimeMs, readSize = partitionFetchSize, lastStableOffset = lastStableOffset, exception = None) } catch { // NOTE: Failed fetch requests metric is not incremented for known exceptions since it // is supposed to indicate un-expected failure of a broker in handling a fetch request case e@ (_: UnknownTopicOrPartitionException | _: NotLeaderForPartitionException | _: ReplicaNotAvailableException | _: KafkaStorageException | _: OffsetOutOfRangeException) => LogReadResult(info = FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY), highWatermark = -1L, leaderLogStartOffset = -1L, leaderLogEndOffset = -1L, followerLogStartOffset = -1L, fetchTimeMs = -1L, readSize = partitionFetchSize, lastStableOffset = None, exception = Some(e)) case e: Throwable => brokerTopicStats.topicStats(tp.topic).failedFetchRequestRate.mark() brokerTopicStats.allTopicsStats.failedFetchRequestRate.mark() error(s"Error processing fetch operation on partition $tp, offset $offset", e) LogReadResult(info = FetchDataInfo(LogOffsetMetadata.UnknownOffsetMetadata, MemoryRecords.EMPTY), highWatermark = -1L, leaderLogStartOffset = -1L, leaderLogEndOffset = -1L, followerLogStartOffset = -1L, fetchTimeMs = -1L, readSize = partitionFetchSize, lastStableOffset = None, exception = Some(e)) } } var limitBytes = fetchMaxBytes //存储拉取的数据结果集 val result = new mutable.ArrayBuffer[(TopicPartition, LogReadResult)] var minOneMessage = !hardMaxBytesLimit //遍历要拉取的partition,并调用read方法读取数据 readPartitionInfo.foreach { case (tp, fetchInfo) => val readResult = read(tp, fetchInfo, limitBytes, minOneMessage) val recordBatchSize = readResult.info.records.sizeInBytes // Once we read from a non-empty partition, we stop ignoring request and partition level size limits if (recordBatchSize > 0) minOneMessage = false limitBytes = math.max(0, limitBytes - recordBatchSize) result += (tp -> readResult) } result }
Log.read()方法:
/** * Read messages from the log. * * @param startOffset The offset to begin reading at * @param maxLength The maximum number of bytes to read * @param maxOffset The offset to read up to, exclusive. (i.e. this offset NOT included in the resulting message set) * @param minOneMessage If this is true, the first message will be returned even if it exceeds `maxLength` (if one exists) * @param isolationLevel The isolation level of the fetcher. The READ_UNCOMMITTED isolation level has the traditional * read semantics (e.g. consumers are limited to fetching up to the high watermark). In * READ_COMMITTED, consumers are limited to fetching up to the last stable offset. Additionally, * in READ_COMMITTED, the transaction index is consulted after fetching to collect the list * of aborted transactions in the fetch range which the consumer uses to filter the fetched * records before they are returned to the user. Note that fetches from followers always use * READ_UNCOMMITTED. * * @throws OffsetOutOfRangeException If startOffset is beyond the log end offset or before the log start offset * @return The fetch data information including fetch starting offset metadata and messages read. */ //从指定 startOffset 开始读取数据 def read(startOffset: Long, maxLength: Int, maxOffset: Option[Long] = None, minOneMessage: Boolean = false, isolationLevel: IsolationLevel): FetchDataInfo = { maybeHandleIOException(s"Exception while reading from $topicPartition in dir ${dir.getParent}") { trace(s"Reading $maxLength bytes from offset $startOffset of length $size bytes") // Because we don't use lock for reading, the synchronization is a little bit tricky. // We create the local variables to avoid race conditions with updates to the log. val currentNextOffsetMetadata = nextOffsetMetadata val next = currentNextOffsetMetadata.messageOffset if (startOffset == next) { val abortedTransactions = if (isolationLevel == IsolationLevel.READ_COMMITTED) Some(List.empty[AbortedTransaction]) else None return FetchDataInfo(currentNextOffsetMetadata, MemoryRecords.EMPTY, firstEntryIncomplete = false, abortedTransactions = abortedTransactions) } // 从跳跃表中查找对应的日志分段(logSegment) var segmentEntry = segments.floorEntry(startOffset) // return error on attempt to read beyond the log end offset or read below log start offset if (startOffset > next || segmentEntry == null || startOffset < logStartOffset) throw new OffsetOutOfRangeException("Request for offset %d but we only have log segments in the range %d to %d.".format(startOffset, logStartOffset, next)) // Do the read on the segment with a base offset less than the target offset // but if that segment doesn't contain any messages with an offset greater than that // continue to read from successive segments until we get some messages or we reach the end of the log while (segmentEntry != null) { val segment = segmentEntry.getValue // If the fetch occurs on the active segment, there might be a race condition where two fetch requests occur after // the message is appended but before the nextOffsetMetadata is updated. In that case the second fetch may // cause OffsetOutOfRangeException. To solve that, we cap the reading up to exposed position instead of the log // end of the active segment. val maxPosition = { if (segmentEntry == segments.lastEntry) { val exposedPos = nextOffsetMetadata.relativePositionInSegment.toLong // Check the segment again in case a new segment has just rolled out. //刚好此时产生了新的 segment文件, 再次判断 if (segmentEntry != segments.lastEntry) // New log segment has rolled out, we can read up to the file end. segment.size else exposedPos } else { segment.size } } //从 LogSegment 中读取相应的数据 val fetchInfo = segment.read(startOffset, maxOffset, maxLength, maxPosition, minOneMessage) if (fetchInfo == null) { //如果该日志分段没有读取到数据, 则读取更高的日志分段 segmentEntry = segments.higherEntry(segmentEntry.getKey) } else { return isolationLevel match { case IsolationLevel.READ_UNCOMMITTED => fetchInfo case IsolationLevel.READ_COMMITTED => addAbortedTransactions(startOffset, segmentEntry, fetchInfo) } } } // okay we are beyond the end of the last segment with no data fetched although the start offset is in range, // this can happen when all messages with offset larger than start offsets have been deleted. // In this case, we will return the empty set with log end offset metadata FetchDataInfo(nextOffsetMetadata, MemoryRecords.EMPTY) } }
会继续调用LogSegment.read():
/** * Read a message set from this segment beginning with the first offset >= startOffset. The message set will include * no more than maxSize bytes and will end before maxOffset if a maxOffset is specified. * * @param startOffset A lower bound on the first offset to include in the message set we read * @param maxSize The maximum number of bytes to include in the message set we read * @param maxOffset An optional maximum offset for the message set we read * @param maxPosition The maximum position in the log segment that should be exposed for read * @param minOneMessage If this is true, the first message will be returned even if it exceeds `maxSize` (if one exists) * * @return The fetched data and the offset metadata of the first message whose offset is >= startOffset, * or null if the startOffset is larger than the largest offset in this log */ @threadsafe def read(startOffset: Long, maxOffset: Option[Long], maxSize: Int, maxPosition: Long = size, minOneMessage: Boolean = false): FetchDataInfo = { if (maxSize < 0) throw new IllegalArgumentException("Invalid max size for log read (%d)".format(maxSize)) //log文件大小 val logSize = log.sizeInBytes // this may change, need to save a consistent copy //将起始的 offset 转换为起始的实际物理位置 val startOffsetAndSize = translateOffset(startOffset) // if the start position is already off the end of the log, return null //如果起始位置已经超出了日志的末尾,则返回null if (startOffsetAndSize == null) return null val startPosition = startOffsetAndSize.position val offsetMetadata = new LogOffsetMetadata(startOffset, this.baseOffset, startPosition) val adjustedMaxSize = if (minOneMessage) math.max(maxSize, startOffsetAndSize.size) else maxSize // return a log segment but with zero size in the case below if (adjustedMaxSize == 0) return FetchDataInfo(offsetMetadata, MemoryRecords.EMPTY) // calculate the length of the message set to read based on whether or not they gave us a maxOffset //计算要读取的消息长度 val fetchSize: Int = maxOffset match { //maxOffset=None,表示follower副本同步时的计算方式 case None => // no max offset, just read until the max position min((maxPosition - startPosition).toInt, adjustedMaxSize) //consumer拉取消息的计算方式 case Some(offset) => // there is a max offset, translate it to a file position and use that to calculate the max read size; // when the leader of a partition changes, it's possible for the new leader's high watermark to be less than the // true high watermark in the previous leader for a short window. In this window, if a consumer fetches on an // offset between new leader's high watermark and the log end offset, we want to return an empty response. if (offset < startOffset) return FetchDataInfo(offsetMetadata, MemoryRecords.EMPTY, firstEntryIncomplete = false) val mapping = translateOffset(offset, startPosition) val endPosition = if (mapping == null) logSize // the max offset is off the end of the log, use the end of the file else mapping.position min(min(maxPosition, endPosition) - startPosition, adjustedMaxSize).toInt } //根据起始的物理位置和读取长度读取数据文件 FetchDataInfo(offsetMetadata, log.read(startPosition, fetchSize), firstEntryIncomplete = adjustedMaxSize < startOffsetAndSize.size) } /** * Find the physical file position for the first message with offset >= the requested offset. * * The startingFilePosition argument is an optimization that can be used if we already know a valid starting position * in the file higher than the greatest-lower-bound from the index. * * @param offset The offset we want to translate * @param startingFilePosition A lower bound on the file position from which to begin the search. This is purely an optimization and * when omitted, the search will begin at the position in the offset index. * @return The position in the log storing the message with the least offset >= the requested offset and the size of the * message or null if no message meets this criteria. */ //查找 offset 索引文件:调用 offset 索引文件的 lookup() 查找方法,获取离 startOffset 最接近的物理位置; //调用数据文件的 searchFor() 方法,从指定的物理位置开始读取每条数据,知道找到对应 offset 的物理位置。 @threadsafe private[log] def translateOffset(offset: Long, startingFilePosition: Int = 0): LogOffsetPosition = { val mapping = index.lookup(offset) log.searchForOffsetWithSize(offset, max(mapping.position, startingFilePosition)) }
最后会调用FileRecords.read()方法,根据position和size参数读取数据:
/** * Return a slice of records from this instance, which is a view into this set starting from the given position * and with the given size limit. * * If the size is beyond the end of the file, the end will be based on the size of the file at the time of the read. * * If this message set is already sliced, the position will be taken relative to that slicing. * * @param position The start position to begin the read from * @param size The number of bytes after the start position to include * @return A sliced wrapper on this message set limited based on the given position and size */ public FileRecords read(int position, int size) throws IOException { if (position < 0) throw new IllegalArgumentException("Invalid position: " + position); if (size < 0) throw new IllegalArgumentException("Invalid size: " + size); final int end; // handle integer overflow if (this.start + position + size < 0) end = sizeInBytes(); else end = Math.min(this.start + position + size, sizeInBytes()); return new FileRecords(file, channel, this.start + position, end, true); }
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。