The original plan was to walk through the source code and the explanation together in the previous analysis, but it quickly became clear that this would turn into an extremely large chapter. So the source code has been split out on its own: this section analyzes the four categories of in-memory data structures, and while going through the code you can compare it with the earlier explanations and with the earlier log analysis, which makes the exercise far more rewarding.
As usual, start with the code that defines the data structures:
struct buf_pool_t { /** @name General fields */ /** @{ */ /** protects (de)allocation of chunks: - changes to chunks, n_chunks are performed while holding this latch, - reading buf_pool_should_madvise requires holding this latch for any buf_pool_t - writing to buf_pool_should_madvise requires holding these latches for all buf_pool_t-s */ BufListMutex chunks_mutex; /** LRU list mutex */ BufListMutex LRU_list_mutex; /** free and withdraw list mutex */ BufListMutex free_list_mutex; /** buddy allocator mutex */ BufListMutex zip_free_mutex; /** zip_hash mutex */ BufListMutex zip_hash_mutex; /** Flush state protection mutex */ ib_mutex_t flush_state_mutex; /** Zip mutex of this buffer pool instance, protects compressed only pages (of type buf_page_t, not buf_block_t */ BufPoolZipMutex zip_mutex; /** Array index of this buffer pool instance */ ulint instance_no; /** Current pool size in bytes */ ulint curr_pool_size; /** Reserve this much of the buffer pool for "old" blocks */ ulint LRU_old_ratio; #ifdef UNIV_DEBUG /** Number of frames allocated from the buffer pool to the buddy system. Protected by zip_hash_mutex. */ ulint buddy_n_frames; #endif /** Allocator used for allocating memory for the the "chunks" member. */ ut_allocator<unsigned char> allocator; /** Number of buffer pool chunks */ volatile ulint n_chunks; /** New number of buffer pool chunks */ volatile ulint n_chunks_new; /** buffer pool chunks */ buf_chunk_t *chunks; /** old buffer pool chunks to be freed after resizing buffer pool */ buf_chunk_t *chunks_old; /** Current pool size in pages */ ulint curr_size; /** Previous pool size in pages */ ulint old_size; /** Size in pages of the area which the read-ahead algorithms read if invoked */ page_no_t read_ahead_area; /** Hash table of buf_page_t or buf_block_t file pages, buf_page_in_file() == TRUE, indexed by (space_id, offset). page_hash is protected by an array of mutexes. */ hash_table_t *page_hash; /** Old pointer to page_hash to be freed after resizing buffer pool */ hash_table_t *page_hash_old; /** Hash table of buf_block_t blocks whose frames are allocated to the zip buddy system, indexed by block->frame */ hash_table_t *zip_hash; /** Number of pending read operations. Accessed atomically */ std::atomic<ulint> n_pend_reads; /** number of pending decompressions. Accessed atomically. */ std::atomic<ulint> n_pend_unzip; /** when buf_print_io was last time called. Accesses not protected. */ ib_time_monotonic_t last_printout_time; /** Statistics of buddy system, indexed by block size. Protected by zip_free mutex, except for the used field, which is also accessed atomically */ buf_buddy_stat_t buddy_stat[BUF_BUDDY_SIZES_MAX + 1]; /** Current statistics */ buf_pool_stat_t stat; /** Old statistics */ buf_pool_stat_t old_stat; /* @} */ /** @name Page flushing algorithm fields */ /** @{ */ /** Mutex protecting the flush list access. This mutex protects flush_list, flush_rbt and bpage::list pointers when the bpage is on flush_list. It also protects writes to bpage::oldest_modification and flush_list_hp */ BufListMutex flush_list_mutex; /** "Hazard pointer" used during scan of flush_list while doing flush list batch. Protected by flush_list_mutex */ FlushHp flush_hp; /** Entry pointer to scan the oldest page except for system temporary */ FlushHp oldest_hp; /** Base node of the modified block list */ UT_LIST_BASE_NODE_T(buf_page_t) flush_list; /** This is true when a flush of the given type is being initialized. Protected by flush_state_mutex. 
*/ bool init_flush[BUF_FLUSH_N_TYPES]; /** This is the number of pending writes in the given flush type. Protected by flush_state_mutex. */ ulint n_flush[BUF_FLUSH_N_TYPES]; /** This is in the set state when there is no flush batch of the given type running. Protected by flush_state_mutex. */ os_event_t no_flush[BUF_FLUSH_N_TYPES]; /** A red-black tree is used exclusively during recovery to speed up insertions in the flush_list. This tree contains blocks in order of oldest_modification LSN and is kept in sync with the flush_list. Each member of the tree MUST also be on the flush_list. This tree is relevant only in recovery and is set to NULL once the recovery is over. Protected by flush_list_mutex */ ib_rbt_t *flush_rbt; /** A sequence number used to count the number of buffer blocks removed from the end of the LRU list; NOTE that this counter may wrap around at 4 billion! A thread is allowed to read this for heuristic purposes without holding any mutex or latch. For non-heuristic purposes protected by LRU_list_mutex */ ulint freed_page_clock; /** Set to false when an LRU scan for free block fails. This flag is used to avoid repeated scans of LRU list when we know that there is no free block available in the scan depth for eviction. Set to TRUE whenever we flush a batch from the buffer pool. Accessed protected by memory barriers. */ bool try_LRU_scan; /** Page Tracking start LSN. */ lsn_t track_page_lsn; /** Maximum LSN for which write io has already started. */ lsn_t max_lsn_io; /* @} */ /** @name LRU replacement algorithm fields */ /** @{ */ /** Base node of the free block list */ UT_LIST_BASE_NODE_T(buf_page_t) free; /** base node of the withdraw block list. It is only used during shrinking buffer pool size, not to reuse the blocks will be removed. Protected by free_list_mutex */ UT_LIST_BASE_NODE_T(buf_page_t) withdraw; /** Target length of withdraw block list, when withdrawing */ ulint withdraw_target; /** "hazard pointer" used during scan of LRU while doing LRU list batch. Protected by buf_pool::LRU_list_mutex */ LRUHp lru_hp; /** Iterator used to scan the LRU list when searching for replaceable victim. Protected by buf_pool::LRU_list_mutex. */ LRUItr lru_scan_itr; /** Iterator used to scan the LRU list when searching for single page flushing victim. Protected by buf_pool::LRU_list_mutex. */ LRUItr single_scan_itr; /** Base node of the LRU list */ UT_LIST_BASE_NODE_T(buf_page_t) LRU; /** Pointer to the about LRU_old_ratio/BUF_LRU_OLD_RATIO_DIV oldest blocks in the LRU list; NULL if LRU length less than BUF_LRU_OLD_MIN_LEN; NOTE: when LRU_old != NULL, its length should always equal LRU_old_len */ buf_page_t *LRU_old; /** Length of the LRU list from the block to which LRU_old points onward, including that block; see buf0lru.cc for the restrictions on this value; 0 if LRU_old == NULL; NOTE: LRU_old_len must be adjusted whenever LRU_old shrinks or grows! */ ulint LRU_old_len; /** Base node of the unzip_LRU list. The list is protected by the LRU_list_mutex. */ UT_LIST_BASE_NODE_T(buf_block_t) unzip_LRU; /** @} */ /** @name Buddy allocator fields The buddy allocator is used for allocating compressed page frames and buf_page_t descriptors of blocks that exist in the buffer pool only in compressed form. 
*/ /** @{ */ #if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG /** Unmodified compressed pages */ UT_LIST_BASE_NODE_T(buf_page_t) zip_clean; #endif /* UNIV_DEBUG || UNIV_BUF_DEBUG */ /** Buddy free lists */ UT_LIST_BASE_NODE_T(buf_buddy_free_t) zip_free[BUF_BUDDY_SIZES_MAX]; /** Sentinel records for buffer pool watches. Scanning the array is protected by taking all page_hash latches in X. Updating or reading an individual watch page is protected by a corresponding individual page_hash latch. */ buf_page_t *watch; /** A wrapper for buf_pool_t::allocator.alocate_large which also advices the OS that this chunk should not be dumped to a core file if that was requested. Emits a warning to the log and disables @@global.core_file if advising was requested but could not be performed, but still return true as the allocation itself succeeded. @param[in] mem_size number of bytes to allocate @param[in,out] chunk mem and mem_pfx fields of this chunk will be updated to contain information about allocated memory region @return true iff allocated successfully */ bool allocate_chunk(ulonglong mem_size, buf_chunk_t *chunk); /** A wrapper for buf_pool_t::allocator.deallocate_large which also advices the OS that this chunk can be dumped to a core file. Emits a warning to the log and disables @@global.core_file if advising was requested but could not be performed. @param[in] chunk mem and mem_pfx fields of this chunk will be used to locate the memory region to free */ void deallocate_chunk(buf_chunk_t *chunk); /** Advices the OS that all chunks in this buffer pool instance can be dumped to a core file. Emits a warning to the log if could not succeed. @return true iff succeeded, false if no OS support or failed */ bool madvise_dump(); /** Advices the OS that all chunks in this buffer pool instance should not be dumped to a core file. Emits a warning to the log if could not succeed. @return true iff succeeded, false if no OS support or failed * / bool madvise_dont_dump(); #if BUF_BUDDY_LOW > UNIV_ZIP_SIZE_MIN #error "BUF_BUDDY_LOW > UNIV_ZIP_SIZE_MIN" #endif * / };
The comment above this structure's definition makes it clear that this is an internal data structure and is not meant to be used from the outside. Now look at the member definitions inside it:
It begins with several mutexes for concurrency control over the buffer: one for the chunks, one for the LRU list, one for the compressed pages, and so on. Then comes an allocator (ut_allocator<unsigned char>) whose basic allocation unit is the unsigned byte, followed by the chunk-related members, then the hash table definitions, and finally the LRU-related data structures and lists. Compared with the other definitions this structure is relatively simple, but do not underestimate it.
Within the Buffer Pool there are several important data structures that control the management of the whole cache. buf_pool_t is the structure that implements a Buffer Pool instance; buf_block_t and buf_page_t manage and allocate the data pages; and buf_chunk_t is the basic unit in which the buffer pool is allocated. In plain terms: the buffer pool in MySQL is allocated chunk by chunk (buf_chunk_t), each chunk manages its buf_block_t and buf_page_t objects, and together the chunks form one buffer pool instance, of which the whole buffer pool has several. The related definitions are examined one by one below (a simplified model of how they fit together is sketched after the three definitions):
/** A chunk of buffers. The buffer pool is allocated in chunks. */ struct buf_chunk_t { ulint size; /*!< size of frames[] and blocks[] */ unsigned char *mem; /*!< pointer to the memory area which was allocated for the frames */ ut_new_pfx_t mem_pfx; /*!< Auxiliary structure, describing "mem". It is filled by the allocator's alloc method and later passed to the deallocate method. */ buf_block_t *blocks; /*!< array of buffer control blocks */ /** Get the size of 'mem' in bytes. */ size_t mem_size() const { return (mem_pfx.m_size); } bool madvise_dump(); bool madvise_dont_dump(); bool contains(const buf_block_t *ptr) const { return std::less_equal<const buf_block_t *>{}(blocks, ptr) && std::less<const buf_block_t *>{}(ptr, blocks + size); } };
Next, look at buf_page_t:
class buf_page_t { public: /** Copy constructor. @param[in] other Instance to copy from. */ buf_page_t(const buf_page_t &other) : id(other.id), size(other.size), buf_fix_count(other.buf_fix_count), io_fix(other.io_fix), state(other.state), flush_type(other.flush_type), buf_pool_index(other.buf_pool_index), #ifndef UNIV_HOTBACKUP hash(other.hash), #endif /* !UNIV_HOTBACKUP */ list(other.list), newest_modification(other.newest_modification), oldest_modification(other.oldest_modification), LRU(other.LRU), zip(other.zip) #ifndef UNIV_HOTBACKUP , m_flush_observer(other.m_flush_observer), m_space(other.m_space), freed_page_clock(other.freed_page_clock), access_time(other.access_time), m_version(other.m_version), m_dblwr_id(other.m_dblwr_id), old(other.old) #ifdef UNIV_DEBUG , file_page_was_freed(other.file_page_was_freed), in_flush_list(other.in_flush_list), in_free_list(other.in_free_list), in_LRU_list(other.in_LRU_list), in_page_hash(other.in_page_hash), in_zip_hash(other.in_zip_hash) #endif /* UNIV_DEBUG */ #endif /* !UNIV_HOTBACKUP */ { #ifndef UNIV_HOTBACKUP m_space->inc_ref(); #endif /* !UNIV_HOTBACKUP * / }
And one more, buf_block_t:
struct buf_block_t { /** @name General fields */ /** @{ */ /** page information; this must be the first field, so that buf_pool->page_hash can point to buf_page_t or buf_block_t */ buf_page_t page; #ifndef UNIV_HOTBACKUP /** read-write lock of the buffer frame */ BPageLock lock; #endif /* UNIV_HOTBACKUP */ /** pointer to buffer frame which is of size UNIV_PAGE_SIZE, and aligned to an address divisible by UNIV_PAGE_SIZE */ byte *frame; /** node of the decompressed LRU list; a block is in the unzip_LRU list if page.state == BUF_BLOCK_FILE_PAGE and page.zip.data != NULL. Protected by both LRU_list_mutex and the block mutex. */ UT_LIST_NODE_T(buf_block_t) unzip_LRU; #ifdef UNIV_DEBUG /** TRUE if the page is in the decompressed LRU list; used in debugging */ bool in_unzip_LRU_list; bool in_withdraw_list; #endif /* UNIV_DEBUG */ /** hashed value of the page address in the record lock hash table; protected by buf_block_t::lock (or buf_block_t::mutex in buf_page_get_gen(), buf_page_init_for_read() and buf_page_create()) */ uint32_t lock_hash_val; /** @} */ /** @name Hash search fields (unprotected) NOTE that these fields are NOT protected by any semaphore! */ /** @{ */ /** Counter which controls building of a new hash index for the page */ uint32_t n_hash_helps; /** Recommended prefix length for hash search: number of bytes in an incomplete last field */ volatile uint32_t n_bytes; /** Recommended prefix length for hash search: number of full fields */ volatile uint32_t n_fields; /** true or false, depending on whether the leftmost record of several records with the same prefix should be indexed in the hash index */ volatile bool left_side; /** @} */ /** @name Hash search fields These 5 fields may only be modified when: we are holding the appropriate x-latch in btr_search_latches[], and one of the following holds: (1) the block state is BUF_BLOCK_FILE_PAGE, and we are holding an s-latch or x-latch on buf_block_t::lock, or (2) buf_block_t::buf_fix_count == 0, or (3) the block state is BUF_BLOCK_REMOVE_HASH. An exception to this is when we init or create a page in the buffer pool in buf0buf.cc. Another exception for buf_pool_clear_hash_index() is that assigning block->index = NULL (and block->n_pointers = 0) is allowed whenever btr_search_own_all(RW_LOCK_X). Another exception is that ha_insert_for_fold_func() may decrement n_pointers without holding the appropriate latch in btr_search_latches[]. Thus, n_pointers must be protected by atomic memory access. This implies that the fields may be read without race condition whenever any of the following hold: - the btr_search_latches[] s-latch or x-latch is being held, or - the block state is not BUF_BLOCK_FILE_PAGE or BUF_BLOCK_REMOVE_HASH, and holding some latch prevents the state from changing to that. Some use of assert_block_ahi_empty() or assert_block_ahi_valid() is prone to race conditions while buf_pool_clear_hash_index() is executing (the adaptive hash index is being disabled). Such use is explicitly commented. */ /** @{ */ #if defined UNIV_AHI_DEBUG || defined UNIV_DEBUG /** used in debugging: the number of pointers in the adaptive hash index pointing to this frame; protected by atomic memory access or btr_search_own_all(). 
*/ std::atomic<ulint> n_pointers; #define assert_block_ahi_empty(block) ut_a((block)->n_pointers.load() == 0) #define assert_block_ahi_empty_on_init(block) \ do { \ UNIV_MEM_VALID(&(block)->n_pointers, sizeof(block)->n_pointers); \ assert_block_ahi_empty(block); \ } while (0) #define assert_block_ahi_valid(block) \ ut_a((block)->index || (block)->n_pointers.load() == 0) #else /* UNIV_AHI_DEBUG || UNIV_DEBUG */ #define assert_block_ahi_empty(block) /* nothing */ #define assert_block_ahi_empty_on_init(block) /* nothing */ #define assert_block_ahi_valid(block) /* nothing */ #endif /* UNIV_AHI_DEBUG || UNIV_DEBUG */ /** prefix length for hash indexing: number of full fields */ uint16_t curr_n_fields; /** number of bytes in hash indexing */ uint16_t curr_n_bytes; /** TRUE or FALSE in hash indexing */ bool curr_left_side; /** true if block has been made dirty without acquiring X/SX latch as the block belongs to temporary tablespace and block is always accessed by a single thread. */ bool made_dirty_with_no_latch; /** Index for which the adaptive hash index has been created, or NULL if the page does not exist in the index. Note that it does not guarantee that the index is complete, though: there may have been hash collisions, record deletions, etc. */ dict_index_t *index; /** @} */ #ifndef UNIV_HOTBACKUP #ifdef UNIV_DEBUG /** @name Debug fields */ /** @{ */ /** In the debug version, each thread which bufferfixes the block acquires an s-latch here; so we can use the debug utilities in sync0rw */ rw_lock_t debug_latch; /** @} */ #endif /* UNIV_DEBUG */ #endif /* !UNIV_HOTBACKUP */ /** @name Optimistic search field */ /** @{ */ /** This clock is incremented every time a pointer to a record on the page may become obsolete; this is used in the optimistic cursor positioning: if the modify clock has not changed, we know that the pointer is still valid; this field may be changed if the thread (1) owns the LRU list mutex and the page is not bufferfixed, or (2) the thread has an x-latch on the block, or (3) the block must belong to an intrinsic table */ uint64_t modify_clock; /** @} */ /** mutex protecting this block: state (also protected by the buffer pool mutex), io_fix, buf_fix_count, and accessed; we introduce this new mutex in InnoDB-5.1 to relieve contention on the buffer pool mutex */ BPageMutex mutex; /** Get the page number and space id of the current buffer block. @return page number of the current buffer block. */ const page_id_t &get_page_id() const { return page.id; } /** Get the page number of the current buffer block. @return page number of the current buffer block. */ page_no_t get_page_no() const { return (page.id.page_no()); } /** Get the next page number of the current buffer block. @return next page number of the current buffer block. */ page_no_t get_next_page_no() const { return (mach_read_from_4(frame + FIL_PAGE_NEXT)); } /** Get the prev page number of the current buffer block. @return prev page number of the current buffer block. */ page_no_t get_prev_page_no() const { return (mach_read_from_4(frame + FIL_PAGE_PREV)); } /** Get the page type of the current buffer block. @return page type of the current buffer block. */ page_type_t get_page_type() const { return (mach_read_from_2(frame + FIL_PAGE_TYPE)); } /** Get the page type of the current buffer block as string. @return page type of the current buffer block as string. */ const char *get_page_type_str() const noexcept MY_ATTRIBUTE((warn_unused_result)); }
In short, the buffer pool manages its pages through buf_chunk_t: each chunk holds an array of buf_block_t control blocks, and each buf_block_t embeds a buf_page_t as its very first member, so a pointer to a block can be used interchangeably with a pointer to its page. To put it more plainly, it is like an enterprise: there is the company, under the company there are departments, and under a department there are people; even a department with only one person is still a department. The sketch below makes this layout concrete.
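Here is a deliberately simplified, hypothetical C++ model of that relationship (the names Page, Block, Chunk, Instance and kPageSize are stand-ins of my own, not the InnoDB types): an instance owns chunks, each chunk owns an array of block descriptors plus the frames they point into, and because the page descriptor is the first member of the block descriptor, the two share an address, which is what allows page_hash to store either a buf_page_t* or a buf_block_t*.

// Minimal sketch, NOT InnoDB code: a toy model of instance -> chunk -> block -> page.
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kPageSize = 16 * 1024;   // plays the role of UNIV_PAGE_SIZE

struct Page {                  // stands in for buf_page_t
  unsigned space_id = 0;
  unsigned page_no = 0;
};

struct Block {                 // stands in for buf_block_t
  Page page;                   // first member, exactly as in buf_block_t
  unsigned char *frame = nullptr;  // points into the chunk's memory area
};

struct Chunk {                 // stands in for buf_chunk_t
  std::vector<unsigned char> mem;  // the page frames live here
  std::vector<Block> blocks;       // one control block per frame

  explicit Chunk(std::size_t n_pages)
      : mem(n_pages * kPageSize), blocks(n_pages) {
    for (std::size_t i = 0; i < n_pages; ++i) {
      blocks[i].frame = mem.data() + i * kPageSize;
    }
  }
};

struct Instance {              // stands in for buf_pool_t
  std::vector<Chunk> chunks;
};

int main() {
  Instance pool;
  pool.chunks.emplace_back(8);              // one chunk of 8 pages, for illustration
  Block &block = pool.chunks[0].blocks[0];
  // Because Page is the first member, the block's address is the page's address.
  assert(static_cast<void *>(&block.page) == static_cast<void *>(&block));
  return 0;
}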
Now look at the Change Buffer, which is used to reduce the I/O caused by updates:
/** Default value for maximum on-disk size of change buffer in terms of percentage of the buffer pool. */ #define CHANGE_BUFFER_DEFAULT_SIZE (25) #ifndef UNIV_HOTBACKUP /* Possible operations buffered in the insert/whatever buffer. See ibuf_insert(). DO NOT CHANGE THE VALUES OF THESE, THEY ARE STORED ON DISK. */ typedef enum { IBUF_OP_INSERT = 0, IBUF_OP_DELETE_MARK = 1, IBUF_OP_DELETE = 2, /* Number of different operation types. */ IBUF_OP_COUNT = 3 } ibuf_op_t; /** Combinations of operations that can be buffered. @see innodb_change_buffering_names */ enum ibuf_use_t { IBUF_USE_NONE = 0, IBUF_USE_INSERT, /* insert */ IBUF_USE_DELETE_MARK, /* delete */ IBUF_USE_INSERT_DELETE_MARK, /* insert+delete */ IBUF_USE_DELETE, /* delete+purge */ IBUF_USE_ALL /* insert+delete+purge */ }; /** Operations that can currently be buffered. */ extern ulong innodb_change_buffering; / ** The insert buffer control structure * / extern ibuf_t * ibuf; /** Insert buffer struct */ struct ibuf_t { ulint size; /*!< current size of the ibuf index tree, in pages */ ulint max_size; /*!< recommended maximum size of the ibuf index tree, in pages */ ulint seg_size; /*!< allocated pages of the file segment containing ibuf header and tree */ bool empty; /*!< Protected by the page latch of the root page of the insert buffer tree (FSP_IBUF_TREE_ROOT_PAGE_NO). true if and only if the insert buffer tree is empty. */ ulint free_list_len; /*!< length of the free list */ ulint height; /*!< tree height */ dict_index_t *index; /*!< insert buffer index */ std::atomic<ulint> n_merges; /*!< number of pages merged */ std::atomic<ulint> n_merged_ops[IBUF_OP_COUNT]; /*!< number of operations of each type merged to index pages */ std::atomic<ulint> n_discarded_ops[IBUF_OP_COUNT]; /*!< number of operations of each type discarded without merging due to the tablespace being deleted or the index being dropped * / };
Next come the related macro definitions, including those for the insert buffer bitmap:
/** @name Offsets to the per-page bits in the insert buffer bitmap */ /** @{ */ #define IBUF_BITMAP_FREE \ 0 /*!< Bits indicating the \ amount of free space */ #define IBUF_BITMAP_BUFFERED \ 2 /*!< TRUE if there are buffered \ changes for the page */ #define IBUF_BITMAP_IBUF \ 3 /*!< TRUE if page is a part of \ the ibuf tree, excluding the \ root page, or is in the free \ list of the ibuf */ /** @} */ #define IBUF_REC_FIELD_SPACE \ 0 /*!< in the pre-4.1 format, \ the page number. later, the space_id */ #define IBUF_REC_FIELD_MARKER \ 1 /*!< starting with 4.1, a marker \ consisting of 1 byte that is 0 */ #define IBUF_REC_FIELD_PAGE \ 2 /*!< starting with 4.1, the \ page number */ #define IBUF_REC_FIELD_METADATA 3 /* the metadata field */ #define IBUF_REC_FIELD_USER 4 /* first user field */ /* Various constants for checking the type of an ibuf record and extracting data from it. For details, see the description of the record format at the top of this file. */ /** @name Format of the IBUF_REC_FIELD_METADATA of an insert buffer record The fourth column in the MySQL 5.5 format contains an operation type, counter, and some flags. */ #define IBUF_REC_INFO_SIZE \ 4 /*!< Combined size of info fields at \ the beginning of the fourth field */ #if IBUF_REC_INFO_SIZE >= DATA_NEW_ORDER_NULL_TYPE_BUF_SIZE #error "IBUF_REC_INFO_SIZE >= DATA_NEW_ORDER_NULL_TYPE_BUF_SIZE" #endif /* Offsets for the fields at the beginning of the fourth field */ #define IBUF_REC_OFFSET_COUNTER 0 /*!< Operation counter */ #define IBUF_REC_OFFSET_TYPE 2 /*!< Type of operation */ #define IBUF_REC_OFFSET_FLAGS 3 /*!< Additional flags */ /* Record flag masks */ #define IBUF_REC_COMPACT \ 0x1 /*!< Set in \ IBUF_REC_OFFSET_FLAGS if the \ user index is in COMPACT \ format or later */ /** The mutex used to block pessimistic inserts to ibuf trees */ static ib_mutex_t ibuf_pessimistic_insert_mutex; /** The mutex protecting the insert buffer structs */ static ib_mutex_t ibuf_mutex; /** The mutex protecting the insert buffer bitmaps */ static ib_mutex_t ibuf_bitmap_mutex; /** The area in pages from which contract looks for page numbers for merge */ const ulint IBUF_MERGE_AREA = 8; /** Inside the merge area, pages which have at most 1 per this number less buffered entries compared to maximum volume that can buffered for a single page are merged along with the page whose buffer became full */ const ulint IBUF_MERGE_THRESHOLD = 4; /** In ibuf_contract at most this number of pages is read to memory in one batch, in order to merge the entries for them in the insert buffer */ const ulint IBUF_MAX_N_PAGES_MERGED = IBUF_MERGE_AREA; /** If the combined size of the ibuf trees exceeds ibuf->max_size by this many pages, we start to contract it in connection to inserts there, using non-synchronous contract */ const ulint IBUF_CONTRACT_ON_INSERT_NON_SYNC = 0; /** If the combined size of the ibuf trees exceeds ibuf->max_size by this many pages, we start to contract it in connection to inserts there, using synchronous contract */ const ulint IBUF_CONTRACT_ON_INSERT_SYNC = 5; /** If the combined size of the ibuf trees exceeds ibuf->max_size by this many pages, we start to contract it synchronous contract, but do not insert */ const ulint IBUF_CONTRACT_DO_NOT_INSERT = 10;
Note that when the Change Buffer is used is configurable: it can be enabled or disabled, and restricted to particular operation types, via the settings reflected in ibuf_use_t above.
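As a rough illustration of what those modes mean, here is a hedged sketch of my own (may_buffer is a made-up helper, not the InnoDB implementation) that maps the ibuf_use_t settings to the operation types that may be buffered, following the comments on the enum above; the real checks live in ibuf0ibuf.cc and are more involved.

// Hypothetical sketch: which ibuf_op_t may be buffered under which setting.
#include <cassert>

enum ibuf_op_t { IBUF_OP_INSERT = 0, IBUF_OP_DELETE_MARK = 1, IBUF_OP_DELETE = 2 };

enum ibuf_use_t {
  IBUF_USE_NONE = 0,
  IBUF_USE_INSERT,              // "insert"
  IBUF_USE_DELETE_MARK,         // "delete"
  IBUF_USE_INSERT_DELETE_MARK,  // "insert+delete"
  IBUF_USE_DELETE,              // "delete+purge"
  IBUF_USE_ALL                  // "insert+delete+purge"
};

static bool may_buffer(ibuf_use_t use, ibuf_op_t op) {
  switch (op) {
    case IBUF_OP_INSERT:        // plain inserts
      return use == IBUF_USE_INSERT || use == IBUF_USE_INSERT_DELETE_MARK ||
             use == IBUF_USE_ALL;
    case IBUF_OP_DELETE_MARK:   // delete-marking a record
      return use == IBUF_USE_DELETE_MARK || use == IBUF_USE_INSERT_DELETE_MARK ||
             use == IBUF_USE_DELETE || use == IBUF_USE_ALL;
    case IBUF_OP_DELETE:        // purging a delete-marked record
      return use == IBUF_USE_DELETE || use == IBUF_USE_ALL;
  }
  return false;
}

int main() {
  assert(!may_buffer(IBUF_USE_NONE, IBUF_OP_INSERT));
  assert(may_buffer(IBUF_USE_ALL, IBUF_OP_DELETE));
  assert(!may_buffer(IBUF_USE_INSERT, IBUF_OP_DELETE_MARK));
  return 0;
}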
Having covered these two important buffers, now turn to the hash-related structures:
/* The hash table structure */ struct hash_table_t { enum hash_table_sync_t type; /*!< type of hash_table. */ #if defined UNIV_AHI_DEBUG || defined UNIV_DEBUG #ifndef UNIV_HOTBACKUP ibool adaptive; /* TRUE if this is the hash table of the adaptive hash index */ #endif /* !UNIV_HOTBACKUP */ #endif /* UNIV_AHI_DEBUG || UNIV_DEBUG */ ulint n_cells; /* number of cells in the hash table */ hash_cell_t *cells; /*!< pointer to cell array */ #ifndef UNIV_HOTBACKUP ulint n_sync_obj; /* if sync_objs != NULL, then the number of either the number of mutexes or the number of rw_locks depending on the type. Must be a power of 2 */ union { ib_mutex_t *mutexes; /* NULL, or an array of mutexes used to protect segments of the hash table */ rw_lock_t *rw_locks; /* NULL, or an array of rw_lcoks used to protect segments of the hash table */ } sync_obj; mem_heap_t **heaps; /*!< if this is non-NULL, hash chain nodes for external chaining can be allocated from these memory heaps; there are then n_mutexes many of these heaps */ #endif /* !UNIV_HOTBACKUP */ mem_heap_t *heap; #ifdef UNIV_DEBUG ulint magic_n; #define HASH_TABLE_MAGIC_N 76561114 #endif /* UNIV_DEBUG * / }
On top of it there is one more wrapper data structure:
/** The hash index system */
struct btr_search_sys_t {
hash_table_t **hash_tables; /*!< the adaptive hash tables,
mapping dtuple_fold values
to rec_t pointers on index pages */
};
In addition, to decide when the AHI should be built while data is being processed, there is another data structure that tracks the trigger conditions:
/** The search info struct in an index */ struct btr_search_t { ulint ref_count; /*!< Number of blocks in this index tree that have search index built i.e. block->index points to this index. Protected by search latch except when during initialization in btr_search_info_create(). */ /** @{ The following fields are not protected by any latch. Unfortunately, this means that they must be aligned to the machine word, i.e., they cannot be turned into bit-fields. */ buf_block_t *root_guess; /*!< the root page frame when it was last time fetched, or NULL */ ulint hash_analysis; /*!< when this exceeds BTR_SEARCH_HASH_ANALYSIS, the hash analysis starts; this is reset if no success noticed */ ibool last_hash_succ; /*!< TRUE if the last search would have succeeded, or did succeed, using the hash index; NOTE that the value here is not exact: it is not calculated for every search, and the calculation itself is not always accurate! */ ulint n_hash_potential; /*!< number of consecutive searches which would have succeeded, or did succeed, using the hash index; the range is 0 .. BTR_SEARCH_BUILD_LIMIT + 5 */ /** @} */ /**---------------------- @{ */ ulint n_fields; /*!< recommended prefix length for hash search: number of full fields */ ulint n_bytes; /*!< recommended prefix: number of bytes in an incomplete field @see BTR_PAGE_MAX_REC_SIZE */ ibool left_side; /*!< TRUE or FALSE, depending on whether the leftmost record of several records with the same prefix should be indexed in the hash index */ /*---------------------- @} */ #ifdef UNIV_SEARCH_PERF_STAT ulint n_hash_succ; /*!< number of successful hash searches thus far */ ulint n_hash_fail; /*!< number of failed hash searches */ ulint n_patt_succ; /*!< number of successful pattern searches thus far */ ulint n_searches; /*!< number of searches */ #endif /* UNIV_SEARCH_PERF_STAT */ #ifdef UNIV_DEBUG ulint magic_n; /*!< magic number @see BTR_SEARCH_MAGIC_N */ /** value of btr_search_t::magic_n, used in assertions */ #define BTR_SEARCH_MAGIC_N 1112765 #endif /* UNIV_DEBUG * / };
Put another way: searches are first tracked through btr_search_t, and once the AHI is triggered (the hash analysis only starts after a certain number of searches, 17), btr_search_sys_t is used to maintain the entries in the hash_table_t structure.
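The following is a hedged sketch of that warm-up heuristic (my own simplification, not btr0sea.cc; SearchInfo, note_search and the constant values are assumptions): analysis only begins after hash_analysis passes its threshold, and the index is only built after enough consecutive searches that a hash index would have answered.

// Simplified sketch of the AHI trigger heuristic suggested by btr_search_t.
// The real decision also looks at the recommended prefix (n_fields / n_bytes / left_side).
constexpr unsigned kHashAnalysisThreshold = 17;  // assumed BTR_SEARCH_HASH_ANALYSIS
constexpr unsigned kBuildLimit = 16;             // assumed BTR_SEARCH_BUILD_LIMIT

struct SearchInfo {               // stands in for btr_search_t
  unsigned hash_analysis = 0;     // searches seen since the last reset
  unsigned n_hash_potential = 0;  // consecutive searches a hash index would serve
};

// Record one search; returns true when building the hash index looks worthwhile.
static bool note_search(SearchInfo &info, bool hash_would_have_succeeded) {
  if (++info.hash_analysis <= kHashAnalysisThreshold) {
    return false;                 // still warming up, no analysis yet
  }
  if (hash_would_have_succeeded) {
    ++info.n_hash_potential;
  } else {
    info.n_hash_potential = 0;    // the streak is broken, start over
  }
  return info.n_hash_potential > kBuildLimit;
}

int main() {
  SearchInfo info;
  bool build = false;
  for (int i = 0; i < 64 && !build; ++i) {
    build = note_search(info, true);  // a run of hash-friendly searches
  }
  return build ? 0 : 1;
}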
The redo log is an extremely important logging subsystem, and it likewise uses a log buffer. Look at the related data definitions:
/** Logging modes for a mini-transaction */ enum mtr_log_t { /** Default mode: log all operations modifying disk-based data */ MTR_LOG_ALL = 0, /** Log no operations and dirty pages are not added to the flush list */ MTR_LOG_NONE = 1, /** Don't generate REDO log but add dirty pages to flush list */ MTR_LOG_NO_REDO = 2, /** Inserts are logged in a shorter form */ MTR_LOG_SHORT_INSERTS = 3, /** Last element */ MTR_LOG_MODE_MAX = 4 }; /** @name Log item types The log items are declared 'byte' so that the compiler can warn if val and type parameters are switched in a call to mlog_write_ulint. NOTE! For 1 - 8 bytes, the flag value must give the length also! @{ */ enum mlog_id_t { /** if the mtr contains only one log record for one page, i.e., write_initial_log_record has been called only once, this flag is ORed to the type of that first log record */ MLOG_SINGLE_REC_FLAG = 128, /** one byte is written */ MLOG_1BYTE = 1, /** 2 bytes ... */ MLOG_2BYTES = 2, /** 4 bytes ... */ MLOG_4BYTES = 4, /** 8 bytes ... */ MLOG_8BYTES = 8, /** Record insert */ MLOG_REC_INSERT = 9, /** Mark clustered index record deleted */ MLOG_REC_CLUST_DELETE_MARK = 10, /** Mark secondary index record deleted */ MLOG_REC_SEC_DELETE_MARK = 11, /** update of a record, preserves record field sizes */ MLOG_REC_UPDATE_IN_PLACE = 13, /*!< Delete a record from a page */ MLOG_REC_DELETE = 14, /** Delete record list end on index page */ MLOG_LIST_END_DELETE = 15, /** Delete record list start on index page */ MLOG_LIST_START_DELETE = 16, /** Copy record list end to a new created index page */ MLOG_LIST_END_COPY_CREATED = 17, /** Reorganize an index page in ROW_FORMAT=REDUNDANT */ MLOG_PAGE_REORGANIZE = 18, /** Create an index page */ MLOG_PAGE_CREATE = 19, /** Insert entry in an undo log */ MLOG_UNDO_INSERT = 20, /** erase an undo log page end */ MLOG_UNDO_ERASE_END = 21, /** initialize a page in an undo log */ MLOG_UNDO_INIT = 22, /** reuse an insert undo log header */ MLOG_UNDO_HDR_REUSE = 24, /** create an undo log header */ MLOG_UNDO_HDR_CREATE = 25, /** mark an index record as the predefined minimum record */ MLOG_REC_MIN_MARK = 26, /** initialize an ibuf bitmap page */ MLOG_IBUF_BITMAP_INIT = 27, #ifdef UNIV_LOG_LSN_DEBUG /** Current LSN */ MLOG_LSN = 28, #endif /* UNIV_LOG_LSN_DEBUG */ /** this means that a file page is taken into use and the prior contents of the page should be ignored: in recovery we must not trust the lsn values stored to the file page. Note: it's deprecated because it causes crash recovery problem in bulk create index, and actually we don't need to reset page lsn in recv_recover_page_func() now. 
*/ MLOG_INIT_FILE_PAGE = 29, /** write a string to a page */ MLOG_WRITE_STRING = 30, /** If a single mtr writes several log records, this log record ends the sequence of these records */ MLOG_MULTI_REC_END = 31, /** dummy log record used to pad a log block full */ MLOG_DUMMY_RECORD = 32, /** log record about creating an .ibd file, with format */ MLOG_FILE_CREATE = 33, /** rename a tablespace file that starts with (space_id,page_no) */ MLOG_FILE_RENAME = 34, /** delete a tablespace file that starts with (space_id,page_no) */ MLOG_FILE_DELETE = 35, /** mark a compact index record as the predefined minimum record */ MLOG_COMP_REC_MIN_MARK = 36, /** create a compact index page */ MLOG_COMP_PAGE_CREATE = 37, /** compact record insert */ MLOG_COMP_REC_INSERT = 38, /** mark compact clustered index record deleted */ MLOG_COMP_REC_CLUST_DELETE_MARK = 39, /** mark compact secondary index record deleted; this log record type is redundant, as MLOG_REC_SEC_DELETE_MARK is independent of the record format. */ MLOG_COMP_REC_SEC_DELETE_MARK = 40, /** update of a compact record, preserves record field sizes */ MLOG_COMP_REC_UPDATE_IN_PLACE = 41, /** delete a compact record from a page */ MLOG_COMP_REC_DELETE = 42, /** delete compact record list end on index page */ MLOG_COMP_LIST_END_DELETE = 43, /*** delete compact record list start on index page */ MLOG_COMP_LIST_START_DELETE = 44, /** copy compact record list end to a new created index page */ MLOG_COMP_LIST_END_COPY_CREATED = 45, /** reorganize an index page */ MLOG_COMP_PAGE_REORGANIZE = 46, /** write the node pointer of a record on a compressed non-leaf B-tree page */ MLOG_ZIP_WRITE_NODE_PTR = 48, /** write the BLOB pointer of an externally stored column on a compressed page */ MLOG_ZIP_WRITE_BLOB_PTR = 49, /** write to compressed page header */ MLOG_ZIP_WRITE_HEADER = 50, /** compress an index page */ MLOG_ZIP_PAGE_COMPRESS = 51, /** compress an index page without logging it's image */ MLOG_ZIP_PAGE_COMPRESS_NO_DATA = 52, /** reorganize a compressed page */ MLOG_ZIP_PAGE_REORGANIZE = 53, /** Create a R-Tree index page */ MLOG_PAGE_CREATE_RTREE = 57, /** create a R-tree compact page */ MLOG_COMP_PAGE_CREATE_RTREE = 58, /** this means that a file page is taken into use. We use it to replace MLOG_INIT_FILE_PAGE. */ MLOG_INIT_FILE_PAGE2 = 59, /** Table is being truncated. (Marked only for file-per-table) */ /* MLOG_TRUNCATE = 60, Disabled for WL6378 */ /** notify that an index tree is being loaded without writing redo log about individual pages */ MLOG_INDEX_LOAD = 61, /** log for some persistent dynamic metadata change */ MLOG_TABLE_DYNAMIC_META = 62, /** create a SDI index page */ MLOG_PAGE_CREATE_SDI = 63, /** create a SDI compact page */ MLOG_COMP_PAGE_CREATE_SDI = 64, /** Extend the space */ MLOG_FILE_EXTEND = 65, /** Used in tests of redo log. It must never be used outside unit tests. */ MLOG_TEST = 66, /** biggest value (used in assertions) * / MLOG_BIGGEST_TYPE = MLOG_TEST }
From the enumeration above you can see that there are more than sixty log record types, so there is no need to analyze every single one; understand one of them thoroughly and the rest basically follow the same pattern.
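One detail worth calling out from the enum is MLOG_SINGLE_REC_FLAG: it is ORed into the type byte when the mtr contains only a single record, so the low bits carry the actual mlog_id_t value. A hedged sketch of decoding that byte (an illustration of my own, not the recovery parser; parse_type_byte is a made-up helper):

#include <cstdint>
#include <cstdio>

constexpr uint8_t MLOG_SINGLE_REC_FLAG = 128;

struct parsed_type {
  uint8_t type;        // one of the mlog_id_t values
  bool single_record;  // true if the mtr contained exactly one record
};

static parsed_type parse_type_byte(uint8_t byte) {
  parsed_type out;
  out.single_record = (byte & MLOG_SINGLE_REC_FLAG) != 0;
  out.type = byte & static_cast<uint8_t>(~MLOG_SINGLE_REC_FLAG);
  return out;
}

int main() {
  // 38 == MLOG_COMP_REC_INSERT in the enum above.
  parsed_type t = parse_type_byte(38 | MLOG_SINGLE_REC_FLAG);
  std::printf("type=%u single=%d\n", t.type, t.single_record);
  return 0;
}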
The header file log0types.h defines a large number of log-related data structures; only part of them is shown here:
typedef uint64_t lsn_t; /** Print format for lsn_t values, used in functions like printf. */ #define LSN_PF UINT64PF /** Alias for atomic based on lsn_t. */ using atomic_lsn_t = std::atomic<lsn_t>; /** Type used for sn values, which enumerate bytes of data stored in the log. Note that these values skip bytes of headers and footers of log blocks. */ typedef uint64_t sn_t; /** Alias for atomic based on sn_t. */ using atomic_sn_t = std::atomic<sn_t>; /** Type used for checkpoint numbers (consecutive checkpoints receive a number which is increased by one). */ typedef uint64_t checkpoint_no_t; /** Type used for counters in log_t: flushes_requested and flushes_expected. They represent number of requests to flush the redo log to disk. */ typedef std::atomic<int64_t> log_flushes_t; /** Function used to calculate checksums of log blocks. */ typedef std::atomic<uint32_t (*)(const byte *log_block)> log_checksum_func_t; /** Clock used to measure time spent in redo log (e.g. when flushing). */ using Log_clock = std::chrono::high_resolution_clock; /** Time point defined by the Log_clock. */ using Log_clock_point = std::chrono::time_point<Log_clock>; /** Supported redo log formats. Stored in LOG_HEADER_FORMAT. */ enum log_header_format_t { /** The MySQL 5.7.9 redo log format identifier. We can support recovery from this format if the redo log is clean (logically empty). */ LOG_HEADER_FORMAT_5_7_9 = 1, /** Remove MLOG_FILE_NAME and MLOG_CHECKPOINT, introduce MLOG_FILE_OPEN redo log record. */ LOG_HEADER_FORMAT_8_0_1 = 2, /** Allow checkpoint_lsn to point any data byte within redo log (before it had to point the beginning of a group of log records). */ LOG_HEADER_FORMAT_8_0_3 = 3, /** Expand ulint compressed form. */ LOG_HEADER_FORMAT_8_0_19 = 4, /** The redo log format identifier corresponding to the current format version. */ LOG_HEADER_FORMAT_CURRENT = LOG_HEADER_FORMAT_8_0_19 }; /** The state of a log group */ enum class log_state_t { /** No corruption detected */ OK, /** Corrupted */ CORRUPTED }; /** The recovery implementation. */ struct redo_recover_t; struct Log_handle { lsn_t start_lsn; lsn_t end_lsn; }; /** Redo log - single data structure with state of the redo log system. In future, one could consider splitting this to multiple data structures. */ struct alignas(ut::INNODB_CACHE_LINE_SIZE) log_t { /**************************************************/ /** @name Users writing to log buffer *******************************************************/ /** @{ */ #ifndef UNIV_HOTBACKUP /** Event used for locking sn */ os_event_t sn_lock_event; #ifdef UNIV_PFS_RWLOCK /** The instrumentation hook */ struct PSI_rwlock *pfs_psi; #endif /* UNIV_PFS_RWLOCK */ #ifdef UNIV_DEBUG /** The rw_lock instance only for the debug info list */ /* NOTE: Just "rw_lock_t sn_lock_inst;" and direct minimum initialization seem to hit the bug of Sun Studio of Solaris. */ rw_lock_t *sn_lock_inst; #endif /* UNIV_DEBUG */ /** Current sn value. Used to reserve space in the redo log, and used to acquire an exclusive access to the log buffer. Represents number of data bytes that have ever been reserved. Bytes of headers and footers of log blocks are not included. Its highest bit is used for locking the access to the log buffer. */ MY_COMPILER_DIAGNOSTIC_PUSH() MY_COMPILER_CLANG_WORKAROUND_REF_DOCBUG() /** @see @ref subsect_redo_log_sn */ MY_COMPILER_DIAGNOSTIC_PUSH() alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_sn_t sn; /** Intended sn value while x-locked. 
*/ atomic_sn_t sn_locked; /** Mutex which can be used for x-lock sn value */ mutable ib_mutex_t sn_x_lock_mutex; /** Padding after the _sn to avoid false sharing issues for constants below (due to changes of sn). */ alignas(ut::INNODB_CACHE_LINE_SIZE) /** Pointer to the log buffer, aligned up to OS_FILE_LOG_BLOCK_SIZE. The alignment is to ensure that buffer parts specified for file IO write operations will be aligned to sector size, which is required e.g. on Windows when doing unbuffered file access. Protected by: locking sn not to add. */ aligned_array_pointer<byte, OS_FILE_LOG_BLOCK_SIZE> buf; /** Size of the log buffer expressed in number of data bytes, that is excluding bytes for headers and footers of log blocks. */ atomic_sn_t buf_size_sn; /** Size of the log buffer expressed in number of total bytes, that is including bytes for headers and footers of log blocks. */ size_t buf_size; alignas(ut::INNODB_CACHE_LINE_SIZE) /** The recent written buffer. Protected by: locking sn not to add. */ Link_buf<lsn_t> recent_written; /** Used for pausing the log writer threads. When paused, each user thread should write log as in the former version. */ std::atomic_bool writer_threads_paused; /** Some threads waiting for the ready for write lsn by closer_event. */ lsn_t current_ready_waiting_lsn; /** current_ready_waiting_lsn is waited using this sig_count. */ int64_t current_ready_waiting_sig_count; alignas(ut::INNODB_CACHE_LINE_SIZE) /** The recent closed buffer. Protected by: locking sn not to add. */ Link_buf<lsn_t> recent_closed; alignas(ut::INNODB_CACHE_LINE_SIZE) /** @} */ /**************************************************/ /** @name Users <=> writer *******************************************************/ /** @{ */ /** Maximum sn up to which there is free space in both the log buffer and the log files. This is limitation for the end of any write to the log buffer. Threads, which are limited need to wait, and possibly they hold latches of dirty pages making a deadlock possible. Protected by: writer_mutex (writes). */ atomic_sn_t buf_limit_sn; /** Up to this lsn, data has been written to disk (fsync not required). Protected by: writer_mutex (writes). */ MY_COMPILER_DIAGNOSTIC_PUSH() MY_COMPILER_CLANG_WORKAROUND_REF_DOCBUG() /* @see @ref subsect_redo_log_write_lsn */ MY_COMPILER_DIAGNOSTIC_POP() alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_lsn_t write_lsn; alignas(ut::INNODB_CACHE_LINE_SIZE) /** Unaligned pointer to array with events, which are used for notifications sent from the log write notifier thread to user threads. The notifications are sent when write_lsn is advanced. User threads wait for write_lsn >= lsn, for some lsn. Log writer advances the write_lsn and notifies the log write notifier, which notifies all users interested in nearby lsn values (lsn belonging to the same log block). Note that false wake-ups are possible, in which case user threads simply retry waiting. */ os_event_t *write_events; /** Number of entries in the array with writer_events. */ size_t write_events_size; /** Approx. number of requests to write/flush redo since startup. */ alignas(ut::INNODB_CACHE_LINE_SIZE) std::atomic<uint64_t> write_to_file_requests_total; /** How often redo write/flush is requested in average. Measures in microseconds. Log threads do not spin when the write/flush requests are not frequent. */ alignas(ut::INNODB_CACHE_LINE_SIZE) std::atomic<uint64_t> write_to_file_requests_interval; /** This padding is probably not needed, left for convenience. 
*/ alignas(ut::INNODB_CACHE_LINE_SIZE) /** @} */ /**************************************************/ /** @name Users <=> flusher *******************************************************/ /** @{ */ /** Unaligned pointer to array with events, which are used for notifications sent from the log flush notifier thread to user threads. The notifications are sent when flushed_to_disk_lsn is advanced. User threads wait for flushed_to_disk_lsn >= lsn, for some lsn. Log flusher advances the flushed_to_disk_lsn and notifies the log flush notifier, which notifies all users interested in nearby lsn values (lsn belonging to the same log block). Note that false wake-ups are possible, in which case user threads simply retry waiting. */ os_event_t *flush_events; /** Number of entries in the array with events. */ size_t flush_events_size; /** This event is in the reset state when a flush is running; a thread should wait for this without owning any of redo mutexes, but NOTE that to reset this event, the thread MUST own the writer_mutex */ os_event_t old_flush_event; /** Padding before the frequently updated flushed_to_disk_lsn. */ alignas(ut::INNODB_CACHE_LINE_SIZE) /** Up to this lsn data has been flushed to disk (fsynced). */ atomic_lsn_t flushed_to_disk_lsn; /** Padding after the frequently updated flushed_to_disk_lsn. */ alignas(ut::INNODB_CACHE_LINE_SIZE) /** @} */ /**************************************************/ /** @name Log flusher thread *******************************************************/ /** @{ */ /** Last flush start time. Updated just before fsync starts. */ Log_clock_point last_flush_start_time; /** Last flush end time. Updated just after fsync is finished. If smaller than start time, then flush operation is pending. */ Log_clock_point last_flush_end_time; /** Flushing average time (in microseconds). */ double flush_avg_time; /** Mutex which can be used to pause log flusher thread. */ mutable ib_mutex_t flusher_mutex; alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t flusher_event; /** Padding to avoid any dependency between the log flusher and the log writer threads. */ alignas(ut::INNODB_CACHE_LINE_SIZE) /** @} */ /**************************************************/ /** @name Log writer thread *******************************************************/ /** @{ */ /** Space id for pages with log blocks. */ space_id_t files_space_id; /** Size of buffer used for the write-ahead (in bytes). */ uint32_t write_ahead_buf_size; /** Aligned pointer to buffer used for the write-ahead. It is aligned to system page size (why?) and is currently limited by constant 64KB. */ aligned_array_pointer<byte, 64 * 1024> write_ahead_buf; /** Up to this file offset in the log files, the write-ahead has been done or is not required (for any other reason). */ uint64_t write_ahead_end_offset; /** Aligned buffers for file headers. */ aligned_array_pointer<byte, OS_FILE_LOG_BLOCK_SIZE> *file_header_bufs; #endif /* !UNIV_HOTBACKUP */ /** Some lsn value within the current log file. */ lsn_t current_file_lsn; /** File offset for the current_file_lsn. */ uint64_t current_file_real_offset; /** Up to this file offset we are within the same current log file. */ uint64_t current_file_end_offset; /** Number of performed IO operations (only for printing stats). */ uint64_t n_log_ios; /** Size of each single log file (expressed in bytes, including file header). */ uint64_t file_size; /** Number of log files. */ uint32_t n_files; /** Total capacity of all the log files (file_size * n_files), including headers of the log files. 
*/ uint64_t files_real_capacity; /** Capacity of redo log files for log writer thread. The log writer does not to exceed this value. If space is not reclaimed after 1 sec wait, it writes only as much as can fit the free space or crashes if there is no free space at all (checkpoint did not advance for 1 sec). */ lsn_t lsn_capacity_for_writer; /** When this margin is being used, the log writer decides to increase the concurrency_margin to stop new incoming mini-transactions earlier, on bigger margin. This is used to provide adaptive concurrency margin calculation, which we need because we might have unlimited thread concurrency setting or we could miss some log_free_check() calls. It is just best effort to help getting out of the troubles. */ lsn_t extra_margin; /** True if we haven't increased the concurrency_margin since we entered (lsn_capacity_for_margin_inc..lsn_capacity_for_writer] range. This allows to increase the margin only once per issue and wait until the issue becomes resolved, still having an option to increase margin even more, if new issue comes later. */ bool concurrency_margin_ok; /** Maximum allowed concurrency_margin. We never set higher, even when we increase the concurrency_margin in the adaptive solution. */ lsn_t max_concurrency_margin; #ifndef UNIV_HOTBACKUP /** Mutex which can be used to pause log writer thread. */ mutable ib_mutex_t writer_mutex; alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t writer_event; /** Padding after section for the log writer thread, to avoid any dependency between the log writer and the log closer threads. */ alignas(ut::INNODB_CACHE_LINE_SIZE) /** @} */ /**************************************************/ /** @name Log closer thread *******************************************************/ /** @{ */ /** Event used by the log closer thread to wait for tasks. */ os_event_t closer_event; /** Mutex which can be used to pause log closer thread. */ mutable ib_mutex_t closer_mutex; /** Padding after the log closer thread and before the memory used for communication between the log flusher and notifier threads. */ alignas(ut::INNODB_CACHE_LINE_SIZE) /** @} */ /**************************************************/ /** @name Log flusher <=> flush_notifier *******************************************************/ /** @{ */ /** Event used by the log flusher thread to notify the log flush notifier thread, that it should proceed with notifying user threads waiting for the advanced flushed_to_disk_lsn (because it has been advanced). */ os_event_t flush_notifier_event; /** The next flushed_to_disk_lsn can be waited using this sig_count. */ int64_t current_flush_sig_count; /** Mutex which can be used to pause log flush notifier thread. */ mutable ib_mutex_t flush_notifier_mutex; /** Padding. */ alignas(ut::INNODB_CACHE_LINE_SIZE) /** @} */ /**************************************************/ /** @name Log writer <=> write_notifier *******************************************************/ /** @{ */ /** Mutex which can be used to pause log write notifier thread. */ mutable ib_mutex_t write_notifier_mutex; alignas(ut::INNODB_CACHE_LINE_SIZE) /** Event used by the log writer thread to notify the log write notifier thread, that it should proceed with notifying user threads waiting for the advanced write_lsn (because it has been advanced). 
*/ os_event_t write_notifier_event; alignas(ut::INNODB_CACHE_LINE_SIZE) /** @} */ /**************************************************/ /** @name Maintenance *******************************************************/ /** @{ */ /** Used for stopping the log background threads. */ std::atomic_bool should_stop_threads; /** Event used for pausing the log writer threads. */ os_event_t writer_threads_resume_event; /** Used for resuming write notifier thread */ atomic_lsn_t write_notifier_resume_lsn; /** Used for resuming flush notifier thread */ atomic_lsn_t flush_notifier_resume_lsn; /** Number of total I/O operations performed when we printed the statistics last time. */ mutable uint64_t n_log_ios_old; /** Wall time when we printed the statistics last time. */ mutable time_t last_printout_time; /** @} */ /**************************************************/ /** @name Recovery *******************************************************/ /** @{ */ /** Lsn from which recovery has been started. */ lsn_t recovered_lsn; /** Format of the redo log: e.g., LOG_HEADER_FORMAT_CURRENT. */ uint32_t format; /** Corruption status. */ log_state_t state; /** Used only in recovery: recovery scan succeeded up to this lsn. */ lsn_t scanned_lsn; #ifdef UNIV_DEBUG /** When this is set, writing to the redo log should be disabled. We check for this in functions that write to the redo log. */ bool disable_redo_writes; /** DEBUG only - if we copied or initialized the first block in buffer, this is set to lsn for which we did that. We later ensure that we start the redo log at the same lsn. Else it is zero and we would crash when trying to start redo then. */ lsn_t first_block_is_correct_for_lsn; #endif /* UNIV_DEBUG */ alignas(ut::INNODB_CACHE_LINE_SIZE) /** @} */ /**************************************************/ /** @name Fields protected by the log_limits mutex. Related to free space in the redo log. *******************************************************/ /** @{ */ /** Mutex which protects fields: available_for_checkpoint_lsn, requested_checkpoint_lsn. It also synchronizes updates of: free_check_limit_sn, concurrency_margin and dict_persist_margin. It also protects the srv_checkpoint_disabled (together with the checkpointer_mutex). */ mutable ib_mutex_t limits_mutex; /** A new checkpoint could be written for this lsn value. Up to this lsn value, all dirty pages have been added to flush lists and flushed. Updated in the log checkpointer thread by taking minimum oldest_modification out of the last dirty pages from each flush list. However it will not be bigger than the current value of log.buf_dirty_pages_added_up_to_lsn. Read by: user threads when requesting fuzzy checkpoint Read by: log_print() (printing status of redo) Updated by: log_checkpointer Protected by: limits_mutex. */ MY_COMPILER_DIAGNOSTIC_PUSH() MY_COMPILER_CLANG_WORKAROUND_REF_DOCBUG() /** @see @ref subsect_redo_log_available_for_checkpoint_lsn */ MY_COMPILER_DIAGNOSTIC_POP() lsn_t available_for_checkpoint_lsn; /** When this is larger than the latest checkpoint, the log checkpointer thread will be forced to write a new checkpoint (unless the new latest checkpoint lsn would still be smaller than this value). Read by: log_checkpointer Updated by: user threads (log_free_check() or for sharp checkpoint) Protected by: limits_mutex. */ lsn_t requested_checkpoint_lsn; /** Maximum lsn allowed for checkpoint by dict_persist or zero. This will be set by dict_persist_to_dd_table_buffer(), which should be always called before really making a checkpoint. 
If non-zero, up to this lsn value, dynamic metadata changes have been written back to mysql.innodb_dynamic_metadata under dict_persist->mutex protection. All dynamic metadata changes after this lsn have to be kept in redo logs, but not discarded. If zero, just ignore it. Updated by: DD (when persisting dynamic meta data) Updated by: log_checkpointer (reset when checkpoint is written) Protected by: limits_mutex. */ lsn_t dict_max_allowed_checkpoint_lsn; /** If should perform checkpoints every innodb_log_checkpoint_every ms. Disabled during startup / shutdown. Enabled in srv_start_threads. Updated by: starting thread (srv_start_threads) Read by: log_checkpointer */ bool periodical_checkpoints_enabled; /** Maximum sn up to which there is free space in the redo log. Threads check this limit and compare to current log.sn, when they are outside mini-transactions and hold no latches. The formula used to compute the limitation takes into account maximum size of mtr and thread concurrency to include proper margins and avoid issues with race condition (in which all threads check the limitation and then all proceed with their mini-transactions). Also extra margin is there for dd table buffer cache (dict_persist_margin). Read by: user threads (log_free_check()) Updated by: log_checkpointer (after update of checkpoint_lsn) Updated by: log_writer (after increasing concurrency_margin) Updated by: DD (after update of dict_persist_margin) Protected by (updates only): limits_mutex. */ atomic_sn_t free_check_limit_sn; /** Margin used in calculation of @see free_check_limit_sn. Read by: page_cleaners, log_checkpointer Updated by: log_writer Protected by (updates only): limits_mutex. */ atomic_sn_t concurrency_margin; /** Margin used in calculation of @see free_check_limit_sn. Read by: page_cleaners, log_checkpointer Updated by: DD Protected by (updates only): limits_mutex. */ atomic_sn_t dict_persist_margin; alignas(ut::INNODB_CACHE_LINE_SIZE) /** @} */ /**************************************************/ /** @name Log checkpointer thread *******************************************************/ /** @{ */ /** Event used by the log checkpointer thread to wait for requests. */ os_event_t checkpointer_event; /** Mutex which can be used to pause log checkpointer thread. This is used by log_position_lock() together with log_buffer_x_lock(), to pause any changes to current_lsn or last_checkpoint_lsn. */ mutable ib_mutex_t checkpointer_mutex; /** Latest checkpoint lsn. Read by: user threads, log_print (no protection) Read by: log_writer (under writer_mutex) Updated by: log_checkpointer (under both mutexes) Protected by (updates only): checkpointer_mutex + writer_mutex. */ MY_COMPILER_DIAGNOSTIC_PUSH() MY_COMPILER_CLANG_WORKAROUND_REF_DOCBUG() /** @see @ref subsect_redo_log_last_checkpoint_lsn */ MY_COMPILER_DIAGNOSTIC_POP() atomic_lsn_t last_checkpoint_lsn; /** Next checkpoint number. Read by: log_get_last_block (no protection) Read by: log_writer (under writer_mutex) Updated by: log_checkpointer (under both mutexes) Protected by: checkpoint_mutex + writer_mutex. */ std::atomic<checkpoint_no_t> next_checkpoint_no; /** Latest checkpoint wall time. Used by (private): log_checkpointer. */ Log_clock_point last_checkpoint_time; /** Aligned buffer used for writing a checkpoint header. It is aligned similarly to log.buf. 
Used by (private): log_checkpointer, recovery code */ aligned_array_pointer<byte, OS_FILE_LOG_BLOCK_SIZE> checkpoint_buf; /** @} */ /**************************************************/ /** @name Fields considered constant, updated when log system is initialized (log_sys_init()) and not assigned to particular log thread. *******************************************************/ /** @{ */ /** Capacity of the log files available for log_free_check(). */ lsn_t lsn_capacity_for_free_check; /** Capacity of log files excluding headers of the log files. If the checkpoint age exceeds this, it is a serious error, because in such case we have already overwritten redo log. */ lsn_t lsn_real_capacity; /** When the oldest dirty page age exceeds this value, we start an asynchronous preflush of dirty pages. */ lsn_t max_modified_age_async; /** When the oldest dirty page age exceeds this value, we start a synchronous flush of dirty pages. */ lsn_t max_modified_age_sync; /** When checkpoint age exceeds this value, we write checkpoints if lag between oldest_lsn and checkpoint_lsn exceeds max_checkpoint_lag. */ lsn_t max_checkpoint_age_async; /** @} */ /** true if redo logging is disabled. Read and write with writer_mutex */ bool m_disable; /** true, if server is not recoverable. Read and write with writer_mutex */ bool m_crash_unsafe; /** start LSN of first redo log file. * / lsn_t m_first_file_lsn; #endif /* !UNIV_HOTBACKUP * / };
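A note on the sn and lsn values that appear throughout log_t: sn counts only data bytes, while lsn also counts the headers and footers of the 512-byte log blocks. Assuming the usual block layout (12-byte header, 4-byte trailer, hence 496 data bytes per block), the translation looks roughly like the sketch below; the constant names mirror the ones used in the source, but sn_to_lsn itself is my own illustration.

// Sketch of translating an sn value (data bytes only) into an lsn value
// (which also covers log block headers/footers), assuming 512-byte blocks.
#include <cstdint>
#include <cstdio>

using lsn_t = uint64_t;
using sn_t = uint64_t;

constexpr uint64_t OS_FILE_LOG_BLOCK_SIZE = 512;
constexpr uint64_t LOG_BLOCK_HDR_SIZE = 12;
constexpr uint64_t LOG_BLOCK_TRL_SIZE = 4;
constexpr uint64_t LOG_BLOCK_DATA_SIZE =
    OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_HDR_SIZE - LOG_BLOCK_TRL_SIZE;  // 496

static lsn_t sn_to_lsn(sn_t sn) {
  // Full blocks contribute 512 bytes each; the remainder sits after the
  // current block's 12-byte header.
  return sn / LOG_BLOCK_DATA_SIZE * OS_FILE_LOG_BLOCK_SIZE +
         sn % LOG_BLOCK_DATA_SIZE + LOG_BLOCK_HDR_SIZE;
}

int main() {
  std::printf("sn=0   -> lsn=%llu\n", (unsigned long long)sn_to_lsn(0));    // 12
  std::printf("sn=496 -> lsn=%llu\n", (unsigned long long)sn_to_lsn(496));  // 524
  return 0;
}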
The redo log is ultimately stored in the file system, and changes are flushed to disk in the form of mini-transactions (mtr). The key structure here is mtr_t:
/** Mini-transaction handle and buffer */ struct mtr_t { /** State variables of the mtr */ struct Impl { /** memo stack for locks etc. */ mtr_buf_t m_memo; /** mini-transaction log */ mtr_buf_t m_log; /** true if mtr has made at least one buffer pool page dirty */ bool m_made_dirty; /** true if inside ibuf changes */ bool m_inside_ibuf; /** true if the mini-transaction modified buffer pool pages */ bool m_modifications; /** true if mtr is forced to NO_LOG mode because redo logging is disabled globally. In this case, mtr increments the global counter at ::start and must decrement it back at ::commit. */ bool m_marked_nolog; /** Shard index used for incrementing global counter at ::start. We need to use the same shard while decrementing counter at ::commit. */ size_t m_shard_index; /** Count of how many page initial log records have been written to the mtr log */ ib_uint32_t m_n_log_recs; /** specifies which operations should be logged; default value MTR_LOG_ALL */ mtr_log_t m_log_mode; /** State of the transaction */ mtr_state_t m_state; /** Flush Observer */ FlushObserver *m_flush_observer; #ifdef UNIV_DEBUG /** For checking corruption. */ ulint m_magic_n; #endif /* UNIV_DEBUG */ /** Owning mini-transaction */ mtr_t *m_mtr; }; #ifndef UNIV_HOTBACKUP /** mtr global logging */ class Logging { public: /** mtr global redo logging state. Enable Logging : [ENABLED] -> [ENABLED_RESTRICT] -> [DISABLED] Disable Logging : [DISABLED] -> [ENABLED_RESTRICT] -> [ENABLED_DBLWR] -> [ENABLED] */ enum State : uint32_t { /* Redo Logging is enabled. Server is crash safe. */ ENABLED, /* Redo logging is enabled. All non-logging mtr are finished with the pages flushed to disk. Double write is enabled. Some pages could be still getting written to disk without double-write. Not safe to crash. */ ENABLED_DBLWR, /* Redo logging is enabled but there could be some mtrs still running in no logging mode. Redo archiving and clone are not allowed to start. No double-write */ ENABLED_RESTRICT, /* Redo logging is disabled and all new mtrs would not generate any redo. Redo archiving and clone are not allowed. */ DISABLED }; /** Initialize logging state at server start up. */ void init() { m_state.store(ENABLED); /* We use sharded counter and force sequentially consistent counting which is the general default for c++ atomic operation. If we try to optimize it further specific to current operations, we could use Release-Acquire ordering i.e. std::memory_order_release during counting and std::memory_order_acquire while checking for the count. However, sharding looks to be good enough for now and we should go for non default memory ordering only with some visible proof for improvement. */ m_count_nologging_mtr.set_order(std::memory_order_seq_cst); Counter::clear(m_count_nologging_mtr); } /** Disable mtr redo logging. Server is crash unsafe without logging. @param[in] thd server connection THD @return mysql error code. */ int disable(THD *thd); /** Enable mtr redo logging. Ensure that the server is crash safe before returning. @param[in] thd server connection THD @return mysql error code. */ int enable(THD *thd); /** Mark a no-logging mtr to indicate that it would not generate redo log and system is crash unsafe. @return true iff logging is disabled and mtr is marked. */ bool mark_mtr(size_t index) { /* Have initial check to avoid incrementing global counter for regular case when redo logging is enabled. */ if (is_disabled()) { /* Increment counter to restrict state change DISABLED to ENABLED. 
*/ Counter::inc(m_count_nologging_mtr, index); /* Check if the no-logging is still disabled. At this point, if we find the state disabled, it is no longer possible for the state move back to enabled till the mtr finishes and we unmark the mtr. */ if (is_disabled()) { return (true); } Counter::dec(m_count_nologging_mtr, index); } return (false); } /** unmark a no logging mtr. */ void unmark_mtr(size_t index) { ut_ad(!is_enabled()); ut_ad(Counter::total(m_count_nologging_mtr) > 0); Counter::dec(m_count_nologging_mtr, index); } /* @return flush loop count for faster response when logging is disabled. */ uint32_t get_nolog_flush_loop() const { return (NOLOG_MAX_FLUSH_LOOP); } /** @return true iff redo logging is enabled and server is crash safe. */ bool is_enabled() const { return (m_state.load() == ENABLED); } /** @return true iff redo logging is disabled and new mtrs are not going to generate redo log. */ bool is_disabled() const { return (m_state.load() == DISABLED); } /** @return true iff we can skip data page double write. */ bool dblwr_disabled() const { auto state = m_state.load(); return (state == DISABLED || state == ENABLED_RESTRICT); } /* Force faster flush loop for quicker adaptive flush response when logging is disabled. When redo logging is disabled the system operates faster with dirty pages generated at much faster rate. */ static constexpr uint32_t NOLOG_MAX_FLUSH_LOOP = 5; private: /** Wait till all no-logging mtrs are finished. @return mysql error code. */ int wait_no_log_mtr(THD *thd); private: /** Global redo logging state. */ std::atomic<State> m_state; using Shards = Counter::Shards<128>; /** Number of no logging mtrs currently running. */ Shards m_count_nologging_mtr; }; /** Check if redo logging is disabled globally and mark the global counter till mtr ends. */ void check_nolog_and_mark(); /** Check if the mtr has marked the global no log counter and unmark it. */ void check_nolog_and_unmark(); #endif /* !UNIV_HOTBACKUP */ mtr_t() { m_impl.m_state = MTR_STATE_INIT; m_impl.m_marked_nolog = false; m_impl.m_shard_index = 0; } ~mtr_t() { #ifdef UNIV_DEBUG switch (m_impl.m_state) { case MTR_STATE_ACTIVE: ut_ad(m_impl.m_memo.size() == 0); ut_d(remove_from_debug_list()); break; case MTR_STATE_INIT: case MTR_STATE_COMMITTED: break; case MTR_STATE_COMMITTING: ut_error; } #endif /* UNIV_DEBUG */ #ifndef UNIV_HOTBACKUP /* Safety check in case mtr is not committed. */ if (m_impl.m_state != MTR_STATE_INIT) { check_nolog_and_unmark(); } #endif /* !UNIV_HOTBACKUP */ }
Note the log buffer type used inside it: typedef dyn_buf_t&lt;DYN_ARRAY_DATA_SIZE&gt; mtr_buf_t;
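To make the role of these fields a bit more concrete, below is a minimal, self-contained model of the commit pattern the struct implies (ToyMtr and all of its members are hypothetical simplifications, not the real mtr0mtr.cc code): redo records accumulate in the mtr's private m_log buffer and only become visible in the shared log buffer as one group at commit time.

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Toy model of a mini-transaction: records are buffered privately and
// published to the shared log buffer only at commit, together with the
// bookkeeping of which pages were dirtied.
struct ToyMtr {
  std::vector<std::string> m_log;        // stands in for mtr_buf_t m_log
  std::vector<uint32_t> m_dirty_pages;   // pages this mtr has modified
  bool m_made_dirty = false;

  void write(uint32_t page_no, const std::string &rec) {
    m_log.push_back(rec);                // buffer the redo record locally
    m_dirty_pages.push_back(page_no);
    m_made_dirty = true;
  }

  // At commit the whole group of records becomes visible at once; this is
  // what keeps a page change and its redo record consistent for recovery.
  void commit(std::vector<std::string> &shared_log_buffer) {
    for (auto &rec : m_log) shared_log_buffer.push_back(rec);
    m_log.clear();
    // the real code would also add m_dirty_pages to the buffer pool flush list
  }
};

int main() {
  std::vector<std::string> log_buffer;   // stands in for the shared log buffer
  ToyMtr mtr;
  mtr.write(7, "set page 7 field x = 42");
  mtr.write(7, "update page 7 checksum");
  mtr.commit(log_buffer);
  std::printf("records in shared log buffer: %zu\n", log_buffer.size());
  return 0;
}

In the actual source the usage pattern is roughly mtr_t mtr; mtr.start(); ...modify pages...; mtr.commit(); the point of the private buffer is that all records of one mini-transaction are appended to the shared log buffer as a unit.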
Next comes the log_t data structure:
struct alignas(ut::INNODB_CACHE_LINE_SIZE) log_t { #ifndef UNIV_HOTBACKUP /**************************************************/ /** @name Users writing to log buffer *******************************************************/ /** @{ */ /** Event used for locking sn */ os_event_t sn_lock_event; #ifdef UNIV_PFS_RWLOCK /** The instrumentation hook */ struct PSI_rwlock *pfs_psi; #endif /* UNIV_PFS_RWLOCK */ #ifdef UNIV_DEBUG /** The rw_lock instance only for the debug info list */ /* NOTE: Just "rw_lock_t sn_lock_inst;" and direct minimum initialization seem to hit the bug of Sun Studio of Solaris. */ rw_lock_t *sn_lock_inst; #endif /* UNIV_DEBUG */ /** Current sn value. Used to reserve space in the redo log, and used to acquire an exclusive access to the log buffer. Represents number of data bytes that have ever been reserved. Bytes of headers and footers of log blocks are not included. Its highest bit is used for locking the access to the log buffer. */ alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_sn_t sn; /** Intended sn value while x-locked. */ atomic_sn_t sn_locked; /** Mutex which can be used for x-lock sn value */ mutable ib_mutex_t sn_x_lock_mutex; /** Aligned log buffer. Committing mini-transactions write there redo records, and the log_writer thread writes the log buffer to disk in background. Protected by: locking sn not to add. */ alignas(ut::INNODB_CACHE_LINE_SIZE) ut::aligned_array_pointer<byte, LOG_BUFFER_ALIGNMENT> buf; /** Size of the log buffer expressed in number of data bytes, that is excluding bytes for headers and footers of log blocks. */ atomic_sn_t buf_size_sn; /** Size of the log buffer expressed in number of total bytes, that is including bytes for headers and footers of log blocks. */ size_t buf_size; /** The recent written buffer. Protected by: locking sn not to add. */ alignas(ut::INNODB_CACHE_LINE_SIZE) Link_buf<lsn_t> recent_written; /** Used for pausing the log writer threads. When paused, each user thread should write log as in the former version. */ std::atomic_bool writer_threads_paused; /** Some threads waiting for the ready for write lsn by closer_event. */ lsn_t current_ready_waiting_lsn; /** current_ready_waiting_lsn is waited using this sig_count. */ int64_t current_ready_waiting_sig_count; /** The recent closed buffer. Protected by: locking sn not to add. */ alignas(ut::INNODB_CACHE_LINE_SIZE) Link_buf<lsn_t> recent_closed; /** @} */ /**************************************************/ /** @name Users <=> writer *******************************************************/ /** @{ */ /** Maximum sn up to which there is free space in both the log buffer and the log files. This is limitation for the end of any write to the log buffer. Threads, which are limited need to wait, and possibly they hold latches of dirty pages making a deadlock possible. Protected by: writer_mutex (writes). */ alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_sn_t buf_limit_sn; /** Up to this lsn, data has been written to disk (fsync not required). Protected by: writer_mutex (writes). */ alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_lsn_t write_lsn; /** Unaligned pointer to array with events, which are used for notifications sent from the log write notifier thread to user threads. The notifications are sent when write_lsn is advanced. User threads wait for write_lsn >= lsn, for some lsn. Log writer advances the write_lsn and notifies the log write notifier, which notifies all users interested in nearby lsn values (lsn belonging to the same log block). 
Note that false wake-ups are possible, in which case user threads simply retry waiting. */ alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t *write_events; /** Number of entries in the array with writer_events. */ size_t write_events_size; /** Approx. number of requests to write/flush redo since startup. */ alignas(ut::INNODB_CACHE_LINE_SIZE) std::atomic<uint64_t> write_to_file_requests_total; /** How often redo write/flush is requested in average. Measures in microseconds. Log threads do not spin when the write/flush requests are not frequent. */ alignas(ut::INNODB_CACHE_LINE_SIZE) std::atomic<std::chrono::microseconds> write_to_file_requests_interval; static_assert(decltype(write_to_file_requests_interval)::is_always_lock_free); /** @} */ /**************************************************/ /** @name Users <=> flusher *******************************************************/ /** @{ */ /** Unaligned pointer to array with events, which are used for notifications sent from the log flush notifier thread to user threads. The notifications are sent when flushed_to_disk_lsn is advanced. User threads wait for flushed_to_disk_lsn >= lsn, for some lsn. Log flusher advances the flushed_to_disk_lsn and notifies the log flush notifier, which notifies all users interested in nearby lsn values (lsn belonging to the same log block). Note that false wake-ups are possible, in which case user threads simply retry waiting. */ alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t *flush_events; /** Number of entries in the array with events. */ size_t flush_events_size; /** This event is in the reset state when a flush is running; a thread should wait for this without owning any of redo mutexes, but NOTE that to reset this event, the thread MUST own the writer_mutex */ os_event_t old_flush_event; /** Up to this lsn data has been flushed to disk (fsynced). */ alignas(ut::INNODB_CACHE_LINE_SIZE) atomic_lsn_t flushed_to_disk_lsn; /** @} */ /**************************************************/ /** @name Log flusher thread *******************************************************/ /** @{ */ /** Last flush start time. Updated just before fsync starts. */ alignas(ut::INNODB_CACHE_LINE_SIZE) Log_clock_point last_flush_start_time; /** Last flush end time. Updated just after fsync is finished. If smaller than start time, then flush operation is pending. */ Log_clock_point last_flush_end_time; /** Flushing average time (in microseconds). */ double flush_avg_time; /** Mutex which can be used to pause log flusher thread. */ mutable ib_mutex_t flusher_mutex; alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t flusher_event; /** @} */ /**************************************************/ /** @name Log writer thread *******************************************************/ /** @{ */ /** Size of buffer used for the write-ahead (in bytes). */ alignas(ut::INNODB_CACHE_LINE_SIZE) uint32_t write_ahead_buf_size; /** Aligned buffer used for some of redo log writes. Data is copied there from the log buffer and written to disk, in following cases: - when writing ahead full kernel page to avoid read-on-write issue, - to copy, prepare and write the incomplete block of the log buffer (because mini-transactions might be writing new redo records to the block in parallel, when the block is being written to disk) */ ut::aligned_array_pointer<byte, LOG_WRITE_AHEAD_BUFFER_ALIGNMENT> write_ahead_buf; /** Up to this file offset in the log files, the write-ahead has been done or is not required (for any other reason). 
*/ os_offset_t write_ahead_end_offset; /** File within which write_lsn is located, so the newest file in m_files in the same time - updates are protected by the m_files_mutex. This field exists, because the log_writer thread needs to locate offsets each time it writes data blocks to disk, but we do not want to acquire and release the m_files_mutex for each such write, because that would slow down the log_writer thread a lot. Instead of that, the log_writer uses this object to locate the offsets. Updates of this field require two mutexes: writer_mutex and m_files_mutex. Its m_id is updated only when the write_lsn moves to the next log file. */ Log_file m_current_file{m_files_ctx, m_encryption_metadata}; /** Handle for the opened m_current_file. The log_writer uses this handle to do writes (protected by writer_mutex). The log_flusher uses this handle to do fsyncs (protected by flusher_mutex). Both these threads might use this handle in parallel. The required synchronization between writes and fsyncs will happen on the OS side. When m_current_file is repointed to other file, this field is also updated, in the same critical section. Updates of this field are protected by: writer_mutex, m_files_mutex and flusher_mutex acquired all together. The reason for flusher_mutex is to avoid a need to acquire / release m_files_mutex in the log_flusher thread for each fsync. Instead of that, the log_flusher thread keeps the log_flusher_mutex, which is released less often, but still prevents from updates of this field. */ Log_file_handle m_current_file_handle{m_encryption_metadata}; /** True iff the log writer has entered extra writer margin and still hasn't exited since then. Each time the log_writer enters that margin, it pauses all user threads at log_free_check() calls and emits warning to the log. When the writer exits the extra margin, notice is emitted. Protected by: log_limits_mutex and writer_mutex. */ bool m_writer_inside_extra_margin; #endif /* !UNIV_HOTBACKUP */ /** Number of performed IO operations (only for printing stats). */ uint64_t n_log_ios; #ifndef UNIV_HOTBACKUP /** Mutex which can be used to pause log writer thread. */ mutable ib_mutex_t writer_mutex; #ifdef UNIV_DEBUG /** THD used by the log_writer thread. */ THD *m_writer_thd; #endif /* UNIV_DEBUG */ alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t writer_event; /** A recently seen value of log_consumer_get_oldest()->get_consumed_lsn(). It serves as a lower bound for future values of this expression, because it is guaranteed to be monotonic in time: each individual consumer can only go forward, and new consumers must start at least from checkpoint lsn, and the checkpointer is always one of the consumers. Protected by: writer_mutex. */ lsn_t m_oldest_need_lsn_lowerbound; /** @} */ /**************************************************/ /** @name Log closer thread *******************************************************/ /** @{ */ /** Event used by the log closer thread to wait for tasks. */ alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t closer_event; /** Mutex which can be used to pause log closer thread. */ mutable ib_mutex_t closer_mutex; /** @} */ /**************************************************/ /** @name Log flusher <=> flush_notifier *******************************************************/ /** @{ */ /** Event used by the log flusher thread to notify the log flush notifier thread, that it should proceed with notifying user threads waiting for the advanced flushed_to_disk_lsn (because it has been advanced). 
*/ alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t flush_notifier_event; /** The next flushed_to_disk_lsn can be waited using this sig_count. */ int64_t current_flush_sig_count; /** Mutex which can be used to pause log flush notifier thread. */ mutable ib_mutex_t flush_notifier_mutex; /** @} */ /**************************************************/ /** @name Log writer <=> write_notifier *******************************************************/ /** @{ */ /** Mutex which can be used to pause log write notifier thread. */ alignas(ut::INNODB_CACHE_LINE_SIZE) mutable ib_mutex_t write_notifier_mutex; /** Event used by the log writer thread to notify the log write notifier thread, that it should proceed with notifying user threads waiting for the advanced write_lsn (because it has been advanced). */ alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t write_notifier_event; /** @} */ /**************************************************/ /** @name Log files management *******************************************************/ /** @{ */ /** Mutex protecting set of existing log files and their meta data. */ alignas(ut::INNODB_CACHE_LINE_SIZE) mutable ib_mutex_t m_files_mutex; /** Context for all operations on redo log files from log0files_io.h. */ Log_files_context m_files_ctx; /** The in-memory dictionary of log files. Protected by: m_files_mutex. */ Log_files_dict m_files{m_files_ctx}; /** Number of existing unused files (those with _tmp suffix). Protected by: m_files_mutex. */ size_t m_unused_files_count; /** Size of each unused redo log file, to which recently all unused redo log files became resized. Expressed in bytes. */ os_offset_t m_unused_file_size; /** Capacity limits for the redo log. Responsible for resize. Mutex protection is decided per each Log_files_capacity method. */ Log_files_capacity m_capacity; /** True iff log_writer is waiting for a next log file available. Protected by: m_files_mutex. */ bool m_requested_files_consumption; /** Statistics related to redo log files consumption and creation. Protected by: m_files_mutex. */ Log_files_stats m_files_stats; /** Event used by log files governor thread to wait. */ os_event_t m_files_governor_event; /** Event used by other threads to wait until log files governor finished its next iteration. This is useful when some sys_var gets changed to wait until log files governor re-computed everything and then check if the concurrency_margin is safe to emit warning if needed (the warning would still belong to the sys_var's SET GLOBAL statement then). */ os_event_t m_files_governor_iteration_event; /** False if log files governor thread is allowed to add new redo records. This is set as intention, to tell the log files governor about what it is allowed to do. To ensure that the log_files_governor is aware of what has been told, user needs to wait on @see m_no_more_dummy_records_promised. */ std::atomic_bool m_no_more_dummy_records_requested; /** False if the log files governor thread is allowed to add new dummy redo records. This is set to true only by the log_files_governor thread, and after it observed @see m_no_more_dummy_records_requested being true. It can be used to wait until the log files governor thread promises not to generate any more dummy redo records. */ std::atomic_bool m_no_more_dummy_records_promised; #ifdef UNIV_DEBUG /** THD used by the log_files_governor thread. */ THD *m_files_governor_thd; #endif /* UNIV_DEBUG */ /** Event used for waiting on next file available. 
Used by log writer thread to wait when it needs to produce a next log file but there are no free (consumed) log files available. */ os_event_t m_file_removed_event; /** Buffer that contains encryption meta data encrypted with master key. Protected by: m_files_mutex */ byte m_encryption_buf[OS_FILE_LOG_BLOCK_SIZE]; #endif /* !UNIV_HOTBACKUP */ /** Encryption metadata. This member is passed to Log_file_handle objects created for redo log files. In particular, the m_current_file_handle has a reference to this field. When encryption metadata is updated, it needs to be written to the redo log file's header. Also, each write performed by the log_writer thread needs to use m_encryption_metadata (it's passed by reference to the m_current_file_handle) and the log_writer does not acquire m_files_mutex for its writes (it is a hot path and it's better to keep it shorter). Therefore it's been decided that updates of this field require both m_files_mutex and writer_mutex. Protected by: m_files_mutex, writer_mutex */ Encryption_metadata m_encryption_metadata; #ifndef UNIV_HOTBACKUP /** @} */ /**************************************************/ /** @name Consumers *******************************************************/ /** @{ */ /** Set of registered redo log consumers. Note, that this object is not responsible for freeing them (does not claim to be owner). If you wanted to register or unregister a redo log consumer, then please use following functions: @see log_consumer_register() and @see log_consumer_unregister(). The details of implementation related to redo log consumers can be found in log0consumer.cc. Protected by: m_files_mutex (unless it is the startup phase or the shutdown phase). */ ut::unordered_set<Log_consumer *> m_consumers; /** @} */ /**************************************************/ /** @name Maintenance *******************************************************/ /** @{ */ /** Used for stopping the log background threads. */ alignas(ut::INNODB_CACHE_LINE_SIZE) std::atomic_bool should_stop_threads; /** Event used for pausing the log writer threads. */ os_event_t writer_threads_resume_event; /** Used for resuming write notifier thread */ atomic_lsn_t write_notifier_resume_lsn; /** Used for resuming flush notifier thread */ atomic_lsn_t flush_notifier_resume_lsn; /** Number of total I/O operations performed when we printed the statistics last time. */ mutable uint64_t n_log_ios_old; /** Wall time when we printed the statistics last time. */ mutable time_t last_printout_time; /** @} */ /**************************************************/ /** @name Recovery *******************************************************/ /** @{ */ /** Lsn from which recovery has been started. */ lsn_t recovered_lsn; /** Format of the redo log: e.g., Log_format::CURRENT. */ Log_format m_format; /** Log creator name */ std::string m_creator_name; /** Log flags */ Log_flags m_log_flags; /** Log UUID */ Log_uuid m_log_uuid; /** Used only in recovery: recovery scan succeeded up to this lsn. */ lsn_t m_scanned_lsn; #ifdef UNIV_DEBUG /** When this is set, writing to the redo log should be disabled. We check for this in functions that write to the redo log. */ bool disable_redo_writes; /** DEBUG only - if we copied or initialized the first block in buffer, this is set to lsn for which we did that. We later ensure that we start the redo log at the same lsn. Else it is zero and we would crash when trying to start redo then. 
*/ lsn_t first_block_is_correct_for_lsn; #endif /* UNIV_DEBUG */ /** @} */ /**************************************************/ /** @name Fields protected by the log_limits_mutex. Related to free space in the redo log. *******************************************************/ /** @{ */ /** Mutex which protects fields: available_for_checkpoint_lsn, requested_checkpoint_lsn. It also synchronizes updates of: free_check_limit_sn, concurrency_margin, dict_persist_margin. It protects reads and writes of m_writer_inside_extra_margin. It also protects the srv_checkpoint_disabled (together with the checkpointer_mutex). */ alignas(ut::INNODB_CACHE_LINE_SIZE) mutable ib_mutex_t limits_mutex; /** A new checkpoint could be written for this lsn value. Up to this lsn value, all dirty pages have been added to flush lists and flushed. Updated in the log checkpointer thread by taking minimum oldest_modification out of the last dirty pages from each flush list. However it will not be bigger than the current value of log.buf_dirty_pages_added_up_to_lsn. Read by: user threads when requesting fuzzy checkpoint Read by: log_print() (printing status of redo) Updated by: log_checkpointer Protected by: limits_mutex. */ lsn_t available_for_checkpoint_lsn; /** When this is larger than the latest checkpoint, the log checkpointer thread will be forced to write a new checkpoint (unless the new latest checkpoint lsn would still be smaller than this value). Read by: log_checkpointer Updated by: user threads (log_free_check() or for sharp checkpoint) Protected by: limits_mutex. */ lsn_t requested_checkpoint_lsn; /** Maximum lsn allowed for checkpoint by dict_persist or zero. This will be set by dict_persist_to_dd_table_buffer(), which should be always called before really making a checkpoint. If non-zero, up to this lsn value, dynamic metadata changes have been written back to mysql.innodb_dynamic_metadata under dict_persist->mutex protection. All dynamic metadata changes after this lsn have to be kept in redo logs, but not discarded. If zero, just ignore it. Updated by: DD (when persisting dynamic meta data) Updated by: log_checkpointer (reset when checkpoint is written) Protected by: limits_mutex. */ lsn_t dict_max_allowed_checkpoint_lsn; /** If should perform checkpoints every innodb_log_checkpoint_every ms. Disabled during startup / shutdown. Enabled in srv_start_threads. Updated by: starting thread (srv_start_threads) Read by: log_checkpointer */ bool periodical_checkpoints_enabled; /** If checkpoints are allowed. When this is set to false, neither new checkpoints might be written nor lsn available for checkpoint might be updated. This is useful in recovery period, when neither flush lists can be trusted nor DD dynamic metadata redo records might be reclaimed. This is never set from true to false after log_start(). */ std::atomic_bool m_allow_checkpoints; /** Maximum sn up to which there is free space in the redo log. Threads check this limit and compare to current log.sn, when they are outside mini-transactions and hold no latches. The formula used to compute the limitation takes into account maximum size of mtr and thread concurrency to include proper margins and avoid issues with race condition (in which all threads check the limitation and then all proceed with their mini-transactions). Also extra margin is there for dd table buffer cache (dict_persist_margin). 
Read by: user threads (log_free_check()) Updated by: log_checkpointer (after update of checkpoint_lsn) Updated by: log_writer (after pausing/resuming user threads) Updated by: DD (after update of dict_persist_margin) Protected by (updates only): limits_mutex. */ atomic_sn_t free_check_limit_sn; /** Margin used in calculation of @see free_check_limit_sn. Protected by (updates only): limits_mutex. */ atomic_sn_t concurrency_margin; /** True iff current concurrency_margin isn't truncated because of too small redo log capacity. Protected by (updates only): limits_mutex. */ std::atomic<bool> concurrency_margin_is_safe; /** Margin used in calculation of @see free_check_limit_sn. Read by: page_cleaners, log_checkpointer Updated by: DD Protected by (updates only): limits_mutex. */ atomic_sn_t dict_persist_margin; /** @} */ /**************************************************/ /** @name Log checkpointer thread *******************************************************/ /** @{ */ /** Event used by the log checkpointer thread to wait for requests. */ alignas(ut::INNODB_CACHE_LINE_SIZE) os_event_t checkpointer_event; /** Mutex which can be used to pause log checkpointer thread. This is used by log_position_lock() together with log_buffer_x_lock(), to pause any changes to current_lsn or last_checkpoint_lsn. */ mutable ib_mutex_t checkpointer_mutex; /** Latest checkpoint lsn. Read by: user threads, log_print (no protection) Read by: log_writer (under writer_mutex) Updated by: log_checkpointer (under both mutexes) Protected by (updates only): checkpointer_mutex + writer_mutex. */ atomic_lsn_t last_checkpoint_lsn; /** Next checkpoint header to use. Updated by: log_checkpointer Protected by: checkpointer_mutex */ Log_checkpoint_header_no next_checkpoint_header_no; /** Event signaled when last_checkpoint_lsn is advanced by the log_checkpointer thread. */ os_event_t next_checkpoint_event; /** Latest checkpoint wall time. Used by (private): log_checkpointer. */ Log_clock_point last_checkpoint_time; /** Redo log consumer which is always registered and which is responsible for protecting redo log records at lsn >= last_checkpoint_lsn. */ Log_checkpoint_consumer m_checkpoint_consumer{*this}; #ifdef UNIV_DEBUG /** THD used by the log_checkpointer thread. */ THD *m_checkpointer_thd; #endif /* UNIV_DEBUG */ /** @} */ #endif /* !UNIV_HOTBACKUP */ };
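One detail worth highlighting is the sn counter: space in the log buffer is reserved by atomically advancing sn, so concurrent user threads never need a mutex to decide where their bytes go. The sketch below is a deliberately simplified, self-contained model (ToyLogBuffer and its members are hypothetical; the real logic lives in log0buf.cc and additionally translates sn into lsn, accounting for block headers and footers):

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Toy model of lock-free space reservation in the log buffer: each writer
// grabs a range of sn values with a single atomic fetch_add and then copies
// its record into the corresponding slots of a circular buffer.
struct ToyLogBuffer {
  static constexpr size_t kBufSize = 4096;
  std::atomic<uint64_t> sn{0};   // bytes of redo data ever reserved
  unsigned char buf[kBufSize];   // circular log buffer

  // Reserve `len` bytes and copy the record; returns the start sn of the range.
  uint64_t write_record(const void *rec, size_t len) {
    const uint64_t start = sn.fetch_add(len, std::memory_order_relaxed);
    for (size_t i = 0; i < len; ++i) {
      buf[(start + i) % kBufSize] =
          static_cast<const unsigned char *>(rec)[i];
    }
    // In InnoDB, the writer would now report completion of [start, start+len)
    // via recent_written so the log_writer thread knows the range is ready.
    return start;
  }
};

int main() {
  ToyLogBuffer log;
  const char rec[] = "redo-record";
  uint64_t at = log.write_record(rec, sizeof(rec));
  std::printf("record reserved at sn=%llu\n", (unsigned long long)at);
  return 0;
}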
To prevent holes in the log buffer, a dedicated data structure, Link_buf, is used:
template <typename Position = uint64_t> class Link_buf { public: /** Type used to express distance between two positions. It could become a parameter of template if it was useful. However there is no such need currently. */ typedef Position Distance; /** Constructs the link buffer. Allocated memory for the links. Initializes the tail pointer with 0. @param[in] capacity number of slots in the ring buffer */ explicit Link_buf(size_t capacity); Link_buf(); Link_buf(Link_buf &&rhs); Link_buf(const Link_buf &rhs) = delete; Link_buf &operator=(Link_buf &&rhs); Link_buf &operator=(const Link_buf &rhs) = delete; /** Destructs the link buffer. Deallocates memory for the links. */ ~Link_buf(); /** Add a directed link between two given positions. It is user's responsibility to ensure that there is space for the link. This is because it can be useful to ensure much earlier that there is space. @param[in] from position where the link starts @param[in] to position where the link ends (from -> to) */ void add_link(Position from, Position to); /** Add a directed link between two given positions. It is user's responsibility to ensure that there is space for the link. This is because it can be useful to ensure much earlier that there is space. In addition, advances the tail pointer in the buffer if possible. @param[in] from position where the link starts @param[in] to position where the link ends (from -> to) */ void add_link_advance_tail(Position from, Position to); /** Advances the tail pointer in the buffer by following connected path created by links. Starts at current position of the pointer. Stops when the provided function returns true. @param[in] stop_condition function used as a stop condition; (lsn_t prev, lsn_t next) -> bool; returns false if we should follow the link prev->next, true to stop @param[in] max_retry max fails to retry @return true if and only if the pointer has been advanced */ template <typename Stop_condition> bool advance_tail_until(Stop_condition stop_condition, uint32_t max_retry = 1); /** Advances the tail pointer in the buffer without additional condition for stop. Stops at missing outgoing link. @see advance_tail_until() @return true if and only if the pointer has been advanced */ bool advance_tail(); /** @return capacity of the ring buffer */ size_t capacity() const; /** @return the tail pointer */ Position tail() const; /** Checks if there is space to add link at given position. User has to use this function before adding the link, and should wait until the free space exists. @param[in] position position to check @return true if and only if the space is free */ bool has_space(Position position); /** Validates (using assertions) that there are no links set in the range [begin, end). */ void validate_no_links(Position begin, Position end); /** Validates (using assertions) that there no links at all. */ void validate_no_links(); private: /** Translates position expressed in original unit to position in the m_links (which is a ring buffer). @param[in] position position in original unit @return position in the m_links */ size_t slot_index(Position position) const; /** Computes next position by looking into slots array and following single link which starts in provided position. @param[in] position position to start @param[out] next computed next position @return false if there was no link, true otherwise */ bool next_position(Position position, Position &next); /** Deallocated memory, if it was allocated. */ void free(); /** Capacity of the buffer. 
*/ size_t m_capacity; /** Pointer to the ring buffer (unaligned). */ std::atomic<Distance> *m_links; /** Tail pointer in the buffer (expressed in original unit). */ alignas(ut::INNODB_CACHE_LINE_SIZE) std::atomic<Position> m_tail; };
The logging subsystem is rather involved: the relationship between the redo log and the LSN (log sequence number), the parallel writing introduced in 8.0, and so on. Making redo log commit lock-free means user threads write into the log buffer concurrently, and Link_buf was introduced precisely to deal with the holes that such concurrent writes leave in the log buffer. It feels somewhat similar to how writes are ordered in consensus protocols.
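The following is a small, self-contained illustration of the Link_buf idea (ToyLinkBuf is a hypothetical simplification of the class quoted above, not InnoDB code): writers that finish out of order register links, and the consumer's tail advances only across a contiguous chain, so a hole left by a slower writer blocks the tail exactly until that writer completes.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy model of Link_buf: each completed write [from, to) stores the distance
// (to - from) in a ring-buffer slot indexed by `from`; the consumer follows
// the chain of links starting at the tail and stops at the first hole.
class ToyLinkBuf {
 public:
  explicit ToyLinkBuf(size_t capacity)
      : m_capacity(capacity), m_links(capacity), m_tail(0) {
    for (auto &l : m_links) l.store(0);
  }

  // Called by a writer after it has finished the range [from, to).
  void add_link(uint64_t from, uint64_t to) {
    m_links[from % m_capacity].store(to - from, std::memory_order_release);
  }

  // Called by the consumer; returns true if the tail moved.
  bool advance_tail() {
    uint64_t tail = m_tail.load(std::memory_order_relaxed);
    const uint64_t start = tail;
    for (;;) {
      auto &slot = m_links[tail % m_capacity];
      uint64_t len = slot.load(std::memory_order_acquire);
      if (len == 0) break;                        // hole: write not finished yet
      slot.store(0, std::memory_order_relaxed);   // release the slot for reuse
      tail += len;
    }
    if (tail == start) return false;
    m_tail.store(tail, std::memory_order_release);
    return true;
  }

  uint64_t tail() const { return m_tail.load(std::memory_order_acquire); }

 private:
  size_t m_capacity;
  std::vector<std::atomic<uint64_t>> m_links;
  std::atomic<uint64_t> m_tail;
};

int main() {
  ToyLinkBuf buf(64);
  // Three writers finish out of order: [10,20) and [20,30) complete before [0,10).
  buf.add_link(10, 20);
  buf.add_link(20, 30);
  buf.advance_tail();
  std::printf("tail = %llu\n", (unsigned long long)buf.tail());  // still 0: hole at [0,10)
  buf.add_link(0, 10);  // the slowest writer finishes, the hole is filled
  buf.advance_tail();
  std::printf("tail = %llu\n", (unsigned long long)buf.tail());  // now 30
  return 0;
}

This is the mechanism behind recent_written (how far the log buffer content is known to be complete, so the log_writer can write it out) and recent_closed (bounding how far dirty-page registration on flush lists may lag behind).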
This article has analyzed the basic data structures behind InnoDB's in-memory buffers; the code that manages memory usage and these data structures will be covered in the next article. MEM_ROOT, analyzed earlier, belongs to memory management at the Server layer, covering shared and per-thread memory, while what is analyzed here is memory management inside the InnoDB storage engine. Underneath, both end up calling the same memory allocation interfaces (sbrk, mmap, and so on). When first encountering these data structures it can be hard to tell how they relate to one another, but looking carefully at the design makes the two sets of use cases easy to separate. It is like two people who both turn screws: one assembles televisions, the other refrigerators.
Knowing this, it also becomes clear why the memory used by the MySQL process as a whole is far larger than the value configured for the InnoDB buffer pool.
Tracing everything back to its source and untangling the threads is the simplest and most effective way to read code.