当前位置:   article > 正文

Indexes and Indexing_mysql index seek和index scan

mysql index seek和index scan

许多因素决定了 MySQL 的性能,但索引是最为特殊的,因为没有它们就无法实现性能。您可以删除其他因素(查询【query】、模式【schema】、数据【data】等)并仍然获得性能,但删除索引会将性能限制为蛮力:依赖于硬件的速度和容量。如果这本书的标题是《Brute Force MySQL Performance》,那么内容将与标题一样长:“Buy better, faster hardware”。你可能会笑,但就在几天前,我遇到了一个开发团队,他们一直在通过购买更快的硬件来提高云计算的性能,直到高昂的成本迫使他们问:“我们还能如何提高性能?”。

MySQL利用硬件、优化和索引来实现访问数据时的性能。硬件是一个明显的优势,因为MySQL运行在硬件上:硬件越快,性能越好。不太明显但可能更令人惊讶的是,硬件提供的优势最少。我一会儿会解释为什么。优化指的是许多技术、算法和数据结构,这些技术、算法和数据结构使MySQL能够有效地利用硬件。优化集中了硬件的力量。焦点就是灯泡和激光之间的区别。因此,优化比硬件提供了更多的优势。如果数据库很小,那么硬件和优化就足够了。但是增加数据大小会减少硬件和优化的好处。如果没有索引,性能将受到严重限制。

MySQL 在访问数据时利用硬件【hardware】、优化【Optimizations】和索引【indexes】来实现性能。硬件是一个明显的优势,因为 MySQL 在硬件上运行:硬件越快,性能越好。不太明显但可能更令人惊讶的是,硬件提供的杠杆作用是最小的。我一会儿会解释为什么。优化【Optimizations】是指使 MySQL 能够有效利用硬件的众多技术【techniques】、算法【algorithms】和数据结构【data structures】。优化【Optimizations】使硬件的力量成为焦点。而这个焦点是灯泡和激光之间的区别。因此,优化【Optimizations】提供了比硬件更多的杠杆作用。如果数据库很小,只利用硬件【hardware】和优化【Optimizations】就足够了。但是增加数据大小会降低硬件和优化的好处。如果没有索引,性能会受到严重限制。

为了说明这些要点,可以把 MySQL 看作一个支点,它利用硬件【hardware】、优化【Optimizations】和索引【indexes】来提升数据【data】,如图 2-1 所示。

Figure 2-1. MySQL performance without indexes

如果没有索引(在右侧),MySQL 在相对较小的数据量下实现了有限的性能。但如果在天平中添加索引【indexes】,如图 2-2 所示,MySQL 则可以实现大数据下的高性能。

 

Figure 2-2. MySQL performance with indexes

索引【indexes】提供了最多和最好的杠杆。它们对于任何大数据量都是必需的。 MySQL 性能需要适当的索引【indexes】和索引【indexing】,本章将详细介绍这两者。

几年前,我设计并实现了一个存储大量数据的应用程序。最初,我估计最大的表不会超过一百万行。但是数据归档代码中存在一个错误,导致表可以达到 10 亿行。多年来,没有人注意到,因为响应时间总是很好。为什么?好的索引。

本章讲授 MySQL 索引【indexes】和索引【indexing】。有五个主要部分:

  1. 第一个讨论为什么你不应该被硬件或 MySQL 调优分心。这是一个必要的题外话,以充分理解为什么硬件和MySQL调优不是提高MySQL性能的有效解决方案。
  2. 第二个是对 MySQL 索引【indexes】的直观介绍:它们是什么以及它们是如何工作的。
  3. 第三个教索引【indexing】——通过像 MySQL 一样思考来应用索引【indexes】,以获得最大的杠杆作用。
  4. 第四个涵盖了索引【indexes 】失去效力(杠杆)的常见原因。
  5. 第五是对 MySQL 表连接算法的简要概述,因为有效的连接依赖于有效的索引。

Red Herrings of Performance

红鲱鱼是一个习语,指从目标上分散注意力。在寻找提高 MySQL 性能的解决方案时,工程师会分心于两件事:更快的硬件和 Mysql 调优。

Better, Faster Hardware!

MySQL Tuning

MySQL Indexes: A Visual Introduction

索引是性能的关键,如果您回想一下“直接查询优化”,对查询和索引的更改可以解决很多性能问题。查询优化的过程需要对 MySQL 索引有深入的了解,这就是本节介绍的内容——通过大量插图详细介绍。

虽然这一节很详细,而且相对较长,但我将其称为介绍,因为还有更多内容需要学习。但是这个部分是打开 MySQL 查询优化宝库的钥匙。

下面的9个部分只适用于 InnoDB 表上的标准索引——由简单的 PRIMARY KEY 或 [UNIQUE] INDEX 表定义创建的索引类型。MySQL支持其他专门的索引类型,但我不会在本书中介绍它们,因为标准索引是性能的基础。

在我们深入了解 MySQL 索引的细节之前,我先从 InnoDB 表开始,它不仅会改变您查看索引的方式,还会改变大多数 MySQL 性能的方式。

InnoDB Tables Are Indexes

例 2-1 是表 elem ( element 的缩写)的结构,它包含 10 行数据。本章中的所有例子都引用了表 ele(只有一个明确指出的例外,所以花点时间来研究一下它。

  1. CREATE TABLE `elem` (
  2. `id` int unsigned NOT NULL,
  3. `a` char(2) NOT NULL,
  4. `b` char(2) NOT NULL,
  5. `c` char(2) NOT NULL,
  6. PRIMARY KEY (`id`),
  7. KEY `idx_a_b` (`a`,`b`)
  8. ) ENGINE=InnoDB;
  9. +----+------+------+------+
  10. | id | a | b | c |
  11. +----+------+------+------+
  12. | 1 | Ag | B | C |
  13. | 2 | Au | Be | Co |
  14. | 3 | Al | Br | Cr |
  15. | 4 | Ar | Br | Cd |
  16. | 5 | Ar | Br | C |
  17. | 6 | Ag | B | Co |
  18. | 7 | At | Bi | Ce |
  19. | 8 | Al | B | C |
  20. | 9 | Al | B | Cd |
  21. | 10 | Ar | B | Cd |
  22. +----+------+------+------+

Example 2-1. Table elem

表 elem 有两个索引:

  • 列 id 上的主键
  • 列 a、b 上的非唯一二级索引。

列 id 上的值是一个单调递增的整数。 a、b 和 c 列上的值是与列名字母对应的原子符号:“Ag”(银)表示 a 列,“B”(硼)表示 b 列,依此类推。行值是随机的,没有意义;这只是一个用于示例的简单表格。

图 2-3 显示了表 elem 的一个典型视图——为简洁起见,仅显示前四行。

Figure 2-3. Table elem: visual model

表 elem 没什么特别的,对吧?它是如此简单,有人可能会说它太初级了。但是如果我告诉你它不是一个真正的表,它是一个索引呢?图 2-4 显示了作为 InnoDB 表的表 elem 的真实结构。

Figure 2-4. Table elem: InnoDB B-tree index

InnoDB 表是由主键组织的 B-tree 索引。行【Rows】是存储在索引结构的叶节点中的索引记录【index records】。每个索引记录【index records】都有用于行锁定【ow locking】、事务隔离【 transaction isolation】等的元数据(用“...”表示)。 

图 2-4 是表 elem 的 B 树索引的高度简化描述。四个索引记录【index records】(图底部)对应于前四行。主键列值(1、2、3 和 4)显示在每个索引记录【index records】的顶部。其他列值(“Ag”、“B”、“C”等)显示在每个索引记录【index records】的元数据下方。

您无需了解 InnoDB B-tree 索引的技术细节即可理解或实现卓越的 MySQL 性能。只有两点很重要:

  • 主键查找非常快速和高效

  • 主键是MySQL性能的关键

第一点是毋庸置疑,因为 B-tree 索引本质上是快速和高效的,这也是许多数据库服务器使用它们的一个原因。

第二点在接下来的章节中会越来越清晰。

要了解数据库内部的迷人世界,包括索引,请阅读 Alex Petrov 的 Database Internals(O'Reilly,2019 年)。要深入了解InnoDB 内部结构,包括其 B-tree 实现,请取消所有的会议,并访问著名的 MySQL 专家 Jeremy Cole 的网站。

NOTE InnoDB 主键是聚簇索引【clustered index】。MySQL 手册有时将主键称为聚簇索引【clustered index】。

因为二级索引中也包括主键值,所以这一点尤其重要。图 2-5 显示了 a、b 列的二级索引。

Figure 2-5. Secondary index on columns a, b

二级索引也是 B-tree 索引,但叶节点存储主键值。当 MySQL 使用二级索引来查找一行时,它会对主键进行二次查找以读取整个行。让我们把这二者放在一起,通过二级索引查询 SELECT * FROM elem WHERE a='Au' AND b='Be'

Figure 2-6. Secondary index lookup for value “Au, Be”

图 2-6 显示了顶部的二级索引(列 a、b)和底部的主键(列 id)。六个标注(带编号的圆圈)显示使用二级索引查找值“Au,Be”的时序:

  1. 索引查找从根节点开始;针对值“Au,Be”,分支到右内部节点。

  2. 在内部节点上,针对值 “Au,Be”,分支到右叶节点。

  3. 二级索引值 “Au,Be”的叶节点中包含对应的主键值:2。

  4. 从根节点开始查找主键;针对值 2 ,分支到左内部节点。

  5. 在内部节点上,针对值 2 ,分支到右叶节点。

  6. 主键值 2 的叶节点包含匹配“Au,Be”查询条件的完整行。

NOTE 一个表只有一个主键。所有其他索引都是二级索引。

这一节很短,但非常重要,因为正确的模型为理解索引和其他内容提供了基础。例如,如果您回想一下前面的“Lock time”,,您可能会从一个新的角度看待它,因为行实际上是主键中的叶节点。知道 InnoDB 表是它的主键类似于知道太阳系的正确模型是日心说,而不是地心说。在 MySQL 的世界里,一切都围绕着主键。

Table Access Methods

使用索引查找行【rows】是三种表访问方法之一。因为表就是索引,所以索引查找【index lookup】是最好也是最常用的访问方法。但有时,根据查询的不同,索引查找【index lookup】是不可能的,唯一的办法是索引扫描【index scan】或表扫描【table scan 】——这是其他表访问方法。了解MySQL用于查询的访问方法是必要的,因为性能需要索引查找。避免索引扫描和表扫描。“EXPLAIN:查询执行计划”展示了如何查看访问方法。但首先,让我们来理清并形象化它们。

我们有必要知道 MySQL 查询时使用的哪种表访问方法,因为性能需要索引查找【index lookup】。避免索引扫描【index scans】和表扫描【table scans】。“EXPLAIN: Query Execution Plan” 展示了如何查看表访问方法。但首先,让我们对每一种表访问方法澄清并形象化。

NOTE MySQL手册使用术语访问方法【access method】、访问类型【access type】和连接类型【join type】。EXPLAIN 使用名为 type 或 access_type 的字段来引用这些术语。在MySQL中,这两个术语关系密切,但但使用含糊不清。

在本书中,为了准确性和一致行,我只使用了两个术语:访问方法【access method】和访问类型【access type】。 共有三种访问方法【access method】:索引查找【index lookup】、索引扫描【index scan】和表扫描【table scan】。 对于索引查找【index lookup】,有几种访问类型【access type】:ref、eq_ref、range 等等。

Index lookup

索引查找通过利用索引的有序结构【 ordered structure】和算法【algorithmic】来查找特定行(或行范围【ranges of rows】)。 这是最快的访问方法,因为这正是索引设计的目的:快速高效地访问大量数据。 因此,索引查找【 index lookups】对于直接查询优化【direct query optimization.】至关重要。 性能要求几乎每个查询对每个表都使用索引查找【index lookuo】。 索引查找有几种访问类型【access type】,我将在后面的章节中介绍,比如“WHERE”。

前一节中的图 2-6 显示了使用二级索引的索引查找【index lookups】。

Index scan

当无法进行索引查找【index lookups时,MySQL 必须使用蛮力查找行:读取所有行并过滤掉不匹配的行。 在 MySQL 使用主键读取每一行之前,它尝试使用二级索引读取行。这被称为索引扫描【index scan】。

有两种类型的索引扫描【index scan】

第一种是全索引扫描【full index scan】,这意味着 MySQL 按索引顺序【index order】读取所有行。 读取所有行通常对性能很不利,但是按索引顺序【index order】读取它们,可以避免在索引顺序【index order】与查询中的 ORDER BY 相匹配时对行进行排序。

图 2-7 显示了查询 SELECT * FROM elem FORCE INDEX (a) ORDER BY a, b 的全索引扫描【full index scan】。 FORCE INDEX 子句是必需的,因为由于表 elem 很小,MySQL 扫描主键并对行进行排序比扫描二级索引并按顺序获取行更有效。 (有时糟糕的查询会成为很好的例子。)

图 2-7 有八个标注(带编号的圆圈),显示了行访问【row acces】的顺序:

  1. 读取二级索引【secondary index (SI)】的第一个值 : “Ag, B.”

  2. 通过主键【primary key (PK)】查找对应的行 .

  3. 读取二级索引【secondary index (SI)】的第二个值: “Al, Br.”

  4. 通过主键【primary key (PK)】查找对应的行 .

  5. 读取二级索引【secondary index (SI)】的第三个值: “Ar, Br.”

  6. 通过主键【primary key (PK)】查找对应的行.

  7. 读取二级索引【secondary index (SI)】的第四个值: “Au, Be.”

  8. 通过主键【primary key (PK)】查找对应的行.

Figure 2-7. Full index scan on secondary index

图 2-7 中有一个微妙但重要的细节:按顺序扫描二级索引可能是顺序读取,但主键查找几乎可以肯定是随机读取。 按索引顺序【index order】访问行并不能保证顺序读取【sequential reads】; 更有可能的是,它会导致随机读取【random reads】。

NOTE 顺序访问(读取和写入)比随机访问快。

第二种索引扫描【index scan】是仅索引扫描【index-only scan】:MySQL 从索引中读取列值(不是整行)。 这需要一个覆盖索引【covering index】,稍后将在“Covering Indexes”中介绍。 仅索引扫描【index-only scan】应该比全索引扫描【full index scan】更快,因为它不需要主键查找读取完整的行; 它只从二级索引中读取列值,这就是它需要覆盖索引【covering index】的原因。

除非唯一的选择是全表扫描【full table scan】,否则不要针对索引扫描【index scan】进行优化。 否则,请避免索引扫描【index scan】。

Table scan

(全)表扫描按主键顺序读取所有行。 当 MySQL 无法进行索引查找【index lookup】或索引扫描【index scan】时,表扫描【table scan】是唯一的选择。 这通常对性能很不利,但通常也更容易修复,因为 MySQL 擅长使用索引,并且有许多基于索引的优化。 基本上每个带有 WHERE、GROUP BY 或 ORDER BY 子句的查询都可以使用索引(即使只是索引扫描),因为这些子句使用列【columns】并且列【columns】可以被索引。 因此,无法修复表扫描【table scan】的原因几乎为零。

图 2-8 显示了全表扫描:按主键顺序读取所有行。 它有四个标注,显示行访问的顺序。 表 elem 很小,这里只显示了四行,但想象一下 MySQL 在数千或数百万行的真实数据库表中是如何艰难挣扎的。

一般建议和最佳实践是避免表扫描【table scan】。 但要想进行全面而平衡的讨论,有两种情况,表扫描【table scan】可能是可以接受的或(令人惊讶的)更好的:

Figure 2-8. Full table scan

但是不要认为任何表扫描【table scan】是理所当然的:它们通常不利于性能。 在极少数情况下,当可以进行索引查找时,MySQL 可能会错误地选择表扫描,比如“It’s a Trap! (When MySQL Chooses Another Index)”

Leftmost Prefix Requirement

要使用索引,查询【query】必须使用索引的最左前缀【leftmost prefix】:一个或多个索引列从索引定义所指定的最左索引列开始。 最左前缀是必需的,因为底层索引结构是按索引列顺序排序的,并且只能按该顺序遍历(搜索)。

NOTE 使用 SHOW CREATE TABLE 或 SHOW INDEX 查看索引定义。

图 2-9 显示了在一个 a、b、c 三列的索引上,WHERE 子句使用每一个做左前缀:a 列; a,b列; 和 a,b,c 列。

Figure 2-9. Leftmost prefixes of a three-column index

图 2-9 中顶部的 WHERE 子句使用列 a,它是索引的最左列。 中间的 WHERE 子句使用列 a 和 b,它们一起构成索引的最左前缀。 底部的 WHERE 子句使用整个索引:a,b,c 三列。 使用索引的所有列是理想的,但这不是必需的; 只有最左前缀是必需的。 索引列可用于其他 SQL 子句,如下面的许多例子所示。

TIP 要使用索引,查询必须使用索引的最左前缀。

最左前缀条件有两个逻辑结论:

  • 索引 (a, b) 和 (b, a) 是不同的索引。 它们以不同的顺序索引相同的列,这导致不同的最左前缀。 但是,同时使用这两列的查询【query】(例如,WHERE a = 'Au' AND b = 'Be')可以使用任何一个索引,但这并不意味着索引在性能方面是等价的。MySQL将通过计算许多因素来选择两者中较好的。
  • MySQL 很可能使用索引 (a, b, c) 来代替索引 (a) 和 (a, b),因为后两个是第一个的最左前缀。 在这种情况下,索引 (a) 和 (a, b) 是重复的,可以删除。 使用 pt-duplicate-key-checker 来查找和报告重复索引。

潜伏在每个二级索引的末尾(最右)的是主键。 对于表 elem(示例 2-1),二级索引实际上是 (a, b, id),但最右的 id 是隐藏的。 MySQL 不显示附加到二级索引的主键; 你得自己想象一下它。

NOTE 主键附加到每个二级索引上: (S, P) 其中 S 是二级索引列,P 是主键列。

在 MySQL 的行话中,我们说“主键被附加到二级索引”,即使它不是字面上附加的。(你可以通过创建索引 (a, b, id) 来附加它,但不要这样做)。“附加到”实际上意味着二级索引叶节点包含主键值,如图 2-5 所示。 这很重要,因为它增加了每个二级索引的大小:主键值在二级索引中重复/冗余。 更大的索引需要更多的内存,这意味着更少的索引可以容纳在内存中。 保持主键的大小较小,二级索引的数量合理。就在前几天,我的同事正在帮助一个团队,该团队的数据库在 397 GB 的数据(主键)上具有 693 GB 的二级索引。

最左前缀的条件既是祝福也是限制。 使用额外的二级索引相对容易解决该限制,但请等到阅读“Excessive, Duplicate, and Unused”之后“Excessive, Duplicate, and Unused”。 考虑到该限制,连接表是一个特殊的挑战,但我在“Join Tables”.中解决了这个问题。 我鼓励您将最左前缀条件求视为一种祝福。 关于索引的查询优化并非易事,但最左前缀条件是这段旅程中一个简单而熟悉的起点。

EXPLAIN: Query Execution Plan

MySQL EXPLAIN 命令显示的是一个查询执行计划(或 EXPLAIN plan),它描述了 MySQL 计划如何执行查询:表连接顺序【 table join order,】、表访问方法【table access method】、索引使用【index usage】和其他重要细节。

EXPLAIN 的输出是广泛而多样的。 此外,它完全依赖于查询【query】。 更改查询【query】中的单个字符可能会显著更改其 EXPLAIN plan。 例如,WHERE id = 1 与 WHERE id > 1 产生明显不同的 EXPLAIN plan。 为了使事情进一步复杂,EXPLAIN 还在继续发展。 MySQL 手册中的“EXPLAIN Output Format”是必读的——即使是专家。 幸运的是,为了我们的理智,几十年来基本原理始终保持不变。

为了说明索引的使用,接下来的五个部分解释了 MySQL 可以使用索引的每种情况的查询:

还有其他特定情况,例如 MIN() 和 MAX(),但这五种情况是索引使用的主要来源。

但首先,我需要通过查看示例 2-2 中所示的 EXPLAIN 输出字段的含义来设置阶段。

  1. EXPLAIN SELECT * FROM elem WHERE id = 1\G
  2. *************************** 1. row ***************************
  3. id: 1
  4. select_type: SIMPLE
  5. table: elem
  6. partitions: NULL
  7. type: const
  8. possible_keys: PRIMARY
  9. key: PRIMARY
  10. key_len: 4
  11. ref: const
  12. rows: 1
  13. filtered: 100.00
  14. Extra: NULL

Example 2-2. EXPLAIN output (traditional format)

在此介绍汇总,我们忽略字段 id、select_type、partitions、key_len 和 filters; 但是这些示例中包含了它们,以使您习惯于输出。 其余七个字段传达了构成查询执行计划【query execution plan】的丰富信息:

  • table:该字段是表名【table name】(或别名【alias】)或子查询引用【subquery reference】。 表是按照 MySQL 确定的连接顺【 join order】[列出的,而不是它们在查询中出现的顺序。 最上面的表是第一个表,最下面的表是最后一个表。
  • type:该字段是表访问方法【 table access method】或索引查找访问类型【index lookup access type】——请参阅“Table Access Methods” 中的第一个注释说明。 ALL 表示全表扫描【full table scan】(参见“Table scan”)。 index 表示索引扫描【index scan】(参见“Index scan”)。 任何其他值(const、ref、range 等等)都是索引查找【index lookup 】的访问类型(请参阅 “Index lookup”)。
  • possible_keys:该字段列出了 MySQL 可以使用的索引,因为查询【query】使用了最左前缀。 如果该字段中未列出索引,则不满足最左前缀要求条件。
  • key:该字段是 MySQL 将使用的索引的名称,如果没有索引可以使用,则为 NULL。 MySQL 根据许多因素选择最佳索引,其中一些在 Extra 字段中指示。 可以肯定的是 MySQL 在执行查询时会使用这个索引(EXPLAIN 并不会执行查询),但请参阅 “It’s a Trap! (When MySQL Chooses Another Index)”
  • ref:该字段列举出了用于在索引(即 key 字段)中查找行的值的来源。
  • rows:该字段是 MySQL 将检查以查找匹配行的估计行数。 MySQL 使用索引统计信息【index statistics】来估计行数,因此可以期待真实的数字——“Rows examined”——既接近又不同。
  • Extra:该字段提供有关查询执行计划的附加信息。 该字段很重要,因为它指示 MySQL 可以应用的查询优化(如果有的话)。

NOTE 本书中的所有 EXPLAIN 输出都是传统格式【 traditional format】:表格输出【tabular output】(EXPLAIN query;)或列表输出【list outpu】(EXPLAIN query\G)。 其他格式是 JSON(EXPLAIN FORMAT=JSON query)和从 MySQL 8.0.16 开始的 tree (EXPLAIN FORMAT=TREE query)。 JSON 和 tree 格式与传统格式【 traditional format】完全不同,但所有格式都传达了查询执行计划。

不要期望在没有上下文的情况下从这些字段中收集很多信息:表【tables】、索引【indexes】、数据【data】和查询【query】。 在以下部分中,所有插图均参考自表 elem(示例 2-1)、它的两个索引和表中的十行数据。

WHERE

MySQL 可以使用索引来查找与 WHERE 子句中的表条件匹配的行。 我谨慎地说 MySQL 可以使用索引,而不是 MySQL 会使用索引,因为索引的使用取决于几个因素,主要是:表条件【table conditions】、索引【index】和最左前缀条件(参见 “Leftmost Prefix Requirement”)。 (还有其他因素,比如索引统计【index statistics】和优化器成本【optimizer costs】,但它们超出了本书的范围。)

表条件【table conditions】是关于列【column】及其值之间的匹配【matches】、分组【groups】、聚合【aggregates】或行排序【orders rows】(简而言之,当条件明确时,我使用术语条件【condition 】)在 WHERE 子句中,表条件【table conditions 】也称为谓词【predicates】。

图 2-10 显示在主键为 id 时,一个带有单个条件的 WHERE 子句:id = 1。

实心框描绘了 MySQL 可以使用的表条件【table condition】和索引列【index column】(也称为索引部分【index part】),因为前者(表条件)是后者(索引)的最左前缀。 箭头从表条件【table condition】指向它使用的索引列【index column】。 稍后,我们将看到 MySQL 无法使用的表条件【table condition】和索引列【index column】的示例。 

在图 2-10 中,MySQL 可以通过使用主键列 id ,找到匹配条件 id = 1 的行。 示例 2-3 是完整查询的 EXPLAIN plan。

  1. EXPLAIN SELECT * FROM elem WHERE id = 1\G
  2. *************************** 1. row ***************************
  3. id: 1
  4. select_type: SIMPLE
  5. table: elem
  6. partitions: NULL
  7. type: const
  8. possible_keys: PRIMARY
  9. key: PRIMARY
  10. key_len: 4
  11. ref: const
  12. rows: 1
  13. filtered: 100.00
  14. Extra: NULL

Example 2-3. EXPLAIN plan for primary key lookup

在示例 2-3 中,key: PRIMARY 确认 MySQL 将使用主键——索引查找【index lookup】。 相应地,访问类型【access type】(即 type 字段)不是 ALL (表扫描【table scan】)或 index (索引扫描【index scan】),这是一个简单的主键查找所期望的。 二级索引未列在 possible_keys 字段中,因为 MySQL 无法将其用于此查询:列 id 不满足列 a, b 上的二级索引的最左前缀条件。

访问类型【access type】 const 是一种特殊情况,仅当主键或唯一的二级索引的所有索引列上存在常量条件(ref:const)时才会发生。 结果是一个 constant row。 这有点太深入了,但是既然我们学在这里,让我们继续学习。 给定表数据(示例 2-1)和列 id 是主键的事实,id = 1 标识的行可以视为常量,因为执行查询时,id = 1 只能匹配一行( 或没有行)。 MySQL 读取这一行并将其值视为常量,这对响应时间非常有用:const 访问非常快。

Extra:NULL ,这比较少见,因为实际查询比本示例更复杂。 但在这里,Extra: NULL 意味着 MySQL 不需要匹配行。 为什么? 因为常量行只能匹配一行(或没有行)。 但是匹配行才是常态,所以让我们看一个更现实的例子,将表条件改为 id > 3 AND id < 6 AND c = 'Cd',如图 2-11 所示,其对应的 EXPLAIN plan 见实例 2-4。

Figure 2-11. WHERE: range access using primary key

  1. EXPLAIN SELECT * FROM elem WHERE id > 3 AND id < 6 AND c = 'Cd'\G
  2. *************************** 1. row ***************************
  3. id: 1
  4. select_type: SIMPLE
  5. table: elem
  6. partitions: NULL
  7. > type: range
  8. possible_keys: PRIMARY
  9. key: PRIMARY
  10. key_len: 4
  11. > ref: NULL
  12. > rows: 2
  13. filtered: 10.00
  14. > Extra: Using where

Example 2-4. EXPLAIN plan for range access using primary key

NOTE 为了突出 EXPLAIN PLAN 有哪些变更,我将 > 字符添加到更改了的相关字段中。 这些高亮处并不是 EXPLAIN 的一部分。

通过将表条件【table condition】更改为 id > 3 AND id < 6 AND c = 'Cd',EXPLAIN PLAN 从示例 2-3 更改为示例 2-4,这对于单表查询来说更现实一点。 该查询仍然使用主键(key: PRIMARY),但访问类型【access type】更改为范围扫描(type: range):使用索引来读取值介于范围之间的行。 在这种情况下,MySQL 使用主键读取列 id 的值在 3 到 6 之间的行。 ref 字段为 NULL,因为列 id 上的条件不是常量的(并且这是一个单表查询,所以没有前面的表可以引用)。 条件 c = 'Cd' 是常量,但它不用于索引查找【index lookup】(范围扫描【range scan】),因此 ref 不起作用。 MySQL 估计它将检查范围内的两行数据(row:2)。这对于这个简单的小例子来说是正确的,但是请记住:rows是一个估计值。

Extra 字段中的“Using where”非常常见,这是意料之中的。 这意味着 MySQL 将使用 WHERE 条件查找匹配的行【matching rows】:对于读取的每一行,如果所有 WHERE 条件都为真,则匹配行。 由于列 id 上的条件定义了范围,因此 MySQL 将使用列 c 上的条件来匹配范围内的行【rows】。 回顾示例 2-1,有一行匹配所有 WHERE 条件:

  1. +----+------+------+------+
  2. | id | a | b | c |
  3. +----+------+------+------+
  4. | 4 | Ar | Br | Cd |
  5. +----+------+------+------+

id = 5 的行在范围内,因此 MySQL 会检查该行,但其列 c 值(“Cd”)与 WHERE 子句不匹配,因此 MySQL 不会返回该行。

为了说明其他查询执行计划【query execution plans】,让我们使用二级索引最左前缀,如图 2-12 所示,其对应的 EXPLAIN PLAN 见示例 2-5 。

Figure 2-12. WHERE: secondary index lookups

  1. EXPLAIN SELECT * FROM elem WHERE a = 'Au'\G
  2. *************************** 1. row ***************************
  3. id: 1
  4. select_type: SIMPLE
  5. table: elem
  6. partitions: NULL
  7. > type: ref
  8. possible_keys: idx_a_b
  9. > key: idx_a_b
  10. key_len: 3
  11. ref: const
  12. rows: 1
  13. filtered: 100.00
  14. Extra: NULL
  15. EXPLAIN SELECT * FROM elem WHERE a = 'Au' AND b = 'Be'\G
  16. *************************** 1. row ***************************
  17. id: 1
  18. select_type: SIMPLE
  19. table: elem
  20. partitions: NULL
  21. > type: ref
  22. possible_keys: idx_a_b
  23. > key: idx_a_b
  24. key_len: 6
  25. ref: const,const
  26. rows: 1
  27. filtered: 100.00
  28. Extra: NULL

Example 2-5. EXPLAIN plans for secondary index lookups

对于示例 2-5 中的每个 EXPLAIN PLAN,key: idx_a_b 确认 MySQL 使用二级索引,因为条件满足最左前缀条件。 第一个 WHERE 子句仅使用一个索引部分:a 列。 第二个 WHERE 子句使用两个索引部分:列 a 和 b。 仅使用 b 列不满足最左前缀条件——我稍后会展示这一点。

与先前的 EXPLAIN PLAN 相比,这里新增的重要部分的是访问类型【access type】,即 type:ref。 用最简单的术语来说,ref 访问类型是在索引(即 key 字段)的最左前缀上进行的相等(= 或 <=>)查找。 与任何索引查找【index lookup】一样,只要估计要检查的行数(即 rows 字段)是合理的,ref 访问就非常快。

虽然条件是常量,但是 const 访问类型是不可能的,因为索引(key: idx_a_b)是非唯一的,所以查找可能匹配上多行。 尽管 MySQL 估计每个 WHERE 子句将只检查一行(rows: 1),但在执行查询时可能会发生变化。

Extra:NULL 再次出现,因为 MySQL 可以仅使用索引找到匹配的行,因为非索引列上没有条件 - 所以现在让我们添加一个试试。 图 2-13 显示了一个 WHERE 子句,在 a 和 c 列上都有条件,示例 2-6 是相应的 EXPLAIN PLAN。

Figure 2-13. WHERE: index lookup and non-indexed column

  1. EXPLAIN SELECT * FROM elem WHERE a = 'Al' AND c = 'Co'\G
  2. *************************** 1. row ***************************
  3. id: 1
  4. select_type: SIMPLE
  5. table: elem
  6. partitions: NULL
  7. type: ref
  8. possible_keys: idx_a_b
  9. key: idx_a_b
  10. key_len: 3
  11. ref: const
  12. > rows: 3
  13. filtered: 10.00
  14. > Extra: Using where

Example 2-6. EXPLAIN plan for index lookup and non-indexed column

在图 2-13 中,条件 c = 'Co' 周围没有方框,因为索引没有覆盖 c 列。 MySQL 仍然使用二级索引(key:idx_a_b),但列 c 上的条件阻止了 MySQL 仅使用索引来匹配行【matching rows】。相反,MySQL 使用索引来查找并读取满足列 a 上的条件的行,然后再匹满足配列 c 上的条件的行(Extra: Using where)。

再次回顾示例 2-1,您会注意到没有任何行匹配这个 WHERE 子句,但 EXPLAIN 报告的是rows: 3。 为什么? a 列上的索引查找【index lookup】匹配 a = 'Al' 为 true 的是 3 行:行 id 的值分别为 3、8 和 9。但这些行都不匹配 c = 'Co'。 该查询检查 3 行,但匹配 0 行。

作为索引 WHERE 和 EXPLAIN 的最后一个示例,我们不满足最左前缀条件,如图 2-14 和示例 2-7 所示。

Figure 2-14. WHERE without leftmost prefix

  1. EXPLAIN SELECT * FROM elem WHERE b = 'Be'\G
  2. *************************** 1. row ***************************
  3. id: 1
  4. select_type: SIMPLE
  5. table: elem
  6. partitions: NULL
  7. > type: ALL
  8. possible_keys: NULL
  9. > key: NULL
  10. key_len: NULL
  11. ref: NULL
  12. rows: 10
  13. filtered: 10.00
  14. Extra: Using where

Example 2-7. EXPLAIN plan for WHERE without leftmost prefix

虚线框轮廓(并且没有箭头)描绘了 MySQL 不能使用的表条件【table condition】和索引列【index column】,因为它们不满足最左前缀条件。

在图 2-14 中,a 列上没有条件,因此索引不能用于 b 列上的条件。 EXPLAIN PLAN(示例 2-7)证实了这一点: possible_keys:NULL 和 key:NULL。 如果没有索引,MySQL 将被迫进行全表扫描:即 type: ALL。 同样, rows: 10 反映总行数,而 Extra: Using where 反映 MySQL 读取然后过滤不匹配 b = 'Be' 的行。

例 2-7 是一个最坏可能的 EXPLAIN PLAN 例子。 每当你看到 type: ALL、 possible_keys: NULL 或 key: NULL 时,停止你正在做的事情并分析查询。

尽管这些示例很简单,但它们代表了 EXPLAIN 关于索引和 WHERE 子句的基本原理。 真正的查询有更多的索引和 WHERE 条件,但基本原理没有改变。

GROUP BY

MySQL 可以使用索引来优化 GROUP BY,因为值是按索引顺序【index order】隐式分组的。 对于二级索引 idx_a_b(在 a、b 列上),关于 a 列上的值有 5 种不同的分组,如示例 2-8 所示。

  1. SELECT a, b FROM elem ORDER BY a, b;
  2. +------+------+
  3. | a | b |
  4. +------+------+
  5. | Ag | B | -- Ag group
  6. | Ag | B |
  7. | Al | B | -- Al group
  8. | Al | B |
  9. | Al | Br |
  10. | Ar | B | -- Ar group
  11. | Ar | Br |
  12. | Ar | Br |
  13. | At | Bi | -- At group
  14. | Au | Be | -- Au group
  15. +------+------+

Example 2-8. Distinct groups of column a values

我用空行分隔示例 2-8 中的分组【groups】,并在每个分组的第一行进行注释。 带有 GROUP BY a 的查询【query 】可以使用索引 idx_a_b 因为列 a 是最左前缀,并且该索引按列 a 的值隐式分组。 示例 2-9 是最简单类型的 GROUP BY 优化的代表性 EXPLAIN PLAN。

  1. EXPLAIN SELECT a, COUNT(*) FROM elem GROUP BY a\G
  2. *************************** 1. row ***************************
  3. id: 1
  4. select_type: SIMPLE
  5. table: elem
  6. partitions: NULL
  7. > type: index
  8. possible_keys: idx_a_b
  9. key: idx_a_b
  10. key_len: 6
  11. ref: NULL
  12. rows: 10
  13. filtered: 100.00
  14. > Extra: Using index

Example 2-9. EXPLAIN plan for GROUP BY a

key: idx_a_b 确认 MySQL 使用索引来优化 GROUP BY。 由于索引是有序的,MySQL 可以确保列 a 的每个新值都是一个新的分组【group】。 例如,在读取最后一个“Ag”值后,索引顺序确保不再读取到“Ag”值,因此“Ag”分组就完成了。 

Extra 字段中的“Using index”表示 MySQL 只从索引中读取 a 列的值; 它没有根据主键读取完整的行。 我将会在“Covering Indexes”.”中介绍了这种优化。

此查询【query】使用索引,但不用于索引查找【index lookup】:type: index 表示索引扫描【index scan】(请参阅“Index scan”)。 由于没有 WHERE 子句来过滤行,MySQL 会读取所有行。 如果添加一个 WHERE 子句,MySQL 仍然可以为 GROUP BY 使用索引,但最左前缀条件仍然适用。 在这种情况下,查询【query】使用最左索引部分(即 a 列),因此 WHERE 条件必须在 a 或 b 列上才能满足最左前缀要求。 我们首先在 a 列上添加 WHERE 条件,如图 2-15 和示例 2-10 所示。

Figure 2-15. GROUP BY and WHERE on same index column

  1. EXPLAIN SELECT a, COUNT(a) FROM elem WHERE a != 'Ar' GROUP BY a\G
  2. *************************** 1. row ***************************
  3. id: 1
  4. select_type: SIMPLE
  5. table: elem
  6. partitions: NULL
  7. > type: range
  8. possible_keys: idx_a_b
  9. key: idx_a_b
  10. key_len: 3
  11. ref: NULL
  12. rows: 7
  13. filtered: 100.00
  14. > Extra: Using where; Using index

Example 2-10. EXPLAIN plan for GROUP BY and WHERE on same index column

Extra 字段中的“Using where”指的是 WHERE a != 'Ar'。 有趣的变化是 type: range。 范围访问类型【 range access type 】与不等运算符(!= 或 <>)一起使用。 你可以把它想象成 WHERE a < 'Ar' AND a > 'Ar',如图 2-16 所示。 

WHERE 子句中 b 列的条件仍然可以使用索引,因为条件无论在不同的 SQL 子句中,都满足最左前缀条件。 图 2-17 显示了这一点,示例 2-11 显示了 EXPLAIN PLAN。

Figure 2-17. GROUP BY and WHERE on different index columns

  1. EXPLAIN SELECT a, b FROM elem WHERE b = 'B' GROUP BY a\G
  2. *************************** 1. row ***************************
  3. id: 1
  4. select_type: SIMPLE
  5. table: elem
  6. partitions: NULL
  7. type: range
  8. possible_keys: idx_a_b
  9. key: idx_a_b
  10. key_len: 6
  11. ref: NULL
  12. rows: 6
  13. filtered: 100.00
  14. > Extra: Using where; Using index for group-by

Example 2-11. EXPLAIN plan for GROUP BY and WHERE on different index columns

示例 2-11 中的查询【query】有两个重要的细节:在 WHERE 子句中列 b 的相等条件,以及在 SELECT 子句中查询列 a 和 b。 这些细节启用了 Extra 字段中显示的特殊“Using index for group-by”优化。 例如,如果相等 (=) 更改为不相等 (!=),则查询优化将丢失。 当涉及到这样的查询优化时,细节至关重要。 您必须阅读 MySQL 手册以了解和应用这些细节。 “GROUP BY Optimization”在MySQL手册中有详细说明。

图 2-18 和示例 2-12 中的最后一个 GROUP BY 示例可能会让您感到惊讶。

Figure 2-18. GROUP BY without leftmost prefix

  1. EXPLAIN SELECT b, COUNT(*) FROM elem GROUP BY b\G
  2. *************************** 1. row ***************************
  3. id: 1
  4. select_type: SIMPLE
  5. table: elem
  6. partitions: NULL
  7. > type: index
  8. possible_keys: idx_a_b
  9. key: idx_a_b
  10. key_len: 6
  11. ref: NULL
  12. rows: 10
  13. filtered: 100.00
  14. > Extra: Using index; Using temporary

Example 2-12. EXPLAIN plan for GROUP BY without leftmost prefix

注意 key:idx_a_b:尽管查询【query】在 a 列上没有条件,但 MySQL 依然使用索引。 最左前缀条件上发生了什么? 这是因为 MySQL 正在扫描 a 列上的索引(type: index)。 您可以想象在 a 列上有一个恒真条件,例如 a = a。

如果是 GROUP BY c ,MySQL 是否仍会在列 a 上进行索引扫描【index scan】?不,它不会; 它会进行全表扫描【full table scan】。 图 2-18 可以工作,因为索引具有 b 列值; 但是它没有 c 列值。

Extra 字段中的“Using temporary”是在没有严格的最左前缀条件集下的副作用。 当 MySQL 从索引中读取 a 列的值时,它会在临时表(内存中)中收集 b 列的值。 在读取所有列 a 值后,它会扫描临时表以对 COUNT(*) 进行分组和聚合。

关于 GROUP BY 在索引和查询优化方面还有很多需要了解,但这些示例是基础。 与 WHERE 子句不同,GROUP BY 子句往往更简单。 挑战在于创建索引以优化 GROUP BY 以及其他 SQL 子句。 MySQL 在制定查询执行计划【query execution plan】时也面临同样的挑战,因此即使可能,它也可能不会优化 GROUP BY。 MySQL 几乎总是选择最佳的查询执行计划,但如果您想尝试不同的计划,请阅读 MySQL 手册中的“Index Hints” 。

ORDER BY

不出所料,MySQL 可以使用有序索引【ordered inde】来优化 ORDER BY。 这种优化通过按顺序访问行来避免对行进行排序,这需要更多时间。 如果没有这种优化,MySQL 会读取所有匹配的行,对它们进行排序,然后返回排序后的结果集。 当 MySQL 对行进行排序时,它会在 EXPLAIN PLAN 的 Extra 字段中打印“Using filesort”。 Filesort 表示 sort rows.。 这是一个历史性的(现在是误导性的)术语,但仍然是 MySQL 术语中的流行术语。

Filesort 让工程师们大吃一惊,因为它以速度慢着称。 对行进行排序是额外的工作,因此它不会提高响应时间,但通常不是响应时间缓慢的根本原因。 在本节的最后,我使用 MySQL 8.0.18 中新增的 EXPLAIN ANALYZE 来测量 Filesort 的实时损失。(剧透:行排序非常快)但首先,让我们看看如何使用索引来优化 ORDER BY。

使用索引来优化 ORDER BY 有三种方法。 第一种也是最简单的方法是为 ORDER BY 子句使用索引的最左侧前缀。 对于表 elem,这意味着:

  • ORDER BY id

  • ORDER BY a

  • ORDER BY a, b

第二种方法是保留索引常量【index constant】的最左部分,并按下一个索引列排序。例如,保持列 a 为常量,按列 b 排序,如图2-19所示,对应例 2-13 中的 EXPLAIN PLAN。 

Figure 2-19. ORDER BY and WHERE on different index columns

  1. EXPLAIN SELECT a, b FROM elem WHERE a = 'Ar' ORDER BY b\G
  2. *************************** 1. row ***************************
  3. id: 1
  4. select_type: SIMPLE
  5. table: elem
  6. partitions: NULL
  7. type: ref
  8. possible_keys: idx_a_b
  9. key: idx_a_b
  10. key_len: 3
  11. ref: const
  12. rows: 3
  13. filtered: 100.00
  14. Extra: Using index

Example 2-13. EXPLAIN plan for ORDER BY and WHERE on different index columns

WHERE a = 'Ar' ORDER BY b 可以使用索引 (a, b) 因为第一个索引部分(a 列)上的 WHERE 条件是恒定的,因此 MySQL 跳转到索引 a = 'Ar' 那里,并从那里按顺序读取 b 列的值。 示例 2-14 是结果集,虽然没有什么花哨的,但它表明 a 列是常量(值“Ar”),b 列是排序的。

  1. +------+------+
  2. | a | b |
  3. +------+------+
  4. | Ar | B |
  5. | Ar | Br |
  6. | Ar | Br |
  7. +------+------+

Example 2-14. Result set of WHERE a = 'Ar' ORDER BY b

如果表 elem 在列 a、b、c 上有一个索引,则像 WHERE a = 'Au' AND b = 'Be' ORDER BY c 这样的查询可以使用该索引,因为列 a 和 b 上的条件满足了索引的最左部分。 

第三种方式是第二种方式的特例。 在展示解释它的图之前,看看你是否能确定为什么示例 2-15 中的查询不会导致 filesort (即为什么 Extra 字段中没有报告“Using filesort ”)。

  1. EXPLAIN SELECT * FROM elem WHERE a = 'Al' AND b = 'B' ORDER BY id\G
  2. *************************** 1. row ***************************
  3. id: 1
  4. select_type: SIMPLE
  5. table: elem
  6. partitions: NULL
  7. type: ref
  8. possible_keys: idx_a_b
  9. key: idx_a_b
  10. key_len: 16
  11. ref: const,const
  12. rows: 2
  13. filtered: 100.00
  14. > Extra: Using index condition

Example 2-15. EXPLAIN plan for ORDER BY id

查询使用索引 idx_a_b 是可以理解的,因为 WHERE 条件是最左前缀,但 ORDER BY id 不应该导致 filesort  吗? 图 2-20 揭示了答案。 

Figure 2-20. ORDER BY using primary key appended to secondary index

“Leftmost Prefix Requirement”开头有这么一段话“隐藏在每个二级索引的末尾(最右)的是主键”。 这就是图 2-20 中发生的情况:索引列 id 周围的黑框显示了附加到二级索引的“隐藏”主键。 这种 ORDER BY 优化对于像 elem 这样的小表似乎没有用,但对于真正的表,它可能非常有用——值得记住。 

为了证明“隐藏”主键允许 ORDER BY 避免 filesort  ,让我们删除列 b 上的条件以使优化无效,如图 2-21 所示,EXPLAIN PLAN 如示例 2-16 。

Figure 2-21. ORDER BY without leftmost prefix

  1. EXPLAIN SELECT * FROM elem WHERE a = 'Al' ORDER BY id\G
  2. *************************** 1. row ***************************
  3. id: 1
  4. select_type: SIMPLE
  5. table: elem
  6. partitions: NULL
  7. type: ref
  8. possible_keys: idx_a_b
  9. key: idx_a_b
  10. key_len: 8
  11. ref: const
  12. rows: 3
  13. filtered: 100.00
  14. > Extra: Using index condition; Using filesort

Example 2-16. EXPLAIN plan for ORDER BY without leftmost prefix

通过删除 b 列上的条件,二级索引上不再有允许 MySQL 使用“隐藏”主键来优化 ORDER BY 的最左前缀。 因此,对于这个特定的查询,“Using filesort”出现在 Extra 字段中。 

在这里,新的优化是“Using index condition”,称为索引条件下推【 index condition pushdown】。 索引条件下推【 index condition pushdown】意味着存储引擎使用索引来匹配满足 WHERE 条件的行。 通常,存储引擎只读取和写入行,而 MySQL 处理匹配满足条件的行的逻辑。 这是一个清晰的关注点分离(这是软件设计的优点),但是当行不匹配时效率很低:MySQL 和存储引擎都浪费时间在读取不匹配的行上了。 对于示例 2-16 中的查询,索引条件下推意味着存储引擎(InnoDB)使用索引 idx_a_b 来匹配条件 a = 'Al'。 索引条件下推有助于提高响应时间,但不要试图优化它,因为 MySQL 会尽可能自动使用它。 要了解更多信息,请阅读 MySQL 手册中的“Index Condition Pushdown Optimization” 。

有一个重要的细节会影响所有 ORDER BY 优化:索引顺序默认为升序,而 ORDER BY col 表示升序:ORDER BY col ASC。 对于所有列,优化 ORDER BY 仅在一个方向上起作用:ASC(升序)或 DESC(降序)。 因此,ORDER BY a, b DESC 不起作用,因为 a 列是隐式 ASC 排序,与 b DESC 不同。

NOTE MySQL 8.0支持降序索引【descending indexes.】。

文件排序的实时惩罚是多少? 在 MySQL 8.0.18 之前,它既没有测量也没有报告。 但是从 MySQL 8.0.18 开始,EXPLAIN ANALYZE 可以测量并报告它。 仅对于示例 2-17,我必须使用不同的表。

  1. CREATE TABLE `sbtest1` (
  2. `id` int NOT NULL AUTO_INCREMENT,
  3. `k` int NOT NULL DEFAULT '0',
  4. `c` char(120) NOT NULL DEFAULT '',
  5. `pad` char(60) NOT NULL DEFAULT '',
  6. PRIMARY KEY (`id`),
  7. KEY `k_1` (`k`)
  8. ) ENGINE=InnoDB;

Example 2-17. Sysbench table sbtest

那是一个标准的 sysbench 表; 我加载了一百万行。 让我们使用一个随机的、无意义的查询,它有一个大的结果集和 ORDER BY: 

  1. SELECT c FROM sbtest1 WHERE k < 450000 ORDER BY id;
  2. -- Output omitted
  3. 68439 rows in set (1.15 sec)

该查询需要 1.15 秒来排序并返回超过 68,000 行。 但这不是一个糟糕的查询。 查看其 EXPLAIN plan:

  1. EXPLAIN SELECT c FROM sbtest1 WHERE k < 450000 ORDER BY id\G
  2. *************************** 1. row ***************************
  3. id: 1
  4. select_type: SIMPLE
  5. table: sbtest1
  6. partitions: NULL
  7. type: range
  8. possible_keys: k_1
  9. key: k_1
  10. key_len: 4
  11. ref: NULL
  12. rows: 133168
  13. filtered: 100.00
  14. Extra: Using index condition; Using MRR; Using filesort

该 EXPLAIN 计划中唯一的新信息是 Extra 字段中的“Using MRR”,它指的是“Multi-Range Read Optimization”。 除此之外,该 EXPLAIN PLAN 报告的信息本章中都已涵盖。

filesort 会使这个查询变慢吗? EXPLAIN ANALYZE 揭示了答案,尽管含糊其辞:

  1. EXPLAIN ANALYZE SELECT c FROM sbtest1 WHERE k < 450000 ORDER BY id\G
  2. *************************** 1. row ***************************
  3. 1 -> Sort: sbtest1.id (cost=83975.47 rows=133168)
  4. 2 (actual time=1221.170..1229.306 rows=68439 loops=1)
  5. 3 -> Index range scan on sbtest1 using k_1, with index condition: (k<450000)
  6. 4 (cost=83975.47 rows=133168) (actual time=40.916..1174.981 rows=68439)

EXPLAIN ANALYZE 的实际输出更宽,但是为了便于打印和参考,我对行进行了换行和编号。 EXPLAIN ANALYZE 输出很密集,需要多加练习才能理解; 现在,让我们直奔主题——或者尽可能直截了当,因为输出不是按顺序读取的。 在第 4 行,1174.981(毫秒)表示索引范围扫描【 index range scan】(第 3 行)耗时 1.17 秒(四舍五入)。 在第 2 行,1221.170..1229.306 表示 filesort (第 1 行)在 1,221 毫秒后开始并在 1,229 毫秒后结束,这意味着 filesort  耗时 8 毫秒。 总执行时间为 1.23 秒:95% 的时间读取行【reading rows】和不到 1% 的时间排序行【sorting rows】。 剩下的 4%(大约 49 毫秒)用于其他阶段:准备、统计、记录、清理等。

答案是否定的:filesort 不会使这个查询变慢。 问题是数据访问:68,439 行不是一个小的结果集。 对于每秒执行数十亿次操作的 CPU,对 68,439 个值进行排序实际上是零工作。 但是对于必须遍历索引、管理事务等的关系数据库来说,读取 68,439 行是一项可观的工作。要优化这样的查询,请关注“Data Access”

最后一个要解决的问题:为什么 filesort 以速度慢着称? 因为 MySQL 在排序数据超过 sort_buffer_size 时使用磁盘上的临时文件,而硬盘驱动器比内存慢几个数量级。 几十年前,当旋转磁盘成为常态时,情况尤其如此。 但今天,SSD 已成为常态,而且存储速度总体上相当快。 对于高吞吐量 (QPS) 的查询,filesort 可能是一个问题,但请使用 EXPLAIN ANALYZE 进行测量和验证。

现在回到表 elem(示例 2-1)和 MySQL 可以使用索引的下一个案例:覆盖索引【covering indexes】。

Covering Indexes

覆盖索引包括查询【query】中引用的所有列。 图 2-22 显示了 SELECT 语句的覆盖索引。

Figure 2-22. Covering indexes

 列 a 和 b 上的 WHERE 条件像往常一样指向相应的索引列,但这些索引列也指向 SELECT 子句中的相应列,以表示这些列的值是从索引中读取的。

通常,MySQL 从主键读取完整的行(回想一下“InnoDB 表是索引”)。 但是使用覆盖索引,MySQL 只需从索引中读取列值。 这对二级索引最有帮助,因为它避免了主键查找。

MySQL 自动使用覆盖索引优化,EXPLAIN 在 Extra 字段中将其报告为“Using index”。 “Using index for group-by”是特定于 GROUP BY 和 DISTINCT 的类似优化,如 “GROUP BY”中所讲的那样。 但是“Using index condition”和“Using index for skip scan”是完全不同且不相关的优化。

索引扫描(type: index)加上覆盖索引(Extra: Using index)就是仅索引扫描【index-only scan】(请参阅“Index scan”)。“GROUP BY”:中有两个示例:示例 2-9 和示例 2-12。

覆盖索引很吸引人,但很少实用,因为实际查询有太多列、条件和子句,一个索引无法全部覆盖。 不要花时间尝试创建覆盖索引。 在设计或分析使用很少列的简单查询时,请花点时间看看覆盖索引是否可行。 如果是这样,那么恭喜。 如果没有,那没关系; 没有人期望覆盖索引。

Join Tables

MySQL 使用索引来连接表,这种用法与使用索引进行其他任何事情基本相同。 主要区别在于每个表的连接条件中使用的值的来源。 这在可视化时变得更加清晰,但首先我们需要第二个表来连接。 示例 2-18 显示了表 elem_names 的结构及其包含的 14 行。

  1. CREATE TABLE `elem_names` (
  2. `symbol` char(2) NOT NULL,
  3. `name` varchar(16) DEFAULT NULL,
  4. PRIMARY KEY (`symbol`)
  5. ) ENGINE=InnoDB;
  6. +--------+-----------+
  7. | symbol | name |
  8. +--------+-----------+
  9. | Ag | Silver |
  10. | Al | Aluminum |
  11. | Ar | Argon |
  12. | At | Astatine |
  13. | Au | Gold |
  14. | B | Boron |
  15. | Be | Beryllium |
  16. | Bi | Bismuth |
  17. | Br | Bromine |
  18. | C | Carbon |
  19. | Cd | Cadmium |
  20. | Ce | Cerium |
  21. | Co | Cobalt |
  22. | Cr | Chromium |
  23. +--------+-----------+

Example 2-18. Table elem_names

表 elem_name 有一个索引:列 symbol上的主键。 列 symbol 中的值与表 elem 列 a、b 和 c 中的值匹配。 因此,我们可以在这些列上连接表 elem 和 elem_names。

图 2-23 显示了连接表 elem 和 elem_names 的 SELECT 语句,以及每个表的条件和索引的可视化表示。

在前面的图中,只有一个索引和 SQL 子句的序对,因为只有一个表。 但是图 2-23 有两个序对——每个表一个——用大的向右的 chevrons 形勾勒出来,每个序对都注释了表名:/* elem */ 和 /* elem_names */。 与 EXPLAIN 一样,这些图按连接顺序列举出表:从上到下。 表 elem(顶部)是连接顺序中的第一个表,表 elem_names(底部)是第二个表。

Figure 2-23. Join table on primary key lookup

 表 elem 上的索引使用并不是什么新鲜或特殊的事情:MySQL 将索引用于条件 a IN (...)。 到目前为止,一切都很好。

表 elem_names (与前一个表相连接) 上的索引使用,从根本上是相同的,只有两个微小的区别。首先,WHERE 子句是对 JOIN...ON 子句的重写——稍后会详细介绍。其次,列 symbol 的条件值来自上一个表:elem。为了表示这一点,一个箭头从上一个表指向尖括号中的列引用:<elem.a>。在表连接时,MySQL 使用来自表 elem 中匹配行的列 a 值来查找表 elem_names 中的行,以获得列符号上的连接条件。 在 MySQL 术语中,我们会说“symbol 等于表 elem 中的列 a”。 给定上一个表中的值,对列 symbol 的主键查找并不是什么新鲜事或特别之处:如果某一行匹配,则将其返回并与上一个表中的行连接。

示例 2-19 显示了图 2-23 中 SELECT 语句的 EXPLAIN PLAN。

  1. Chapter 2. Indexes and Indexing
  2. Many factors determine MySQL performance, but indexes are special because performance cannot be achieved without them. You can remove other factors—queries, schemas, data, and so onand still achieve performance, but removing indexes limits performance to brute force: relying on the speed and capacity of hardware. If this book were titled Brute Force MySQL Performance, the contents would be as long as the title: “Buy better, faster hardware.” You laugh, but just a few days ago I met with a team of developers who had been improving performance in the cloud by purchasing faster hardware until stratospheric costs compelled them to ask, “How else can we improve performance?”
  3. MySQL leverages hardware, optimizations, and indexes to achieve performance when accessing data. Hardware is an obvious leverage because MySQL runs on hardware: the faster the hardware, the better the performance. Less obvious and perhaps more surprising is that hardware provides the least leverage. I explain why in a moment. Optimizations refer to the numerous techniques, algorithms, and data structures that enable MySQL to utilize hardware efficiently. Optimizations bring the power of hardware into focus. And focus is the difference between a light bulb and a laser. Consequently, optimizations provide more leverage than hardware. If databases were small, hardware and optimizations would be sufficient. But increasing data size deleverages the benefits of hardware and optimizations. Without indexes, performance is severely limited.
  4. To illustrate these points, think of MySQL as a fulcrum that leverages hardware, optimizations, and indexes to figuratively lift data, as shown in Figure 2-1.
  5. emsp 0201
  6. Figure 2-1. MySQL performance without indexes
  7. Without indexes (on the right side), MySQL achieves limited performance with relatively small data. But add indexes to the balance, as shown in Figure 2-2, and MySQL achieves high performance with large data.
  8. emsp 0202
  9. Figure 2-2. MySQL performance with indexes
  10. Indexes provide the most and the best leverage. They are required for any nontrivial amount of data. MySQL performance requires proper indexes and indexing, both of which this chapter teaches in detail.
  11. Several years ago, I designed and implemented an application that stores a lot of data. Originally, I estimated the largest table not to exceed a million rows. But there was a bug in the data archiving code that allowed the table to reach one billion rows. For years, nobody noticed because response time was always great. Why? Good indexes.
  12. NOTE
  13. It’s commonly said that MySQL uses only one index per table, but that’s not entirely true. The index merge optimization, for example, can use two indexes. In this book, however, I focus on the normal case: one query, one table, one index.
  14. This chapter teaches MySQL indexes and indexing. There are five major sections. The first argues why you should not be distracted by hardware or MySQL tuning. It’s a necessary digression in order to fully understand why hardware and MySQL tuning are not efficient solutions for improving MySQL performance. The second is a visual introduction to MySQL indexes: what they are and how they work. The third teaches indexing—applying indexes for maximum leverage—by thinking like MySQL. The fourth covers common reasons why indexes lose effectiveness (leverage). The fifth is a brief overview of MySQL table join algorithms because effective joins rely on effective indexes.
  15. Red Herrings of Performance
  16. Red herring is an idiom that refers to a distraction from a goal. When tracking down solutions to improve MySQL performance, two red herrings commonly distract engineers: faster hardware and MySQL tuning.
  17. Better, Faster Hardware!
  18. When MySQL performance isn’t acceptable, do not begin by scaling up (using better, faster hardware) to “see if that helps.” It probably will help if you scale up significantly, but you learn nothing because it only proves what you already know: computers run faster on faster hardware. Better, faster hardware is a red herring of performance because you miss learning the real causes of, and solutions to, slow performance.
  19. There are two reasonable exceptions. First, if the hardware is blatantly insufficient, then scale up to reasonable hardware. For example, using 1 GB of memory with 500 GB of data is blatantly insufficient. Upgrading to 32 GB or 64 GB of memory is reasonable. By contrast, upgrading to 384 GB of memory is sure to help but is unreasonable. Second, if the application is experiencing hyper-growth (a massive increase in users, usage, and data) and scaling up is a stopgap solution to keep the application running, then do it. Keeping the application running is always reasonable.
  20. Otherwise, scaling up to improve MySQL performance happens last. Experts agree: first optimize queries, data, access patterns, and the application. If all those optimizations do not yield sufficient performance, then scale up. Scaling up happens last for the following reasons.
  21. You don’t learn anything by scaling up, you simply clobber the problem with faster hardware. Since you’re an engineer, not a cave-dwelling protohuman, you solve problems by learning and understanding—you don’t clobber them. Admittedly, learning and understanding is more difficult and time-consuming, but it’s far more effective and sustainable, which leads to the next reason.
  22. Scaling up is not a sustainable approach. Upgrading physical hardware is nontrivial. Some upgrades are relatively quick and easy, but it depends on many factors outside the scope of this book. Sufficient to say, however, that you will drive yourself or the hardware engineers crazy if you frequently change hardware. Crazy engineers are not sustainable. Moreover, companies often use the same hardware for several years because the purchasing process is long and complicated. As a result, easy hardware scalability is one allure of the cloud. In the cloud, you can scale up (or down) CPU cores, memory, and storage in a few minutes. But this ease is significantly more expensive than physical hardware. Cloud costs can increase exponentially. The cost of Amazon RDS, for example, doubles from one instance size to the next—double the hardware, double the price. Exponentially increasing costs are not sustainable.
  23. Generally speaking, MySQL can fully utilize all the hardware that it’s given. (There are limits, which I address in Chapter 4.) The real question is: can the application fully utilize MySQL? The presumptive answer is yes, but it’s not guaranteed. Faster hardware helps MySQL but it does not change how the application uses MySQL. For example, increasing memory might not improve performance if the application causes table scans. Scaling up is only effective at increasing performance when the application workload can scale up, too. Not all workloads can scale up.
  24. NOTE
  25. Workload is the combination of queries, data, and access patterns.
  26. But let’s imagine that you successfully scale up the workload to fully utilize MySQL on the fastest hardware available. What happens as the application continues to grow, and its workload continues to increase? This reminds me of a Zen proverb: “When you reach the top of the mountain, keep climbing.” While I do encourage you to meditate on that, it presents a less enlightening dilemma for your application. With nowhere else to go, the only option is doing what should have been done first: optimize queries, data, access patterns, and the application.
  27. MySQL Tuning
  28. In the television series Star Trek, engineers are able to modify the ship to increase power to engines, weapons, shields, sensors, transporters, tractor beams—everything. MySQL is more difficult to operate than a starship because no such modifications are possible. But that does not stop engineers from trying.
  29. First, let’s clarify three terms.
  30. Tuning
  31. Tuning is adjusting MySQL system variables for research and development (R&D). It’s laboratory work with specific goals and criteria. Benchmarking is common: adjusting system variables to measure the effect on performance. The blog post “MySQL Challenge: 100k Connections” by renowned MySQL expert Vadim Tkachenko is an example of extreme tuning. Since tuning is R&D, the results are not expected to be generally applicable; rather, the goal is to expand our collective knowledge and understanding of MySQL, especially with respect to its current limits. Tuning influences future MySQL development and best practices.
  32. Configuring
  33. Configuring is setting system variables to values that are appropriate for the hardware and environment. The goal is a reasonable configuration with respect to a few default values that need to be changed. Configuring MySQL is usually done when the MySQL instance is provisioned or when hardware changes. It’s also necessary to reconfigure when data size increases by an order of magnitude, for example from 10 GB to 100 GB. Configuring influences how MySQL runs in general.
  34. Optimizing
  35. Optimizing is improving MySQL performance by reducing the workload or making it more efficient—usually the latter since application usage tends to increase. The goal is faster response time and more capacity with the existing hardware. Optimizing influences MySQL and application performance.
  36. You will undoubtedly encounter these terms in MySQL literature, videos, conferences, and so forth. The descriptions are more important than the terms. If, for example, you read a blog post that uses optimizing but describes what is defined here as tuning, then it’s tuning as defined here.
  37. The distinction of these terms is important because engineers do all three, but only optimizing (as defined here) is an efficient use of your time.1
  38. MySQL tuning is a red herring of performance for two reasons. First, it’s often not done as a controlled laboratory experiment, which makes the results dubious. In totality, MySQL performance is complex; experiments must be carefully controlled. Second, results are unlikely to have a significant effect on performance because MySQL is already highly optimized. Tuning MySQL is akin to squeezing blood from a turnip.
  39. Going back to the first paragraph of this section, I realize that we all admire Lieutenant Commander Geordi La Forge, the Chief Engineer in Star Trek: The Next Generation. When the captain calls for more power, we feel obligated to make it so by applying arcane server parameters. Or, on Earth, when the application needs more power, we want to save the day by applying an ingenious reconfiguration of MySQL that boosts throughput and concurrency by 50%. Good work, La Forge! Unfortunately, MySQL 8.0 introduced automatic configuration by enabling innodb_dedicated_server. Since MySQL 5.7 will be end-of-life (EOL) soon after this book is published, let’s keep looking to and building the future. Good work nevertheless, La Forge.
  40. Optimizing is all you need to do because tuning is a red herring and configuration is automatic as of MySQL 8.0. This book is all about optimizing.
  41. MySQL Indexes: A Visual Introduction
  42. Indexes are the key to performance and, if you recall “Direct Query Optimization”, changes to queries and indexes solve a lot of performance problems. The journey of query optimization requires a solid understanding of MySQL indexes, and that’s what this section presents—in detail with copious illustrations.
  43. Although this section is detailed and relatively long, I call it an introduction because there is more to learn. But this section is the key that unlocks the treasure chest of MySQL query optimizations.
  44. The following nine sections apply only to standard indexes on InnoDB tables—the type of index created by a simple PRIMARY KEY or [UNIQUE] INDEX table definition. MySQL supports other specialized index types, but I don’t cover them in this book because standard indexes are the basis of performance.
  45. Before we dive into the details of MySQL indexes, I begin with a revelation about InnoDB tables that will change the way you see not only indexes but most of MySQL performance.
  46. InnoDB Tables Are Indexes
  47. Example 2-1 is the structure of table elem (short for elements) and the 10 rows that it contains. All examples in this chapter refer to table elem—with one clearly noted exception—so take a moment to study it.
  48. Example 2-1. Table elem
  49. CREATE TABLE `elem` (
  50. `id` int unsigned NOT NULL,
  51. `a` char(2) NOT NULL,
  52. `b` char(2) NOT NULL,
  53. `c` char(2) NOT NULL,
  54. PRIMARY KEY (`id`),
  55. KEY `idx_a_b` (`a`,`b`)
  56. ) ENGINE=InnoDB;
  57. +----+------+------+------+
  58. | id | a | b | c |
  59. +----+------+------+------+
  60. | 1 | Ag | B | C |
  61. | 2 | Au | Be | Co |
  62. | 3 | Al | Br | Cr |
  63. | 4 | Ar | Br | Cd |
  64. | 5 | Ar | Br | C |
  65. | 6 | Ag | B | Co |
  66. | 7 | At | Bi | Ce |
  67. | 8 | Al | B | C |
  68. | 9 | Al | B | Cd |
  69. | 10 | Ar | B | Cd |
  70. +----+------+------+------+
  71. Table elem has two indexes: the primary key on column id and a nonunique secondary index on columns a, b. The value for column id is a monotonically increasing integer. The values for columns a, b, and c are atomic symbols corresponding to the column name letter: “Ag” (silver) for column a, “B” (boron) for column b, and so on. The row values are random and meaningless; it’s just a simple table used for examples.
  72. Figure 2-3 shows a typical view of table elem—just the first four rows for brevity.
  73. emsp 0203
  74. Figure 2-3. Table elem: visual model
  75. Nothing special about table elem, right? It’s so simple, one might say it’s elementary. But what if I told you that it’s not really a table, it’s an index? Get the “F” (fluorine) out of here! Figure 2-4 shows the true structure of table elem as an InnoDB table.
  76. emsp 0204
  77. Figure 2-4. Table elem: InnoDB B-tree index
  78. InnoDB tables are B-tree indexes organized by the primary key. Rows are index records stored in leaf nodes of the index structure. Each index record has metadata (denoted by “…”) used for row locking, transaction isolation, and so on.
  79. Figure 2-4 is a highly simplified depiction of the B-tree index that is table elem. Four index records (at bottom) correspond to the first four rows. Primary key column values (1, 2, 3, and 4) are shown at the top of each index record. Other column values (“Ag,” “B,” “C,” and so forth) are shown below the metadata for each index record.
  80. You don’t need to know the technical details of InnoDB B-tree indexes to understand or achieve remarkable MySQL performance. Only two points are important:
  81. Primary key lookups are extremely fast and efficient
  82. The primary key is pivotal to MySQL performance
  83. The first point is true because B-tree indexes are inherently fast and efficient, which is one reason why many database servers use them. The second point becomes increasingly clear in the coming sections—and chapters.
  84. To learn about the fascinating world of database internals, including indexes, read Database Internals by Alex Petrov (O’Reilly, 2019). For a deep dive into InnoDB internals, including its B-tree implementation, cancel all your meetings and check out the website of renowned MySQL expert Jeremy Cole.
  85. NOTE
  86. An InnoDB primary key is a clustered index. The MySQL manual occasionally refers to the primary key as the clustered index.
  87. Indexes provide the most and the best leverage because the table is an index. The primary key is pivotal to performance. This is especially true because secondary indexes include primary key values. Figure 2-5 shows the secondary index on columns a, b.
  88. emsp 0205
  89. Figure 2-5. Secondary index on columns a, b
  90. Secondary indexes are B-tree indexes, too, but leaf nodes store primary key values. When MySQL uses a secondary index to find a row, it does a second lookup on the primary key to read the full row. Let’s put the two together and walk through a secondary index lookup for query SELECT * FROM elem WHERE a='Au' AND b='Be':
  91. emsp 0206
  92. Figure 2-6. Secondary index lookup for value “Au, Be”
  93. Figure 2-6 shows the secondary index (columns a, b) on top and the primary key (column id) on bottom. Six callouts (numbered circles) show the lookup for value “Au, Be” using the secondary index:
  94. Index lookups begin at the root node; branch right to an internal node for value “Au, Be.”
  95. At an internal node, branch right to the leaf node for value “Au, Be.”
  96. Leaf node for secondary index value “Au, Be” contains the corresponding primary key value: 2.
  97. Begin primary key lookup at the root node; branch left to an internal node for value 2.
  98. At an internal node, branch right to the leaf node for value 2.
  99. Leaf node for primary key value 2 contains the full row matching “Au, Be.”
  100. NOTE
  101. A table has only one primary key. All other indexes are secondary indexes.
  102. This section is short but incredibly important because the correct model provides the foundation for understanding indexes and more. For example, if you think back to “Lock time”, you might see it in a new light since rows are actually leaf nodes in the primary key. Knowing that an InnoDB table is its primary key is akin to knowing that heliocentrism, not geocentrism, is the correct model of the solar system. In the world of MySQL, everything revolves around the primary key.
  103. Table Access Methods
  104. Using an index to look up rows is one of three table access methods. Since tables are indexes, an index lookup is the best and most common access method. But sometimes, depending on the query, an index lookup is not possible and the only recourse is an index scan or a table scan—the other access methods. Knowing which access method MySQL uses for a query is imperative because performance requires an index lookup. Avoid index scans and table scans. “EXPLAIN: Query Execution Plan” shows how to see the access method. But first, let’s clarify and visualize each one.
  105. NOTE
  106. The MySQL manual uses the terms access method, access type, and join type. And EXPLAIN uses a field called type or access_type to refer to those terms. In MySQL, the terms are closely related but used equivocally.
  107. In this book, for precision and consistency I use only two terms: access method and access type. There are three access methods: index lookup, index scan, and table scan. For an index lookup, there are several access types: ref, eq_ref, range, and so forth.
  108. Index lookup
  109. An index lookup finds specific rowsor ranges of rowsby leveraging the ordered structure and algorithmic access of an index. This is the fastest access method because it’s precisely what indexes are designed for: fast and efficient access to large amounts of data. Consequently, index lookups are essential for direct query optimization. Performance requires that practically every query uses an index lookup for every table. There are several access types for an index lookup that I cover in forthcoming sections such asWHERE”.
  110. Figure 2-6 in the previous section shows an index lookup using a secondary index.
  111. Index scan
  112. When an index lookup is not possible, MySQL must use brute force to find rows: read all rows and filter out non-matching ones. Before MySQL resorts to reading every row using the primary key, it tries to read rows using a secondary index. This is called an index scan.
  113. There are two types of index scan. The first is a full index scan, meaning MySQL reads all rows in index order. Reading all rows is usually terrible for performance, but reading them in index order can avoid sorting rows when the index order matches the query ORDER BY.
  114. Figure 2-7 shows a full index scan for query SELECT * FROM elem FORCE INDEX (a) ORDER BY a, b. The FORCE INDEX clause is required because, since table elem is tiny, it’s more efficient for MySQL to scan the primary key and sort the rows rather than scan the secondary index and fetch the rows in order. (Sometimes bad queries make good examples.)
  115. Figure 2-7 has eight callouts (numbered circles) that show the order of row access:
  116. Read first value of secondary index (SI): “Ag, B.”
  117. Look up corresponding row in primary key (PK).
  118. Read second value of SI: “Al, Br.”
  119. Look up corresponding row in PK.
  120. Read third value of SI: “Ar, Br.”
  121. Look up corresponding row in PK.
  122. Read fourth value of SI: “Au, Be.”
  123. Look up corresponding row in PK.
  124. emsp 0207
  125. Figure 2-7. Full index scan on secondary index
  126. There is a subtle but important detail in Figure 2-7: scanning the secondary index in order might be sequential reads, but the primary key lookups are almost certainly random reads. Accessing rows in index order does not guarantee sequential reads; more than likely, it incurs random reads.
  127. NOTE
  128. Sequential access (reads and writes) is faster than random access.
  129. The second type of index scan is an index-only scan: MySQL reads column values (not full rows) from the index. This requires a covering index, which is covered later (pun intended) in “Covering Indexes”. It should be faster than a full index scan because it doesn’t require primary key lookups to read full rows; it only reads column values from the secondary index, which is why it requires a covering index.
  130. Don’t optimize for an index scan unless the only alternative is a full table scan. Otherwise, avoid index scans.
  131. Table scan
  132. A (full) table scan reads all rows in primary key order. When MySQL cannot do an index lookup or an index scan, a table scan is the only option. This is usually terrible for performance, but it’s also usually easy to fix because MySQL is adept at using indexes and has many index-based optimizations. Essentially every query with a WHERE, GROUP BY, or ORDER BY clause can use an index—even if just an index scan—because those clauses use columns and columns can be indexed. Consequently, there are nearly zero reasons for an unfixable table scan.
  133. Figure 2-8 shows a full table scan: reading all rows in primary key order. It has four callouts that show the order of row access. Table elem is tiny and only four rows are shown here, but imagine MySQL slogging through thousands or millions of rows in a real table.
  134. The general advice and best practice is to avoid table scans. But for a complete and balanced discussion, there are two cases when a table scan might be acceptable or (surprisingly) better:
  135. When the table is tiny and infrequently accessed
  136. When the table selectivity is very low (see “Extreme Selectivity”)
  137. emsp 0208
  138. Figure 2-8. Full table scan
  139. But don’t take any table scan for granted: they’re usually bad for performance. In very rare cases, MySQL can incorrectly choose a table scan when an index lookup is possible, as explained in “It’s a Trap! (When MySQL Chooses Another Index)”.
  140. Leftmost Prefix Requirement
  141. To use an index, a query must use a leftmost prefix of the index: one or more index columns starting with the leftmost index column as specified by the index definition. A leftmost prefix is required because the underlying index structure is ordered by the index column order, and it can only be traversed (searched) in that order.
  142. NOTE
  143. Use SHOW CREATE TABLE or SHOW INDEX to see index definitions.
  144. Figure 2-9 shows an index on columns a, b, c and a WHERE clause using each leftmost prefix: column a; columns a, b; and columns a, b, c.
  145. emsp 0209
  146. Figure 2-9. Leftmost prefixes of a three-column index
  147. The top WHERE clause in Figure 2-9 uses column a, which is the leftmost column of the index. The middle WHERE clause uses columns a and b that, together, form a leftmost prefix of the index. And the bottom WHERE clause uses the entire index: all three columns. It’s ideal to use all columns of an index, but it’s not required; only a leftmost prefix is required. Index columns can be used in other SQL clauses, as illustrated by many examples in the following sections.
  148. TIP
  149. To use an index, a query must use a leftmost prefix of the index.
  150. The leftmost prefix requirement has two logical consequences:
  151. Indexes (a, b) and (b, a) are different indexes. They index the same columns but in a different order, which results in different leftmost prefixes. However, a query that uses both columns (for example, WHERE a = 'Au' AND b = 'Be') can use either index, but that does not mean the indexes are equivalent in terms of performance. MySQL will choose the better of the two by calculating many factors.
  152. MySQL can most likely use index (a, b, c) in place of indexes (a) and (a, b) because the latter two are leftmost prefixes of the first. In this case, indexes (a) and (a, b) are duplicates and can be dropped. Use pt-duplicate-key-checker to find and report duplicate indexes.
  153. Lurking at the end (rightmost) of every secondary index is the primary key. For table elem (Example 2-1), the secondary index is effectively (a, b, id), but the rightmost id is hidden. MySQL doesn’t show the primary key appended to secondary indexes; you have to imagine it.
  154. NOTE
  155. The primary key is appended to every secondary index: (S, P) where S are secondary index columns and P are primary key columns.
  156. In MySQL lingo we say, “The primary key is appended to secondary indexes” even though it’s not literally appended. (You can literally append it by creating index (a, b, id), but don’t do that.) “Appended to” really means that secondary index leaf nodes contain primary key values, as shown earlier in Figure 2-5. This is important because it increases the size of every secondary index: primary key values are duplicated in secondary indexes. Larger indexes require more memory, which means fewer indexes can fit in memory. Keep the size of the primary key small and the number of secondary indexes reasonable. Just the other day, my colleagues were helping a team whose database has 693 GB of secondary indexes on 397 GB of data (primary key).
  157. The leftmost prefix requirement is a blessing and a restriction. The restriction is relatively easy to work around with additional secondary indexes, but wait until you read “Excessive, Duplicate, and Unused”. Joining tables is a particular challenge given the restriction, but I address it inJoin Tables”. I encourage you to see the leftmost prefix requirement as a blessing. Query optimization with respect to indexing is not trivial, but the leftmost prefix requirement is a simple and familiar starting point on the journey.
  158. EXPLAIN: Query Execution Plan
  159. The MySQL EXPLAIN command shows a query execution plan (or, EXPLAIN plan) that describes how MySQL plans to execute the query: table join order, table access method, index usage, and other important details.
  160. EXPLAIN output is vast and varied. Moreover, it’s completely dependent on the query. Changing a single character in a query can significantly change its EXPLAIN plan. For example, WHERE id = 1 verses WHERE id > 1 yields a significantly different EXPLAIN plan. And to complicate the matter further, EXPLAIN continues to evolve. “EXPLAIN Output Format” in the MySQL manual is required reading—even for experts. Fortunately for the sake of our sanity, the fundamentals have remained the same for decades.
  161. To illustrate index usage, the next five sections explain queries for each case that MySQL can use an index:
  162. Find matching rows: “WHERE
  163. Group rows: “GROUP BY
  164. Sort rows: “ORDER BY
  165. Avoid reading rows: “Covering Indexes”
  166. Join tables: “Join Tables”
  167. There are other specific cases like MIN() and MAX(), but these five cases are the bread and butter of index usage.
  168. But first I need to set the stage by reviewing the meaning of the EXPLAIN output fields shown in Example 2-2.
  169. Example 2-2. EXPLAIN output (traditional format)
  170. EXPLAIN SELECT * FROM elem WHERE id = 1\G
  171. *************************** 1. row ***************************
  172. id: 1
  173. select_type: SIMPLE
  174. table: elem
  175. partitions: NULL
  176. type: const
  177. possible_keys: PRIMARY
  178. key: PRIMARY
  179. key_len: 4
  180. ref: const
  181. rows: 1
  182. filtered: 100.00
  183. Extra: NULL
  184. For this introduction, we ignore fields id, select_type, partitions, key_len, and filtered; but the examples include them to habituate you to the output. The remaining seven fields convey a wealth of information that constitutes the query execution plan:
  185. table
  186. The table field is the table name (or alias) or subquery reference. Tables are listed in the join order determined by MySQL, not the order they appear in the query. The top table is the first table, and the bottom table is the last table.
  187. type
  188. They type field is the table access method or index lookup access type—see the first note inTable Access Methods” for clarification. ALL means a full table scan (see “Table scan”). index means an index scan (see “Index scan”). Any other value—const, ref, range, and so onis an access type for an index lookup (see “Index lookup”).
  189. possible_keys
  190. The possible_keys field lists indexes that MySQL could use because the query uses a leftmost prefix. If an index is not listed in this field, then the leftmost prefix requirement is not met.
  191. key
  192. The key field is the name of the index that MySQL will use, or NULL if no index can be used. MySQL chooses the best index based on many factors, some of which are indicated in the Extra field. It’s a safe bet that MySQL will use this index when executing the query (EXPLAIN does not execute the query), but see “It’s a Trap! (When MySQL Chooses Another Index)”.
  193. ref
  194. The ref field lists the source of values used to look up rows in the index (the key field).
  195. For single-table queries or the first table in a join, ref is often const, which refers to a constant condition on one or more index columns. A constant condition is equality (= or <=> [NULL-safe equal]) to a literal value. For example, a = 'Au' is a constant condition that equals only one value.
  196. For queries that join multiple tables, ref is a column reference from the preceding table in the join order. MySQL joins the current table (the table field) using the index to look up rows that match values from column ref in the preceding table. “Join Tables” shows this in action.
  197. rows
  198. The rows field is the estimated number of rows that MySQL will examine to find matching rows. MySQL uses index statistics to estimate rows, so expect the real number—“Rows examined”—to be close but different.
  199. Extra
  200. The Extra field provides additional information about the query execution plan. This field is important because it indicates query optimizations that MySQL can apply, if any.
  201. NOTE
  202. All EXPLAIN output in this book is traditional format: tabular output (EXPLAIN query;) or list output (EXPLAIN query\G). Other formats are JSON (EXPLAIN FORMAT=JSON query) and, as of MySQL 8.0.16, tree (EXPLAIN FORMAT=TREE query). JSON and tree formats are completely different than traditional format, but all formats convey the query execution plan.
  203. Don’t expect to glean much information from those fields without context: tables, indexes, data, and a query. In the following sections, all illustrations refer to table elem (Example 2-1), its two indexes, and its ten rows.
  204. WHERE
  205. MySQL can use an index to find rows that match table conditions in a WHERE clause. I’m careful to say that MySQL can use an index, not that MySQL will use an index, because index usage depends on several factors, primarily: table conditions, indexes, and the leftmost prefix requirement (see “Leftmost Prefix Requirement”). (There are other factors, like index statistics and optimizer costs, but they’re beyond the scope of this book.)
  206. A table condition is a column and its value (if any) that matches, groups, aggregates, or orders rows. (For brevity, I use the term condition when it’s unambiguous.) In a WHERE clause, table conditions are also called predicates.
  207. Figure 2-10 shows the primary key on column id and a WHERE clause with a single condition: id = 1.
  208. emsp 0210
  209. Figure 2-10. WHERE: primary key lookup
  210. A solid box delineates a table condition and an index column (also called an index part) that MySQL can use because the former (table condition) is a leftmost prefix of the latter (index). An arrow points from the table condition to the index column that it uses. Later, we’ll see examples of table conditions and index columns that MySQL cannot use.
  211. In Figure 2-10, MySQL can find rows that match condition id = 1 using primary key column id. Example 2-3 is the EXPLAIN plan for the full query.
  212. Example 2-3. EXPLAIN plan for primary key lookup
  213. EXPLAIN SELECT * FROM elem WHERE id = 1\G
  214. *************************** 1. row ***************************
  215. id: 1
  216. select_type: SIMPLE
  217. table: elem
  218. partitions: NULL
  219. type: const
  220. possible_keys: PRIMARY
  221. key: PRIMARY
  222. key_len: 4
  223. ref: const
  224. rows: 1
  225. filtered: 100.00
  226. Extra: NULL
  227. In Example 2-3, key: PRIMARY confirms that MySQL will use the primary key—an index lookup. Correspondingly, the access type (the type field) is not ALL (table scan) or index (index scan), which is expected given a simple primary key lookup. The secondary index is not listed in the possible_keys field because MySQL cannot use it for this query: column id is not a leftmost prefix of the secondary index on columns a, b.
  228. Access type const is a special case that occurs only when there are constant conditions (ref: const) on all index columns of the primary key or a unique secondary index. The result is a constant row. This is a little too in-depth for an introduction, but since we’re here, let’s keep learning. Given the table data (Example 2-1) and the fact that column id is the primary key, the row identified by id = 1 can be treated as constant because, when the query is executed, id = 1 can match only one row (or no row). MySQL reads that one row and treats its values as constant, which is great for response time: const access is extremely fast.
  229. Extra: NULL is somewhat rare because real queries are more complex than these examples. But here, Extra: NULL means that MySQL does not need to match rows. Why? Because the constant row can match only one row (or no row). But matching rows is the norm, so let’s see a more realistic example by changing the table conditions to id > 3 AND id < 6 AND c = 'Cd', as shown in Figure 2-11 and the corresponding EXPLAIN plan in Example 2-4.
  230. emsp 0211
  231. Figure 2-11. WHERE: range access using primary key
  232. Example 2-4. EXPLAIN plan for range access using primary key
  233. EXPLAIN SELECT * FROM elem WHERE id > 3 AND id < 6 AND c = 'Cd'\G
  234. *************************** 1. row ***************************
  235. id: 1
  236. select_type: SIMPLE
  237. table: elem
  238. partitions: NULL
  239. > type: range
  240. possible_keys: PRIMARY
  241. key: PRIMARY
  242. key_len: 4
  243. > ref: NULL
  244. > rows: 2
  245. filtered: 10.00
  246. > Extra: Using where
  247. NOTE
  248. To highlight EXPLAIN plan changes, I prepend > characters to the pertinent fields that changed. These highlights are not part of EXPLAIN.
  249. By changing the table conditions to id > 3 AND id < 6 AND c = 'Cd', the EXPLAIN plan changes from Example 2-3 to Example 2-4, which is more realistic for a single-table query. The query still uses the primary key (key: PRIMARY), but the access type changes to a range scan (type: range): using an index to read rows between a range of values. In this case, MySQL uses the primary key to read rows where the value of column id is between 3 and 6. The ref field is NULL because the conditions on column id are not constant (and this is a single-table query, so there’s no preceding table to reference). The condition c = 'Cd' is constant, but it’s not used for the index lookup (the range scan), so ref does not apply. MySQL estimates that it will examine two rows in the range (rows: 2). That’s correct for this trivial example, but remember: rows is an estimate.
  250. Using wherein the Extra field is so common that it’s expected. It means that MySQL will find matching rows using the WHERE conditions: for each row read, a row matches if all WHERE conditions are true. Since the conditions on column id define the range, it’s really just the condition on column c that MySQL will use to match rows in the range. Glancing back at Example 2-1, one row matches all the WHERE conditions:
  251. +----+------+------+------+
  252. | id | a | b | c |
  253. +----+------+------+------+
  254. | 4 | Ar | Br | Cd |
  255. +----+------+------+------+
  256. The row with id = 5 is in the range, so MySQL examines the row, but its column c value (“Cd”) does not match the WHERE clause, so MySQL does not return the row.
  257. To illustrate other query execution plans, let’s use both leftmost prefixes of the secondary index, as shown in Figure 2-12 and the corresponding EXPLAIN plans in Example 2-5.
  258. emsp 0212
  259. Figure 2-12. WHERE: secondary index lookups
  260. Example 2-5. EXPLAIN plans for secondary index lookups
  261. EXPLAIN SELECT * FROM elem WHERE a = 'Au'\G
  262. *************************** 1. row ***************************
  263. id: 1
  264. select_type: SIMPLE
  265. table: elem
  266. partitions: NULL
  267. > type: ref
  268. possible_keys: idx_a_b
  269. > key: idx_a_b
  270. key_len: 3
  271. ref: const
  272. rows: 1
  273. filtered: 100.00
  274. Extra: NULL
  275. EXPLAIN SELECT * FROM elem WHERE a = 'Au' AND b = 'Be'\G
  276. *************************** 1. row ***************************
  277. id: 1
  278. select_type: SIMPLE
  279. table: elem
  280. partitions: NULL
  281. > type: ref
  282. possible_keys: idx_a_b
  283. > key: idx_a_b
  284. key_len: 6
  285. ref: const,const
  286. rows: 1
  287. filtered: 100.00
  288. Extra: NULL
  289. For each EXPLAIN plan in Example 2-5, key: idx_a_b confirms that MySQL uses the secondary index because the conditions meet the leftmost prefix requirement. The first WHERE clause uses only the first index part: column a. The second WHERE clause uses both index parts: columns a and b. Using only column b would not meet the leftmost prefix requirement—I show this in a moment.
  290. What’s new and important from previous EXPLAIN plans is the access type: ref. In simplest terms, the ref access type is an equality (= or <=>) lookup on a leftmost prefix of the index (the key field). Like any index lookup, ref access is very fast as long as the estimated number of rows to examine (the rows field) is reasonable.
  291. Although the conditions are constant, the const access type is not possible because the index (key: idx_a_b) is nonunique, so the lookup can match more than one row. And even though MySQL estimates that each WHERE clause will examine only one row (rows: 1), that could change when the query is executed.
  292. Extra: NULL occurs again because MySQL can find matching rows using only the index since there are no conditions on non-indexed columns—so let’s add one. Figure 2-13 shows a WHERE clause with conditions on columns a and c, and Example 2-6 is the corresponding EXPLAIN plan.
  293. emsp 0213
  294. Figure 2-13. WHERE: index lookup and non-indexed column
  295. Example 2-6. EXPLAIN plan for index lookup and non-indexed column
  296. EXPLAIN SELECT * FROM elem WHERE a = 'Al' AND c = 'Co'\G
  297. *************************** 1. row ***************************
  298. id: 1
  299. select_type: SIMPLE
  300. table: elem
  301. partitions: NULL
  302. type: ref
  303. possible_keys: idx_a_b
  304. key: idx_a_b
  305. key_len: 3
  306. ref: const
  307. > rows: 3
  308. filtered: 10.00
  309. > Extra: Using where
  310. In Figure 2-13, there is no box around condition c = 'Co' because the index does not cover column c. MySQL still uses the secondary index (key: idx_a_b), but the condition on column c prevents MySQL from matching rows using only the index. Instead, MySQL uses the index to look up and read rows for the condition on column a, then it matches rows for the condition on column c (Extra: Using where).
  311. Glancing back at Example 2-1 again, you’ll notice that zero rows match this WHERE clause, but EXPLAIN reports rows: 3. Why? The index lookup on column a matches three rows where a = 'Al' is true: row id values 3, 8, and 9. But none of these rows also matches c = 'Co'. The query examines three rows but matches zero rows.
  312. TIP
  313. EXPLAIN output rows is an estimate of the number of rows that MySQL will examine when it executes the query, not the number of rows that will match all table conditions.
  314. As a final example of indexes, WHERE, and EXPLAIN, let’s not meet the leftmost prefix requirement, as shown in Figure 2-14 and Example 2-7.
  315. emsp 0214
  316. Figure 2-14. WHERE without leftmost prefix
  317. Example 2-7. EXPLAIN plan for WHERE without leftmost prefix
  318. EXPLAIN SELECT * FROM elem WHERE b = 'Be'\G
  319. *************************** 1. row ***************************
  320. id: 1
  321. select_type: SIMPLE
  322. table: elem
  323. partitions: NULL
  324. > type: ALL
  325. possible_keys: NULL
  326. > key: NULL
  327. key_len: NULL
  328. ref: NULL
  329. rows: 10
  330. filtered: 10.00
  331. Extra: Using where
  332. A dotted box outline (and lack of arrow) delineates a table condition and an index column that MySQL cannot use because they do not meet the leftmost prefix requirement.
  333. In Figure 2-14, there is no condition on column a, therefore the index cannot be used for the condition on column b. The EXPLAIN plan (Example 2-7) confirms this: possible_keys: NULL and key: NULL. Without an index, MySQL is forced to do a full table scan: type: ALL. Likewise, rows: 10 reflects the total number of rows, and Extra: Using where reflects that MySQL reads and then filters rows not matching b = 'Be'.
  334. Example 2-7 is an example of the worst possible EXPLAIN plan. Whenever you see type: ALL, possible_keys: NULL, or key: NULL, stop what you’re doing and analyze the query.
  335. As simple as these examples have been, they represent the fundamentals of EXPLAIN with respect to indexes and WHERE clauses. Real queries have more indexes and WHERE conditions, but the fundamentals don’t change.
  336. GROUP BY
  337. MySQL can use an index to optimize GROUP BY because values are implicitly grouped by index order. For the secondary index idx_a_b (on columns a, b), there are five distinct groups of column a values, as shown in Example 2-8.
  338. Example 2-8. Distinct groups of column a values
  339. SELECT a, b FROM elem ORDER BY a, b;
  340. +------+------+
  341. | a | b |
  342. +------+------+
  343. | Ag | B | -- Ag group
  344. | Ag | B |
  345. | Al | B | -- Al group
  346. | Al | B |
  347. | Al | Br |
  348. | Ar | B | -- Ar group
  349. | Ar | Br |
  350. | Ar | Br |
  351. | At | Bi | -- At group
  352. | Au | Be | -- Au group
  353. +------+------+
  354. I separated the groups in Example 2-8 with blank lines and annotated the first row in each group. A query with GROUP BY a can use index idx_a_b because column a is a leftmost prefix and the index is implicitly grouped by column a values. Example 2-9 is a representative EXPLAIN plan for the simplest type of GROUP BY optimization.
  355. Example 2-9. EXPLAIN plan for GROUP BY a
  356. EXPLAIN SELECT a, COUNT(*) FROM elem GROUP BY a\G
  357. *************************** 1. row ***************************
  358. id: 1
  359. select_type: SIMPLE
  360. table: elem
  361. partitions: NULL
  362. > type: index
  363. possible_keys: idx_a_b
  364. key: idx_a_b
  365. key_len: 6
  366. ref: NULL
  367. rows: 10
  368. filtered: 100.00
  369. > Extra: Using index
  370. key: idx_a_b confirms that MySQL uses the index to optimize the GROUP BY. Since the index is ordered, MySQL is assured that each new value for column a is a new group. For example, after reading the last “Ag” value, the index order assures that no more “Ag” values will be read, so the “Ag” group is complete.
  371. Using index” in the Extra field indicates that MySQL is reading column a values only from the index; it’s not reading full rows from the primary key. I cover this optimization in “Covering Indexes”.
  372. This query uses an index, but not for an index lookup: type: index denotes an index scan (see “Index scan”). And since there’s no WHERE clause to filter rows, MySQL reads all rows. If you add a WHERE clause, MySQL can still use the index for the GROUP BY, but the leftmost prefix requirement still applies. In this case, the query is using the leftmost index part (column a), so the WHERE condition must be on column a or b to meet the leftmost prefix requirement. Let’s first add a WHERE condition on column a, as shown in Figure 2-15 and Example 2-10.
  373. emsp 0215
  374. Figure 2-15. GROUP BY and WHERE on same index column
  375. Example 2-10. EXPLAIN plan for GROUP BY and WHERE on same index column
  376. EXPLAIN SELECT a, COUNT(a) FROM elem WHERE a != 'Ar' GROUP BY a\G
  377. *************************** 1. row ***************************
  378. id: 1
  379. select_type: SIMPLE
  380. table: elem
  381. partitions: NULL
  382. > type: range
  383. possible_keys: idx_a_b
  384. key: idx_a_b
  385. key_len: 3
  386. ref: NULL
  387. rows: 7
  388. filtered: 100.00
  389. > Extra: Using where; Using index
  390. Using wherein the Extra field refers to WHERE a != 'Ar'. The interesting change is type: range. The range access type works with the not-equal operator (!= or <>). You can think of it like WHERE a < 'Ar' AND a > 'Ar', as shown in Figure 2-16.
  391. A condition on column b in the WHERE clause can still use the index because the conditions, regardless of being in different SQL clauses, meet the leftmost prefix requirement. Figure 2-17 shows this, and Example 2-11 shows the EXPLAIN plan.
  392. emsp 0216
  393. Figure 2-16. Range for not-equal
  394. emsp 0217
  395. Figure 2-17. GROUP BY and WHERE on different index columns
  396. Example 2-11. EXPLAIN plan for GROUP BY and WHERE on different index columns
  397. EXPLAIN SELECT a, b FROM elem WHERE b = 'B' GROUP BY a\G
  398. *************************** 1. row ***************************
  399. id: 1
  400. select_type: SIMPLE
  401. table: elem
  402. partitions: NULL
  403. type: range
  404. possible_keys: idx_a_b
  405. key: idx_a_b
  406. key_len: 6
  407. ref: NULL
  408. rows: 6
  409. filtered: 100.00
  410. > Extra: Using where; Using index for group-by
  411. The query in Example 2-11 has two important details: an equality condition on column b in the WHERE clause, and selecting columns a and b in the SELECT clause. These details enable the special “Using index for group-by” optimization revealed in the Extra field. If, for example, the equality (=) is changed to not-equal (!=), the query optimization is lost. When it comes to query optimizations like this, details are critical. You must read the MySQL manual to learn and apply the details. “GROUP BY Optimization” in the MySQL manual elaborates.
  412. The final GROUP BY example in Figure 2-18 and Example 2-12 might surprise you.
  413. emsp 0218
  414. Figure 2-18. GROUP BY without leftmost prefix
  415. Example 2-12. EXPLAIN plan for GROUP BY without leftmost prefix
  416. EXPLAIN SELECT b, COUNT(*) FROM elem GROUP BY b\G
  417. *************************** 1. row ***************************
  418. id: 1
  419. select_type: SIMPLE
  420. table: elem
  421. partitions: NULL
  422. > type: index
  423. possible_keys: idx_a_b
  424. key: idx_a_b
  425. key_len: 6
  426. ref: NULL
  427. rows: 10
  428. filtered: 100.00
  429. > Extra: Using index; Using temporary
  430. Notice key: idx_a_b: MySQL uses the index despite the query having no condition on column a. What happened to the leftmost prefix requirement? It’s being met because MySQL is scanning the index (type: index) on column a. You can imagine a condition on column a that’s always true, like a = a.
  431. Would MySQL still index scan on column a for GROUP BY c? No, it would not; it would do a full table scan. Figure 2-18 works because the index has column b values; it does not have column c values.
  432. Using temporary” in the Extra field is a side effect of not having a strict set of leftmost prefix conditions. As MySQL reads column a values from the index, it collects column b values in a temporary table (in memory). After reading all column a values, it table scans the temporary table to group and aggregate for COUNT(*).
  433. There is a lot more to learn about GROUP BY with respect to indexes and query optimizations, but these examples are the fundamentals. Unlike WHERE clauses, GROUP BY clauses tend to be simpler. The challenge is creating an index to optimize GROUP BY plus other SQL clauses. MySQL has the same challenge when formulating the query execution plan, so it might not optimize GROUP BY even when possible. MySQL almost always chooses the best query execution plan, but if you want to experiment with different ones, read “Index Hints” in the MySQL manual.
  434. ORDER BY
  435. Unsurprisingly, MySQL can use an ordered index to optimize ORDER BY. This optimization avoids sorting rows, which takes a little more time, by accessing rows in order. Without this optimization, MySQL reads all matching rows, sorts them, then returns the sorted result set. When MySQL sorts rows, it prints “Using filesort” in the Extra field of the EXPLAIN plan. Filesort means sort rows. It’s a historical (and now misleading) term but still the prevalent term in MySQL lingo.
  436. Filesort is a consternation for engineers because it has a reputation for being slow. Sorting rows is extra work, so it does not improve response time, but it’s usually not the root cause of slow response time. At the end of this section, I use EXPLAIN ANALYZE, which is new as of MySQL 8.0.18, to measure the real-time penalty of filesort. (Spoiler: sorting rows is very fast.) But first, let’s examine how to use indexes to optimize ORDER BY.
  437. There are three ways to use an index to optimize ORDER BY. The first and simplest way is using a leftmost prefix of an index for the ORDER BY clause. For table elem, that means:
  438. ORDER BY id
  439. ORDER BY a
  440. ORDER BY a, b
  441. The second way is to hold a leftmost part of the index constant and order by the next index columns. For example, holding column a constant and ordering by column b, as shown in Figure 2-19 with corresponding EXPLAIN plan in Example 2-13.
  442. emsp 0219
  443. Figure 2-19. ORDER BY and WHERE on different index columns
  444. Example 2-13. EXPLAIN plan for ORDER BY and WHERE on different index columns
  445. EXPLAIN SELECT a, b FROM elem WHERE a = 'Ar' ORDER BY b\G
  446. *************************** 1. row ***************************
  447. id: 1
  448. select_type: SIMPLE
  449. table: elem
  450. partitions: NULL
  451. type: ref
  452. possible_keys: idx_a_b
  453. key: idx_a_b
  454. key_len: 3
  455. ref: const
  456. rows: 3
  457. filtered: 100.00
  458. Extra: Using index
  459. WHERE a = 'Ar' ORDER BY b can use index (a, b) because the WHERE condition on the first index part (column a) is constant, so MySQL jumps to a = 'Ar' in the index and, from there, reads column b values in order. Example 2-14 is the result set, and although it’s nothing fancy, it shows that column a is constant (value “Ar”) and column b is sorted.
  460. Example 2-14. Result set of WHERE a = 'Ar' ORDER BY b
  461. +------+------+
  462. | a | b |
  463. +------+------+
  464. | Ar | B |
  465. | Ar | Br |
  466. | Ar | Br |
  467. +------+------+
  468. If table elem had an index on columns a, b, c, a query like WHERE a = 'Au' AND b = 'Be' ORDER BY c could use the index because the conditions on columns a and b hold the leftmost part of the index.
  469. The third way is a special case of the second. Before showing the figure that explains it, see if you can determine why the query in Example 2-15 does not cause a filesort (why “Using filesort” is not reported in the Extra field).
  470. Example 2-15. EXPLAIN plan for ORDER BY id
  471. EXPLAIN SELECT * FROM elem WHERE a = 'Al' AND b = 'B' ORDER BY id\G
  472. *************************** 1. row ***************************
  473. id: 1
  474. select_type: SIMPLE
  475. table: elem
  476. partitions: NULL
  477. type: ref
  478. possible_keys: idx_a_b
  479. key: idx_a_b
  480. key_len: 16
  481. ref: const,const
  482. rows: 2
  483. filtered: 100.00
  484. > Extra: Using index condition
  485. It’s understandable that the query uses index idx_a_b because the WHERE conditions are a leftmost prefix, but shouldn’t ORDER BY id cause a filesort? Figure 2-20 reveals the answer.
  486. emsp 0220
  487. Figure 2-20. ORDER BY using primary key appended to secondary index
  488. “Leftmost Prefix Requirement” has a paragraph that begins with, “Lurking at the end (rightmost) of every secondary index is the primary key.” That’s what’s happening in Figure 2-20: the dark box around index column id reveals the “hidden” primary key appended to the secondary index. This ORDER BY optimization might not seem useful with a little table like elem, but with real tables it can be very useful—worth remembering.
  489. To prove that the “hidden” primary key allows the ORDER BY to avoid a filesort, let’s remove the condition on column b to invalidate the optimization, as shown in Figure 2-21 and followed by the resulting EXPLAIN plan in Example 2-16.
  490. emsp 0221
  491. Figure 2-21. ORDER BY without leftmost prefix
  492. Example 2-16. EXPLAIN plan for ORDER BY without leftmost prefix
  493. EXPLAIN SELECT * FROM elem WHERE a = 'Al' ORDER BY id\G
  494. *************************** 1. row ***************************
  495. id: 1
  496. select_type: SIMPLE
  497. table: elem
  498. partitions: NULL
  499. type: ref
  500. possible_keys: idx_a_b
  501. key: idx_a_b
  502. key_len: 8
  503. ref: const
  504. rows: 3
  505. filtered: 100.00
  506. > Extra: Using index condition; Using filesort
  507. By removing the condition on column b, there’s no longer a leftmost prefix on the secondary index that allows MySQL to use the “hidden” primary key to optimize ORDER BY. Therefore, for this particular query, “Using filesort” appears in the Extra field.
  508. The new optimization isUsing index condition,” which is called index condition pushdown. Index condition pushdown means the storage engine uses an index to matches rows for WHERE conditions. Normally, storage engines only read and write rows, and MySQL handles the logic of matching rows. This is a clean separation of concerns (which is a virtue for software design), but it’s inefficient when rows don’t match: both MySQL and the storage engine waste time reading non-matching rows. For the query in Example 2-16, index condition pushdown means the storage engine (InnoDB) uses index idx_a_b to match condition a = 'Al'. Index condition pushdown helps improve response time, but don’t exert yourself trying to optimize for it because MySQL uses it automatically when possible. To learn more, read “Index Condition Pushdown Optimization” in the MySQL manual.
  509. There’s an important detail that affects all ORDER BY optimizations: index order is ascending by default, and ORDER BY col implies ascending: ORDER BY col ASC. Optimizing ORDER BY works in only one direction for all columns: ASC (ascending) or DESC (descending). Consequently, ORDER BY a, b DESC does not work because column a is an implicit ASC sort, which is different than b DESC.
  510. NOTE
  511. MySQL 8.0 supports descending indexes.
  512. What is the real time penalty of filesort? Prior to MySQL 8.0.18, it was neither measured nor reported. But as of MySQL 8.0.18, EXPLAIN ANALYZE measures and reports it. For only Example 2-17, I must use a different table.
  513. Example 2-17. Sysbench table sbtest
  514. CREATE TABLE `sbtest1` (
  515. `id` int NOT NULL AUTO_INCREMENT,
  516. `k` int NOT NULL DEFAULT '0',
  517. `c` char(120) NOT NULL DEFAULT '',
  518. `pad` char(60) NOT NULL DEFAULT '',
  519. PRIMARY KEY (`id`),
  520. KEY `k_1` (`k`)
  521. ) ENGINE=InnoDB;
  522. That’s a standard sysbench table; I loaded it with one million rows. Let’s use a random, meaningless query with a large result set and ORDER BY:
  523. SELECT c FROM sbtest1 WHERE k < 450000 ORDER BY id;
  524. -- Output omitted
  525. 68439 rows in set (1.15 sec)
  526. The query takes 1.15 seconds to sort and return a little over 68,000 rows. But it’s not a bad query; check out its EXPLAIN plan:
  527. EXPLAIN SELECT c FROM sbtest1 WHERE k < 450000 ORDER BY id\G
  528. *************************** 1. row ***************************
  529. id: 1
  530. select_type: SIMPLE
  531. table: sbtest1
  532. partitions: NULL
  533. type: range
  534. possible_keys: k_1
  535. key: k_1
  536. key_len: 4
  537. ref: NULL
  538. rows: 133168
  539. filtered: 100.00
  540. Extra: Using index condition; Using MRR; Using filesort
  541. The only new information in that EXPLAIN plan isUsing MRR” in the Extra field, which refers to the “Multi-Range Read Optimization”. Otherwise, that EXPLAIN plan reports information already covered in this chapter.
  542. Does filesort make this query slow? EXPLAIN ANALYZE reveals the answer, albeit cryptically:
  543. EXPLAIN ANALYZE SELECT c FROM sbtest1 WHERE k < 450000 ORDER BY id\G
  544. *************************** 1. row ***************************
  545. 1 -> Sort: sbtest1.id (cost=83975.47 rows=133168)
  546. 2 (actual time=1221.170..1229.306 rows=68439 loops=1)
  547. 3 -> Index range scan on sbtest1 using k_1, with index condition: (k<450000)
  548. 4 (cost=83975.47 rows=133168) (actual time=40.916..1174.981 rows=68439)
  549. The real output of EXPLAIN ANALYZE is wider, but I wrapped and numbered the lines for print legibility and reference. EXPLAIN ANALYZE output is dense and requires practice to grok; for now, let’s go straight to the point—or as straight as possible since the output does not read sequentially. On line 4, 1174.981 (milliseconds) means the index range scan (line 3) took 1.17 seconds (rounded). On line 2, 1221.170..1229.306 means the filesort (line 1) started after 1,221 milliseconds and ended after 1,229 milliseconds, which means the filesort took 8 milliseconds. Total execution time is 1.23 seconds: 95% reading rows and less than 1% sorting rows. The remaining 4%—roughly 49 milliseconds—is spent in other stages: preparing, statistics, logging, cleaning up, and so forth.
  550. The answer is no: filesort does not make this query slow. The problem is data access: 68,439 rows is not a small result set. Sorting 68,439 values is practically zero work for a CPU that does billions of operations per second. But reading 68,439 rows is appreciable work for a relational database that must traverse indexes, manage transactions, etc. To optimize a query like this, focus on “Data Access”.
  551. One last question to address: why does filesort have a reputation for being slow? Because MySQL uses temporary files on disk when sorting data exceeds the sort_buffer_size, and hard drives are orders of magnitude slower than memory. This was especially true decades ago when spinning disks were the norm; but today, SSD is the norm, and storage in general is quite fast. Filesort might be an issue for a query at high throughput (QPS), but use EXPLAIN ANALYZE to measure and verify.
  552. WARNING
  553. EXPLAIN ANALYZE executes the query. To be safe, use EXPLAIN ANALYZE on a read-only replica, not the source.
  554. Now back to table elem (Example 2-1) and the next case for which MySQL can use an index: covering indexes.
  555. Covering Indexes
  556. A covering index includes all columns referenced in a query. Figure 2-22 shows a covering index for a SELECT statement.
  557. emsp 0222
  558. Figure 2-22. Covering indexes
  559. The WHERE conditions on columns a and b point to the corresponding index columns as usual, but these index columns also point back to the corresponding columns in the SELECT clause to signify that the values for these columns are read from the index.
  560. Normally, MySQL reads full rows from the primary key (recall “InnoDB Tables Are Indexes”). But with a covering index, MySQL can read only column values from the index. This is most helpful with secondary indexes because it avoids the primary key lookup.
  561. MySQL uses the covering index optimization automtically, and EXPLAIN reports it asUsing index” in the Extra field. “Using index for group-byis a similar optimization specific to GROUP BY and DISTINCT, as demonstrated inGROUP BY”. But “Using index conditionandUsing index for skip scan” are completely different and unrelated optimizations.
  562. An index scan (type: index) plus a covering index (Extra: Using index) is an index-only scan (see “Index scan”). There are two examples inGROUP BY”: Example 2-9 and Example 2-12.
  563. Covering indexes are glamorous but rarely practical because realistic queries have too many columns, conditions, and clauses for one index to cover. Do not spend time trying to create covering indexes. When designing or analyzing simple queries that use very few columns, take a moment to see if a covering index might work. If it does, then congratulations. If not, that’s okay; no one expects covering indexes.
  564. Join Tables
  565. MySQL uses an index to join tables, and this usage is fundamentally the same as using an index for anything else. The main difference is the source of values used in join conditions for each table. This becomes more clear when visualized, but first we need a second table to join. Example 2-18 shows the structure of table elem_names and the 14 rows that it contains.
  566. Example 2-18. Table elem_names
  567. CREATE TABLE `elem_names` (
  568. `symbol` char(2) NOT NULL,
  569. `name` varchar(16) DEFAULT NULL,
  570. PRIMARY KEY (`symbol`)
  571. ) ENGINE=InnoDB;
  572. +--------+-----------+
  573. | symbol | name |
  574. +--------+-----------+
  575. | Ag | Silver |
  576. | Al | Aluminum |
  577. | Ar | Argon |
  578. | At | Astatine |
  579. | Au | Gold |
  580. | B | Boron |
  581. | Be | Beryllium |
  582. | Bi | Bismuth |
  583. | Br | Bromine |
  584. | C | Carbon |
  585. | Cd | Cadmium |
  586. | Ce | Cerium |
  587. | Co | Cobalt |
  588. | Cr | Chromium |
  589. +--------+-----------+
  590. Table elem_name has one index: the primary key on column symbol. The values in column symbol match the values in table elem columns a, b, and c. Therefore, we can join tables elem and elem_names on these columns.
  591. Figure 2-23 shows a SELECT statement that joins tables elem and elem_names, and a visual representation of the conditions and indexes for each table.
  592. In previous figures, there’s only one index and SQL clause pair because there’s only one table. But Figure 2-23 has two pairs—one for each table—delineated by large rightward-pointing chevrons with the table name commented in each: /* elem */ and /* elem_names */. Like EXPLAIN, these figures list tables in join order: top to bottom. Table elem (at top) is the first table in the join order and table elem_names (at bottom) is the second table.
  593. emsp 0223
  594. Figure 2-23. Join table on primary key lookup
  595. Index usage on table elem is nothing new or special: MySQL uses the index for the condition a IN (…). So far, so good.
  596. Index usage on table elem_names, which is joined to the preceding table, is fundamentally the same with two minor differences. First, the WHERE clause is a rewrite of the JOINON clause—more on this later. Second, values for the condition on column symbol come from the preceding table: elem. To represent this, an arrow points from the preceding table to a column reference in angle brackets: <elem.a>. On join, MySQL looks up rows in table elem_names using column a values from matching rows in table elem for the join condition on column symbol. In MySQL vernacular we’d say, “symbol is equal to column a from table elem.” Given a value from the preceding table, the primary key lookup on column symbol is nothing new or special: if a row matches, it’s returned and joined with the row from the preceding table.
  597. Example 2-19 shows the EXPLAIN plan for the SELECT statement in Figure 2-23.
  598. Example 2-19. EXPLAIN plan for join table on primary key lookup
  599. EXPLAIN SELECT name
  600. FROM elem JOIN elem_names ON (elem.a = elem_names.symbol)
  601. WHERE a IN ('Ag', 'Au', 'At')\G
  602. *************************** 1. row ***************************
  603. id: 1
  604. select_type: SIMPLE
  605. table: elem
  606. partitions: NULL
  607. type: range
  608. possible_keys: idx_a_b
  609. key: idx_a_b
  610. key_len: 3
  611. ref: NULL
  612. rows: 4
  613. filtered: 100.00
  614. Extra: Using where; Using index
  615. *************************** 2. row ***************************
  616. id: 1
  617. select_type: SIMPLE
  618. table: elem_names
  619. partitions: NULL
  620. > type: eq_ref
  621. possible_keys: PRIMARY
  622. key: PRIMARY
  623. key_len: 2
  624. > ref: test.elem.a
  625. rows: 1
  626. filtered: 100.00
  627. Extra: NULL

Example 2-19. EXPLAIN plan for join table on primary key lookup

在每个表的基础上,例 2-19 中的 EXPLAIN PLAN 并不是什么新鲜玩意,但是连接揭示了第二个表 elem_names 中的两个新细节。 第一个是访问类型【 access type】 eq_ref:使用主键或not-null 二级索引下的单行查找(在这种情况下,not-null  表示所有二级索引列都定义为 NOT NULL)。 下面会有关于 eq_ref 访问类型的更多信息。 第二个是  ref: test.elem.a,你可以读作“引用自 elem.a”(数据库名称是 test,因此是 test 前缀)。 要连接表 elem_names,引用列 elem.a 中的值被用于按主键(key: PRIMARY)查找行,它涵盖了连接列:symbol。 这对应于 JOIN 条件:ON (elem.a = elem_names.symbol)。

TIP 在每个表的基础上,连接不会改变索引的使用方式。 主要区别在于连接条件的值来自上个表。

MySQL 可以使用任何访问方法【access method】连接表(请参阅 “Table Access Methods”),但是使用 eq_ref 访问类型【 access type】的索引查找【index lookup】是最好和最快的,因为它只匹配一行。 eq_ref 访问类型有两个要求:主键或唯一的非null二级索引以及所有索引列上的相等条件。 这些要求一起保证 eq_ref 查找最多匹配一行。 如果这两个要求都没有满足,那么 MySQL 可能会使用 ref 索引查找【index lookup】,它本质上是相同的,但匹配任意数量的行。

回到图 2-23,我怎么知道将 JOIN...ON 子句重写为表 elem_names 的 WHERE 子句? 如果你在 EXPLAIN 之后立即 SHOW WARNINGS,MySQL 会打印出它是如何重写查询的。 这是 SHOW WARNINGS 的删节输出:

  1. /* select#1 */ select
  2. `test`.`elem_names`.`name` AS `name`
  3. from
  4. `test`.`elem`
  5. join `test`.`elem_names`
  6. where
  7. ((`test`.`elem_names`.`symbol` = `test`.`elem`.`a`)
  8. and (`test`.`elem`.`a` in ('Ag','Au','At')))

现在您可以看到图 2-23 中的/* elem_names */ WHERE symbol = <elem.a> 是正确的。

有时,在 EXPLAIN 之后立即运行 SHOW WARNINGS 以查看 MySQL 如何重写查询对于了解 MySQL 选择的表连接顺序和索引是必要的。

NOTE SHOW WARNINGS 显示的重写的 SQL 语句不是有效的。 它们只是为了展示 MySQL 如何解释和重写 SQL 语句。 不要执行它们。

表连接顺序很关键,因为 MySQL 以尽可能的最佳顺序连接表,而不是在查询中写入的顺序表。 您必须使用 EXPLAIN 来查看表连接顺序。 EXPLAIN 按连接顺序从顶部(第一个表)到底部(最后一个表)打印表。 默认连接算法 nested-loop join 也遵循连接顺序。 我在本章末尾概述了连接算法:“Table Join Algorithms”

永远不要猜测或假设表连接顺序,因为对查询的微小更改可能会产生显着不同的表连接顺序或查询执行计划。 为了演示,图 2-24 中的 SELECT 语句与图 2-23 中的 SELECT 语句几乎相同,只是有一点不同——你能发现吗?

Figure 2-24. Join table on secondary index lookup

这里有一个提示:它既不是黄金也不是白银。 微小的差异会产生截然不同的查询执行计划,如示例 2-20 所示。

  1. EXPLAIN SELECT name
  2. FROM elem JOIN elem_names ON (elem.a = elem_names.symbol)
  3. WHERE a IN ('Ag', 'Au')\G
  4. *************************** 1. row ***************************
  5. id: 1
  6. select_type: SIMPLE
  7. table: elem_names
  8. partitions: NULL
  9. type: range
  10. possible_keys: PRIMARY
  11. key: PRIMARY
  12. key_len: 2
  13. ref: NULL
  14. rows: 2
  15. filtered: 100.00
  16. Extra: Using where
  17. *************************** 2. row ***************************
  18. id: 1
  19. select_type: SIMPLE
  20. table: elem
  21. partitions: NULL
  22. type: ref
  23. possible_keys: idx_a_b
  24. key: idx_a_b
  25. key_len: 3
  26. ref: test.elem_names.symbol
  27. rows: 2
  28. filtered: 100.00
  29. Extra: Using index

Example 2-20. EXPLAIN plan for join table on secondary index lookup

在语法上,图 2-23 和 2-24 中的 SELECT 语句是相同的,但执行计划(示例 2-19 和 2-20)有很大不同。 这发生了什么变化? 在图 2-24 中,从 IN() 列表中删除了一个值:“At”。 这是一个很好的例子,说明一个看似无害的更改如何在 MySQL 查询执行计划程序中触发某些东西,瞧:一个全新的和不同的 EXPLAIN PLAN。 让我们逐表检查示例 2-20。

第一个表是 elem_names,它与查询【query】的编写方式不同:elem JOIN elem_names。 MySQL 确定表连接顺序,而不是 JOIN 子句。2 type 和 key 字段表示对主键进行范围扫描,但值来自哪里? ref 字段为 NULL,并且该表上没有 WHERE 条件。 MySQL 必定重写了查询【quer】; 这是 SHOW WARNINGS 的删节版输出:

  1. /* select#1 */ select
  2. `test`.`elem_names`.`name` AS `name`
  3. from
  4. `test`.`elem` join `test`.`elem_names`
  5. where
  6. ((`test`.`elem`.`a` = `test`.`elem_names`.`symbol`)
  7. and (`test`.`elem_names`.`symbol` in ('Ag','Au')))

是的,它在最后一行:MySQL 重写查询以使用 IN() 列表作为 elem_names.symbols 的值,而不是最初在查询中写入的 elem.a。 现在您可以看到(或想象)表 elem_names.symbols 上的索引使用情况是查找两个值的范围扫描:“Ag”和“Au”。 使用主键,这将是一个非常快速的索引查找,并且只匹配 MySQL 将用来连接第二个表的两行。

第二张表是elem,EXPLAIN PLAN 很熟悉:使用索引idx_a_b查找与 a 列条件匹配的索引值(不是行,因为 Extra: Using index)。 该条件的值来自上个表中的匹配行,如 ref: test.elem_names.symbol 所示。

TIP MySQL 以可能的最佳顺序连接表,而不是表在查询中的写入顺序。

尽管 MySQL 可以更改连接顺序并重写查询,但连接的索引使用基本上是相同的——在每个表的基础上——正如本章前面演示和解释的所有内容。 使用 EXPLAIN 和 SHOW WARNINGS,从上到下逐表考虑执行计划。

MySQL 可以连接没有索引的表。 这称为完全连接【 full join】,它是查询所能做的最糟糕的事情。 对单表查询进行表扫描【table scan】是不好的,但完全连接【full join】更糟糕,因为对连接表的表扫描【table scan】不会发生一次,它会发生在前一个表中的每个匹配行上。 示例 2-21 显示了对第二个表的完全连接【 full join】。

  1. EXPLAIN SELECT name
  2. FROM elem STRAIGHT_JOIN elem_names IGNORE INDEX (PRIMARY)
  3. ON (elem.a = elem_names.symbol)\G
  4. *************************** 1. row ***************************
  5. id: 1
  6. select_type: SIMPLE
  7. table: elem
  8. partitions: NULL
  9. type: index
  10. possible_keys: idx_a_b
  11. key: idx_a_b
  12. key_len: 6
  13. ref: NULL
  14. rows: 10
  15. filtered: 100.00
  16. Extra: Using index
  17. *************************** 2. row ***************************
  18. id: 1
  19. select_type: SIMPLE
  20. table: elem_names
  21. partitions: NULL
  22. type: ALL
  23. possible_keys: NULL
  24. key: NULL
  25. key_len: NULL
  26. ref: NULL
  27. rows: 14
  28. filtered: 7.14
  29. Extra: Using where; Using join buffer (hash join)

Example 2-21. EXPLAIN plan for full JOIN

通常,MySQL 不会选择这个查询执行计划,这就是为什么我不得不用 STRAIGHT_JOIN 和 IGNORE INDEX (PRIMARY) 来强制使用它。 对第一个表 (elem) 进行仅索引扫描【 index-only scan 】会产生所有十行。对于每一行,MySQL 通过执行全表扫描 (type: ALL) 来连接第二个表 (elem_names) 以查找匹配的行。 由于这是一个连接表(不是连接顺序中的第一个表),因此表扫描算作完全连接【full join】。 完全连接【full join】是查询可以做的最糟糕的事情,因为它发生在前一个表中的每一行:对表 elem_names 进行十次全表扫描。 每当您在被连接的表上看到 type: ALL,请停止您正在做的所有事情并修复它。 完全连接有一个查询指标:“Select full join”

Extra 字段中的“Using join buffer (hash join)”是指哈希连接算法,它是 MySQL 8.0.18 的新算法。 我在本章末尾概述了它(和其他连接算法):“Table Join Algorithms”。 展望未来,单行解释是:hash join 构建了一个内存中的值哈希表,并使用它来查找行,而不是进行重复的表扫描。 hash join 是一个巨大的性能改进。 无论如何,避免完全连接【full join】仍然是最佳实践。

乍一看,连接表似乎是一种完全不同类型的索引使用,但事实并非如此。 连接涉及更多的表和索引,但在每个表的基础上,索引的使用和要求是相同的。 即使是最左前缀条件也是一样的。 主要区别在于,对于连接表,连接条件的值来自上表。

从 “WHERE”的第一个示例开始,我们已经阅读了很长时间。现在您已经看到了许多涵盖索引、查询和EXPLAIN PLAN 的完整上下文示例,这些示例涵盖了 MySQL 索引的技术细节和机制。这些信息是下一部分构建的直接查询优化的基础。

Indexing: How to Think Like MySQL

indexes 和 indexing  是不同的主题。 前面的章节介绍的是索引【Indexes 】:InnoDB 表上的标准 B-tree 索引,用于 WHERE、GROUP BY、ORDER BY、覆盖索引【covering indexes】和表连接【table joins】。 本节介绍  indexing :应用索引【indexes】以获得最大化杠杆。 您不能简单地为每个列增加索引,以期望实现惊人的性能。 如果这么简单,就不会有 DBA。 为了最大限度地发挥作用,您必须索引那些允许 MySQL 在执行查询时访问最少行数的列。 打个比方:最大杠杆是一个索引,它告诉 MySQL 在大海中捞针的确切位置。

根据我的经验,工程师一直挣扎于 indexing  ,因为他们将自己对查询【query】的看法与 MySQL 对查询的“看法”混为一谈。 作为工程师,我们在应用程序的上下文中考虑查询:应用程序的哪个部分执行查询、为什么(业务逻辑)以及正确的结果集。 但是 MySQL 不知道也不关心这些。 MySQL考虑的是更小、更简单的上下文:索引和表条件。实际上,MySQL要复杂得多,但它的部分魅力在于它如何将这种复杂性隐藏起来。

我们怎么知道 MySQL 是如何思考索引【indexes 】和表条件【 table conditions】? EXPLAIN。 EXPLAIN 报告的主要信息是什么? 表【Tables 】(按连接顺序)、表访问方法【table access methods】、索引【indexes】以及与使用这些索引访问这些表相关的 Extra 信息。

像 MySQL 一样思考使 indexing  变得更容易,因为它是一种确定性机器——算法和启发式【algorithms and heuristics】。 人类的思想与多余的细节纠缠在一起。 清理你的头脑,准备好像机器一样思考。 接下来的四个部分将介绍一个简单的四步过程。

Know the Query

像 MySQL 一样思考的第一步是了解您正在优化的查询【query】的基本信息。 首先为每个表收集以下元数据:

  • SHOW CREATE TABLE

  • SHOW TABLE STATUS

  • SHOW INDEXES

如果查询【query】已经在生产环境中运行,则获取其查询报告【query report】(请参阅“Query report”)并熟悉当前值。

然后回答以下问题:

Query

  • How many rows should the query access?
  • How many rows should the query return?
  • Which columns are selected (returned)?
  • What are the GROUP BYORDER BY, and LIMIT clauses (if any)?
  • Are there subqueries? (If yes, repeat the process for each.)

Table access (per-table)

  • What are the table conditions?
  • Which index should the query use?
  • What other indexes could the query use?
  • What is the cardinality of each index?
  • How large is the table—data size and row count?

这些问题可以帮助您在心里解析查询,因为这就是 MySQL 所做的:解析查询【 parse the query】。 这对于以更简单的术语(表、表条件、索引和 SQL 子句)查看复杂查询特别有帮助。

此信息可帮助您拼凑出一个谜题,一旦完成,就会显示出查询响应时间。 为了提高响应时间,您需要更改一些部分。 但在此之前,下一步是在 EXPLAIN 的帮助下组装当前的部件。

Understand with EXPLAIN

第二步,了解 EXPLAIN 的当前查询执行计划。 考虑每个表及其与索引相关的条件,从 MySQL 选择的索引开始:EXPLAIN 输出中的关键字段。 查看表条件以了解它们如何满足此索引的最左侧前缀要求。 如果 possible_keys 字段列出了其他索引,请考虑 MySQL 如何使用这些索引访问行——始终牢记最左边的前缀要求。 如果 Extra 字段有信息(通常有),请参阅 MySQL 手册中的“EXPLAIN Output”以了解其含义。

第二步,了解由 EXPLAIN 报告的当前的查询执行计划【query execution plan】。考虑每个表及其与索引相关的条件,从 MySQL 选择的索引开始:EXPLAIN 输出中的 key 字段。查看表条件,看看它们如何满足这个索引的最左前缀条件。如果possible_keys 字段列出了其他索引,请考虑MySQL 如何使用这些索引访问行——始终牢记最左前缀条件。 如果 Extra 字段有信息(通常是有的),请参阅 MySQL 手册中的“EXPLAIN Output” 以了解其含义。

查询及其响应时间是一个难题,但您拥有所有部分:执行计划【execution plan】、表条件【 table conditions】、表结构【table structures】、表大小【table sizes】、索引基数【index cardinalities】和查询指标【query metrics】。 继续连接各个部分,直到拼图完成 - 直到您可以看到查询【query】按照 MySQL 解释的方式工作。 查询执行计划【query execution plan】总是有原因的。有时 MySQL 非常聪明,使用了不明显的查询优化,通常会在 Extra 字段中提到。 如果您遇到一个 SELECT 语句,MySQL 手册中的“Optimizing SELECT Statements” 将对其进行说明。

如果您遇到困难,可以提供三个不断增加的支持级别:

  • 从 MySQL 8.0.16 开始,EXPLAIN FORMAT=TREE 在树状输出中打印更精确和描述性的查询执行计划。 这是与传统格式完全不同的输出,因此您需要学习如何解释它,但值得付出努力。
  • 使用优化器跟踪【optimizer tracing】报告极其详细的查询执行计划,包括成本【costs】、注意事项【considerations】和原因【reason】。 这是一个非常高级的功能,学习曲线很高,所以如果你时间紧迫,你可能更喜欢第三个选项。
  • 询问您的 DBA 或聘请专家。

Optimize the Query

第三步是直接查询优化【direct query optimization】:改变查询【query】、它的索引【indexes】,或者两者都改变。 这是所有乐趣发生的地方,而且还没有风险,因为这些更改是在开发或交付中进行的,而不是在生产中进行的。 确保您的开发或预发环境具有代表生产的数据,因为数据大小和分布会影响 MySQL 选择索引的方式。

起初,查询似乎无法修改,因为它获取了正确的行,因此查询【query】编写正确。 一个查询“is what it is,”,对吗? 不总是; 使用不同的方法可以达到相同的结果。 一个查询有一个结果——字面意思是一个结果集——和一个获得该结果的方法。 这两者密切相关,但又相互独立。 在考虑如何修改查询时知道这一点非常有帮助。 首先澄清查询的预期结果。 清晰的结果允许您探索编写查询的新方法,以实现相同的结果。

TIP 可以有多种方法来编写执行不同但返回相同结果的查询。

例如,前段时间我正在帮助一位工程师优化一个慢查询。 他向我提出的问题是技术性的——关于 GROUP BY 和索引的问题——但我问他,“这个查询是做什么的? 它应该返回什么?” 他说:“哦! 它返回一个分组的最大值。” 在明确了查询的预期结果后,我意识到他不需要最大分组值,他只需要最大值。 因此,查询被完全重写以使用 ORDER BY col DESC LIMIT 1 优化。

当查询非常简单时,例如 SELECT col FROM tbl WHERE id = 1,可能真的没有办法重写它。 但查询越简单,需要重写的可能性就越小。 如果一个简单的查询很慢,解决方案可能是更改索引而不是查询。 (如果索引更改不能解决问题,那么旅程将继续:间接查询优化【 indirect query optimization】,在第 3 章和第 4 章中讨论。)

添加或修改索引是访问方法【 access methods】和特定于查询的优化【query-specific optimizations】之间的权衡。 例如,您是否用 ORDER BY 优化换取范围扫描?不要纠结于权衡利弊; MySQL 会为您做到这一点。您的工作很简单:添加或更改您认为将为 MySQL 提供更大影响力的索引,然后使用 EXPLAIN 查看 MySQL 是否同意使用新索引。 重复,直到您和 MySQL 就编写【write】、索引【index】和执行【execute 】该查询【query】的最优化方式达成一致。

Deploy and Verify

最后一步是部署更改并验证它们是否可以缩短响应时间。 但首先:知道如何回滚部署——并准备好这样做——以防更改产生意外的副作用。 发生这种情况有很多原因; 两个示例是:在生产环境中运行的查询使用了索引,但未在预发中运行,或者生产数据与预发数据显着不同。 它很可能会好起来,但要做好万一的准备。

部署后,使用查询指标【 query metric】和 MySQL 服务器指标【MySQL server metrics】验证我们的更改。 如果查询优化有显着影响,MySQL 服务器指标【MySQL server metrics】会反映出来。 发生这种情况时非常棒,但如果没有发生,请不要感到惊讶或气馁,因为最重要的变化是查询响应时间——回忆一下“North Star”

等待五到十分钟(最好更长),然后检查查询配置文件【query profile 】和查询报告【query report.】中的响应时间(请参阅“Query profile”“Query report”.)。 如果响应时间得到改善,那么恭喜:您正在做并完成 MySQL 专家所做的事情; 有了这个技能,你就可以实现卓越的 MySQL 性能。 如果响应时间没有改善,请不要担心,也不要放弃:即使是 MySQL 专家也会遇到需要麻烦的查询。 重复该过程,并考虑招募另一位工程师,因为有些查询需要繁重的工作。 如果您确定无法进一步优化查询,那么就该进行第二部分的旅程了:间接查询优化【 indirect query optimization】。 第 3 章讨论对数据的更改,第 4 章讨论对应用程序的更改。

It Was a Good Index Until…

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/菜鸟追梦旅行/article/detail/139458?site
推荐阅读
相关标签
  

闽ICP备14008679号