[MySQL] The utf8mb4 Character Set and Its Collations

Author: 笔触狂放9 | 2024-07-20 22:03:42


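UTF-8 is an encoding of Unicode that can represent virtually every character in use worldwide, including Chinese characters. MySQL has supported the utf8mb4 character set since version 5.5.3; the older utf8 character set (an alias for utf8mb3) stores at most 3 bytes per character. utf8mb4 covers the full Unicode range, including characters that need 4 bytes in UTF-8, such as emoji.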

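The difference between utf8mb4 and MySQL's older utf8 (utf8mb3) character set is that utf8mb4 can also represent the characters that need 4 bytes, that is, those outside the Basic Multilingual Plane. If you need to store and process emoji or similar characters, use utf8mb4.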
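As a minimal sketch of storing a 4-byte character, the statements below create a scratch table and insert an emoji. The table name demo_messages is hypothetical, and the emoji is built from its UTF-8 bytes so the example does not depend on how your terminal encodes input. It assumes a default database is selected.

```sql
-- Make the connection use utf8mb4 so 4-byte characters survive the round trip.
SET NAMES utf8mb4;

-- demo_messages is a hypothetical table used only for this illustration.
CREATE TEMPORARY TABLE demo_messages (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  body VARCHAR(100)
) DEFAULT CHARACTER SET utf8mb4;

-- U+1F600 (grinning face) is encoded in UTF-8 as the 4 bytes F0 9F 98 80.
INSERT INTO demo_messages (body)
VALUES (CONCAT('hello ', CONVERT(UNHEX('F09F9880') USING utf8mb4)));

-- HEX(body) ends in F09F9880, showing that all 4 bytes were stored.
SELECT body, HEX(body) FROM demo_messages;
```

With a 3-byte character set on the column, the same INSERT would typically fail or lose the emoji, depending on the SQL mode.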

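Note that utf8mb4 allows up to 4 bytes per character instead of 3, so the worst-case size of columns and index keys is larger, which is worth keeping in mind for storage and performance. Characters that fit in 3 bytes or fewer take the same space in both character sets. If you never need 4-byte characters, the 3-byte utf8mb3 character set is an option, although it is deprecated as of MySQL 8.0.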
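To see how the storage cost varies by character, you can compare LENGTH (bytes) with CHAR_LENGTH (characters). This is only an illustrative query; the characters are built from their UTF-8 bytes so the result does not depend on the client encoding.

```sql
SELECT
  LENGTH(_utf8mb4'a')                                   AS ascii_bytes,  -- 1
  LENGTH(CONVERT(UNHEX('E4B8AD') USING utf8mb4))        AS cjk_bytes,    -- 3 (U+4E2D)
  LENGTH(CONVERT(UNHEX('F09F9880') USING utf8mb4))      AS emoji_bytes,  -- 4 (U+1F600)
  CHAR_LENGTH(CONVERT(UNHEX('F09F9880') USING utf8mb4)) AS emoji_chars;  -- 1
```

Only characters outside the Basic Multilingual Plane use the fourth byte; ASCII and most CJK text cost 1 and 3 bytes respectively.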

Like other character sets, utf8mb4 supports many collations.

How to query

-- Query with the built-in command
show collation where Charset='utf8mb4';
-- Query with SQL
select * from information_schema.collations where character_set_name = 'utf8mb4' order by COLLATION_NAME;
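The full list is long, so it can help to narrow it down. The following is a sketch of two filtered variants of the queries above; the PAD_ATTRIBUTE column is assumed to be available, which is the case on MySQL 8.0.

```sql
-- Only utf8mb4 collations whose name contains 0900 (the UCA 9.0.0 family).
SHOW COLLATION LIKE 'utf8mb4%0900%';

-- Only the NO PAD collations, with a few selected columns.
SELECT COLLATION_NAME, ID, PAD_ATTRIBUTE
FROM information_schema.COLLATIONS
WHERE CHARACTER_SET_NAME = 'utf8mb4'
  AND PAD_ATTRIBUTE = 'NO PAD'
ORDER BY COLLATION_NAME;
```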

Query results

# Using the built-in command
mysql> show collation where Charset='utf8mb4';
+----------------------------+---------+-----+---------+----------+---------+---------------+
| Collation                  | Charset | Id  | Default | Compiled | Sortlen | Pad_attribute |
+----------------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb4_0900_ai_ci         | utf8mb4 | 255 | Yes     | Yes      |       0 | NO PAD        |
| utf8mb4_0900_as_ci         | utf8mb4 | 305 |         | Yes      |       0 | NO PAD        |
| utf8mb4_0900_as_cs         | utf8mb4 | 278 |         | Yes      |       0 | NO PAD        |
| utf8mb4_0900_bin           | utf8mb4 | 309 |         | Yes      |       1 | NO PAD        |
| utf8mb4_bin                | utf8mb4 |  46 |         | Yes      |       1 | PAD SPACE     |
| utf8mb4_croatian_ci        | utf8mb4 | 245 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_cs_0900_ai_ci      | utf8mb4 | 266 |         | Yes      |       0 | NO PAD        |
| utf8mb4_cs_0900_as_cs      | utf8mb4 | 289 |         | Yes      |       0 | NO PAD        |
| utf8mb4_czech_ci           | utf8mb4 | 234 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_danish_ci          | utf8mb4 | 235 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_da_0900_ai_ci      | utf8mb4 | 267 |         | Yes      |       0 | NO PAD        |
| utf8mb4_da_0900_as_cs      | utf8mb4 | 290 |         | Yes      |       0 | NO PAD        |
| utf8mb4_de_pb_0900_ai_ci   | utf8mb4 | 256 |         | Yes      |       0 | NO PAD        |
| utf8mb4_de_pb_0900_as_cs   | utf8mb4 | 279 |         | Yes      |       0 | NO PAD        |
| utf8mb4_eo_0900_ai_ci      | utf8mb4 | 273 |         | Yes      |       0 | NO PAD        |
| utf8mb4_eo_0900_as_cs      | utf8mb4 | 296 |         | Yes      |       0 | NO PAD        |
| utf8mb4_esperanto_ci       | utf8mb4 | 241 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_estonian_ci        | utf8mb4 | 230 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_es_0900_ai_ci      | utf8mb4 | 263 |         | Yes      |       0 | NO PAD        |
| utf8mb4_es_0900_as_cs      | utf8mb4 | 286 |         | Yes      |       0 | NO PAD        |
| utf8mb4_es_trad_0900_ai_ci | utf8mb4 | 270 |         | Yes      |       0 | NO PAD        |
| utf8mb4_es_trad_0900_as_cs | utf8mb4 | 293 |         | Yes      |       0 | NO PAD        |
| utf8mb4_et_0900_ai_ci      | utf8mb4 | 262 |         | Yes      |       0 | NO PAD        |
| utf8mb4_et_0900_as_cs      | utf8mb4 | 285 |         | Yes      |       0 | NO PAD        |
| utf8mb4_general_ci         | utf8mb4 |  45 |         | Yes      |       1 | PAD SPACE     |
| utf8mb4_german2_ci         | utf8mb4 | 244 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_hr_0900_ai_ci      | utf8mb4 | 275 |         | Yes      |       0 | NO PAD        |
| utf8mb4_hr_0900_as_cs      | utf8mb4 | 298 |         | Yes      |       0 | NO PAD        |
| utf8mb4_hungarian_ci       | utf8mb4 | 242 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_hu_0900_ai_ci      | utf8mb4 | 274 |         | Yes      |       0 | NO PAD        |
| utf8mb4_hu_0900_as_cs      | utf8mb4 | 297 |         | Yes      |       0 | NO PAD        |
| utf8mb4_icelandic_ci       | utf8mb4 | 225 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_is_0900_ai_ci      | utf8mb4 | 257 |         | Yes      |       0 | NO PAD        |
| utf8mb4_is_0900_as_cs      | utf8mb4 | 280 |         | Yes      |       0 | NO PAD        |
| utf8mb4_ja_0900_as_cs      | utf8mb4 | 303 |         | Yes      |       0 | NO PAD        |
| utf8mb4_ja_0900_as_cs_ks   | utf8mb4 | 304 |         | Yes      |      24 | NO PAD        |
| utf8mb4_latvian_ci         | utf8mb4 | 226 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_la_0900_ai_ci      | utf8mb4 | 271 |         | Yes      |       0 | NO PAD        |
| utf8mb4_la_0900_as_cs      | utf8mb4 | 294 |         | Yes      |       0 | NO PAD        |
| utf8mb4_lithuanian_ci      | utf8mb4 | 236 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_lt_0900_ai_ci      | utf8mb4 | 268 |         | Yes      |       0 | NO PAD        |
| utf8mb4_lt_0900_as_cs      | utf8mb4 | 291 |         | Yes      |       0 | NO PAD        |
| utf8mb4_lv_0900_ai_ci      | utf8mb4 | 258 |         | Yes      |       0 | NO PAD        |
| utf8mb4_lv_0900_as_cs      | utf8mb4 | 281 |         | Yes      |       0 | NO PAD        |
| utf8mb4_persian_ci         | utf8mb4 | 240 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_pl_0900_ai_ci      | utf8mb4 | 261 |         | Yes      |       0 | NO PAD        |
| utf8mb4_pl_0900_as_cs      | utf8mb4 | 284 |         | Yes      |       0 | NO PAD        |
| utf8mb4_polish_ci          | utf8mb4 | 229 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_romanian_ci        | utf8mb4 | 227 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_roman_ci           | utf8mb4 | 239 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_ro_0900_ai_ci      | utf8mb4 | 259 |         | Yes      |       0 | NO PAD        |
| utf8mb4_ro_0900_as_cs      | utf8mb4 | 282 |         | Yes      |       0 | NO PAD        |
| utf8mb4_ru_0900_ai_ci      | utf8mb4 | 306 |         | Yes      |       0 | NO PAD        |
| utf8mb4_ru_0900_as_cs      | utf8mb4 | 307 |         | Yes      |       0 | NO PAD        |
| utf8mb4_sinhala_ci         | utf8mb4 | 243 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_sk_0900_ai_ci      | utf8mb4 | 269 |         | Yes      |       0 | NO PAD        |
| utf8mb4_sk_0900_as_cs      | utf8mb4 | 292 |         | Yes      |       0 | NO PAD        |
| utf8mb4_slovak_ci          | utf8mb4 | 237 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_slovenian_ci       | utf8mb4 | 228 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_sl_0900_ai_ci      | utf8mb4 | 260 |         | Yes      |       0 | NO PAD        |
| utf8mb4_sl_0900_as_cs      | utf8mb4 | 283 |         | Yes      |       0 | NO PAD        |
| utf8mb4_spanish2_ci        | utf8mb4 | 238 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_spanish_ci         | utf8mb4 | 231 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_sv_0900_ai_ci      | utf8mb4 | 264 |         | Yes      |       0 | NO PAD        |
| utf8mb4_sv_0900_as_cs      | utf8mb4 | 287 |         | Yes      |       0 | NO PAD        |
| utf8mb4_swedish_ci         | utf8mb4 | 232 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_tr_0900_ai_ci      | utf8mb4 | 265 |         | Yes      |       0 | NO PAD        |
| utf8mb4_tr_0900_as_cs      | utf8mb4 | 288 |         | Yes      |       0 | NO PAD        |
| utf8mb4_turkish_ci         | utf8mb4 | 233 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_unicode_520_ci     | utf8mb4 | 246 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_unicode_ci         | utf8mb4 | 224 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_vietnamese_ci      | utf8mb4 | 247 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_vi_0900_ai_ci      | utf8mb4 | 277 |         | Yes      |       0 | NO PAD        |
| utf8mb4_vi_0900_as_cs      | utf8mb4 | 300 |         | Yes      |       0 | NO PAD        |
| utf8mb4_zh_0900_as_cs      | utf8mb4 | 308 |         | Yes      |       0 | NO PAD        |
+----------------------------+---------+-----+---------+----------+---------+---------------+
75 rows in set (0.03 sec)

# Using SQL
mysql> select * from information_schema.collations where character_set_name = "utf8mb4" order by COLLATION_NAME;
+----------------------------+--------------------+-----+------------+-------------+---------+---------------+
| COLLATION_NAME             | CHARACTER_SET_NAME | ID  | IS_DEFAULT | IS_COMPILED | SORTLEN | PAD_ATTRIBUTE |
+----------------------------+--------------------+-----+------------+-------------+---------+---------------+
| utf8mb4_0900_ai_ci         | utf8mb4            | 255 | Yes        | Yes         |       0 | NO PAD        |
| utf8mb4_0900_as_ci         | utf8mb4            | 305 |            | Yes         |       0 | NO PAD        |
| utf8mb4_0900_as_cs         | utf8mb4            | 278 |            | Yes         |       0 | NO PAD        |
| utf8mb4_0900_bin           | utf8mb4            | 309 |            | Yes         |       1 | NO PAD        |
...
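The Pad_attribute column in the output affects how trailing spaces are compared. As a small sketch, the query below compares the same two strings under a PAD SPACE collation and a NO PAD collation (it needs MySQL 8.0, where utf8mb4_0900_bin exists).

```sql
SELECT
  -- PAD SPACE: the trailing space is ignored, so the strings compare equal.
  _utf8mb4'a ' COLLATE utf8mb4_bin      = _utf8mb4'a' AS pad_space_equal,  -- 1
  -- NO PAD: the trailing space is significant, so the strings differ.
  _utf8mb4'a ' COLLATE utf8mb4_0900_bin = _utf8mb4'a' AS no_pad_equal;     -- 0
```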

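Meaning of each collation

Collation	Description
utf8mb4_0900_ai_ci	Based on the Unicode Collation Algorithm (UCA) 9.0.0; accent-insensitive, case-insensitive; general-purpose multilingual collation.
utf8mb4_0900_as_ci	Based on UCA 9.0.0; accent-sensitive, case-insensitive; general-purpose multilingual collation.
utf8mb4_0900_as_cs	Based on UCA 9.0.0; accent-sensitive, case-sensitive; general-purpose multilingual collation.
utf8mb4_0900_bin	Strict binary comparison and sorting, so case and accents are distinguished; NO PAD.
utf8mb4_bin	Strict binary comparison and sorting, so case and accents are distinguished; PAD SPACE.
utf8mb4_croatian_ci	Case-insensitive; Croatian.
utf8mb4_cs_0900_ai_ci	Based on UCA 9.0.0; Czech; accent-insensitive, case-insensitive.
utf8mb4_cs_0900_as_cs	Based on UCA 9.0.0; Czech; accent-sensitive, case-sensitive.
utf8mb4_czech_ci	Case-insensitive; Czech.
utf8mb4_danish_ci	Case-insensitive; Danish.
utf8mb4_da_0900_ai_ci	Based on UCA 9.0.0; Danish; accent-insensitive, case-insensitive.
utf8mb4_da_0900_as_cs	Based on UCA 9.0.0; Danish; accent-sensitive, case-sensitive.
utf8mb4_de_pb_0900_ai_ci	Based on UCA 9.0.0; German phone book order; accent-insensitive, case-insensitive.
utf8mb4_de_pb_0900_as_cs	Based on UCA 9.0.0; German phone book order; accent-sensitive, case-sensitive.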
utf8mb4_eo_0900_ai_ci	Based on UCA 9.0.0; Esperanto; accent-insensitive, case-insensitive.
utf8mb4_eo_0900_as_cs	Based on UCA 9.0.0; Esperanto; accent-sensitive, case-sensitive.
utf8mb4_esperanto_ci	Case-insensitive; Esperanto.
utf8mb4_estonian_ci	Case-insensitive; Estonian.
utf8mb4_es_0900_ai_ci	Based on UCA 9.0.0; modern Spanish; accent-insensitive, case-insensitive.
utf8mb4_es_0900_as_cs	Based on UCA 9.0.0; modern Spanish; accent-sensitive, case-sensitive.
utf8mb4_es_trad_0900_ai_ci	Based on UCA 9.0.0; traditional Spanish; accent-insensitive, case-insensitive.
utf8mb4_es_trad_0900_as_cs	Based on UCA 9.0.0; traditional Spanish; accent-sensitive, case-sensitive.
utf8mb4_et_0900_ai_ci	Based on UCA 9.0.0; Estonian; accent-insensitive, case-insensitive.
utf8mb4_et_0900_as_cs	Based on UCA 9.0.0; Estonian; accent-sensitive, case-sensitive.
utf8mb4_general_ci	Legacy general-purpose collation; case-insensitive; faster but less accurate than the UCA-based collations.
utf8mb4_german2_ci	Case-insensitive; German phone book order.
utf8mb4_hr_0900_ai_ci	Based on UCA 9.0.0; Croatian; accent-insensitive, case-insensitive.
utf8mb4_hr_0900_as_cs	Based on UCA 9.0.0; Croatian; accent-sensitive, case-sensitive.
utf8mb4_hungarian_ci	Case-insensitive; Hungarian.
utf8mb4_hu_0900_ai_ci	Based on UCA 9.0.0; Hungarian; accent-insensitive, case-insensitive.
utf8mb4_hu_0900_as_cs	Based on UCA 9.0.0; Hungarian; accent-sensitive, case-sensitive.
utf8mb4_icelandic_ci	Case-insensitive; Icelandic.
utf8mb4_is_0900_ai_ci	Based on UCA 9.0.0; Icelandic; accent-insensitive, case-insensitive.
utf8mb4_is_0900_as_cs	Based on UCA 9.0.0; Icelandic; accent-sensitive, case-sensitive.
utf8mb4_ja_0900_as_cs	Based on UCA 9.0.0; Japanese; accent-sensitive, case-sensitive.
utf8mb4_ja_0900_as_cs_ks	Based on UCA 9.0.0; Japanese; accent-sensitive, case-sensitive, and kana-sensitive (distinguishes Hiragana from Katakana).
utf8mb4_latvian_ci	Case-insensitive; Latvian.
utf8mb4_la_0900_ai_ci	Based on UCA 9.0.0; Classical Latin; accent-insensitive, case-insensitive.
utf8mb4_la_0900_as_cs	Based on UCA 9.0.0; Classical Latin; accent-sensitive, case-sensitive.
utf8mb4_lithuanian_ci	Case-insensitive; Lithuanian.
utf8mb4_lt_0900_ai_ci	Based on UCA 9.0.0; Lithuanian; accent-insensitive, case-insensitive.
utf8mb4_lt_0900_as_cs	Based on UCA 9.0.0; Lithuanian; accent-sensitive, case-sensitive.
utf8mb4_lv_0900_ai_ci	Based on UCA 9.0.0; Latvian; accent-insensitive, case-insensitive.
utf8mb4_lv_0900_as_cs	Based on UCA 9.0.0; Latvian; accent-sensitive, case-sensitive.
utf8mb4_persian_ci	Case-insensitive; Persian.
utf8mb4_pl_0900_ai_ci	Based on UCA 9.0.0; Polish; accent-insensitive, case-insensitive.
utf8mb4_pl_0900_as_cs	Based on UCA 9.0.0; Polish; accent-sensitive, case-sensitive.
utf8mb4_polish_ci	Case-insensitive; Polish.
utf8mb4_romanian_ci	Case-insensitive; Romanian.
utf8mb4_roman_ci	Case-insensitive; Classical Latin.
utf8mb4_ro_0900_ai_ci	Based on UCA 9.0.0; Romanian; accent-insensitive, case-insensitive.
utf8mb4_ro_0900_as_cs	Based on UCA 9.0.0; Romanian; accent-sensitive, case-sensitive.
utf8mb4_ru_0900_ai_ci	Based on UCA 9.0.0; Russian; accent-insensitive, case-insensitive.
utf8mb4_ru_0900_as_cs	Based on UCA 9.0.0; Russian; accent-sensitive, case-sensitive.
utf8mb4_sinhala_ci	Case-insensitive; Sinhala.
utf8mb4_sk_0900_ai_ci	Based on UCA 9.0.0; Slovak; accent-insensitive, case-insensitive.
utf8mb4_sk_0900_as_cs	Based on UCA 9.0.0; Slovak; accent-sensitive, case-sensitive.
utf8mb4_slovak_ci	Case-insensitive; Slovak.
utf8mb4_slovenian_ci	Case-insensitive; Slovenian.
utf8mb4_sl_0900_ai_ci	Based on UCA 9.0.0; Slovenian; accent-insensitive, case-insensitive.
utf8mb4_sl_0900_as_cs	Based on UCA 9.0.0; Slovenian; accent-sensitive, case-sensitive.
utf8mb4_spanish2_ci	Case-insensitive; traditional Spanish.
utf8mb4_spanish_ci	Case-insensitive; modern Spanish.
utf8mb4_sv_0900_ai_ci	Based on UCA 9.0.0; Swedish; accent-insensitive, case-insensitive.
utf8mb4_sv_0900_as_cs	Based on UCA 9.0.0; Swedish; accent-sensitive, case-sensitive.
utf8mb4_swedish_ci	Case-insensitive; Swedish.
utf8mb4_tr_0900_ai_ci	Based on UCA 9.0.0; Turkish; accent-insensitive, case-insensitive.
utf8mb4_tr_0900_as_cs	Based on UCA 9.0.0; Turkish; accent-sensitive, case-sensitive.
utf8mb4_turkish_ci	Case-insensitive; Turkish.
utf8mb4_unicode_520_ci	Case-insensitive; general-purpose collation based on UCA 5.2.0.
utf8mb4_unicode_ci	Case-insensitive; general-purpose collation based on UCA 4.0.0.
utf8mb4_vietnamese_ci	Case-insensitive; Vietnamese.
utf8mb4_vi_0900_ai_ci	Based on UCA 9.0.0; Vietnamese; accent-insensitive, case-insensitive.
utf8mb4_vi_0900_as_cs	Based on UCA 9.0.0; Vietnamese; accent-sensitive, case-sensitive.
utf8mb4_zh_0900_as_cs	Based on UCA 9.0.0; Chinese; accent-sensitive, case-sensitive.
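The accent and case behaviour listed in the table can largely be read off the name suffixes. As a rough sketch, the query below groups the utf8mb4 collations by suffix; the labels in the CASE expression are my own wording, not anything MySQL reports, and the backslash escapes assume the default SQL mode (NO_BACKSLASH_ESCAPES disabled).

```sql
SELECT
  CASE
    WHEN COLLATION_NAME LIKE '%\_bin'     THEN 'binary'
    WHEN COLLATION_NAME LIKE '%\_as\_cs%' THEN 'accent-sensitive, case-sensitive'
    WHEN COLLATION_NAME LIKE '%\_as\_ci'  THEN 'accent-sensitive, case-insensitive'
    WHEN COLLATION_NAME LIKE '%\_ai\_ci'  THEN 'accent-insensitive, case-insensitive'
    ELSE 'older _ci collation (no accent marker in the name)'
  END      AS sensitivity,
  COUNT(*) AS collation_count
FROM information_schema.COLLATIONS
WHERE CHARACTER_SET_NAME = 'utf8mb4'
GROUP BY sensitivity
ORDER BY collation_count DESC;
```

The `%\_as\_cs%` pattern has a trailing wildcard so that utf8mb4_ja_0900_as_cs_ks lands in the same group as the other _as_cs collations.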

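Commonly used collations

Collation	Description
utf8mb4_0900_ai_ci	Based on the Unicode Collation Algorithm (UCA) 9.0.0; accent-insensitive, case-insensitive; general-purpose multilingual collation.
utf8mb4_0900_as_ci	Based on UCA 9.0.0; accent-sensitive, case-insensitive; general-purpose multilingual collation.
utf8mb4_0900_as_cs	Based on UCA 9.0.0; accent-sensitive, case-sensitive; general-purpose multilingual collation.
utf8mb4_0900_bin	Strict binary comparison and sorting, so case and accents are distinguished; NO PAD.
utf8mb4_bin	Strict binary comparison and sorting, so case and accents are distinguished; PAD SPACE.
utf8mb4_general_ci	Legacy general-purpose collation; case-insensitive; faster but less accurate than the UCA-based collations.
utf8mb4_unicode_520_ci	Case-insensitive; general-purpose collation based on UCA 5.2.0.
utf8mb4_unicode_ci	Case-insensitive; general-purpose collation based on UCA 4.0.0.
utf8mb4_zh_0900_as_cs	Based on UCA 9.0.0; Chinese; accent-sensitive, case-sensitive.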
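To make the differences between these collations concrete, the sketch below compares the same pairs of strings under several of them. The user variable @e_acute is just a helper for this example; it holds the letter e with an acute accent (U+00E9), built from its UTF-8 bytes. The 0900 collations require MySQL 8.0.

```sql
-- Case: 'a' versus 'A'
SELECT
  _utf8mb4'a' COLLATE utf8mb4_0900_ai_ci = _utf8mb4'A' AS ai_ci,  -- 1
  _utf8mb4'a' COLLATE utf8mb4_0900_as_ci = _utf8mb4'A' AS as_ci,  -- 1
  _utf8mb4'a' COLLATE utf8mb4_0900_as_cs = _utf8mb4'A' AS as_cs,  -- 0
  _utf8mb4'a' COLLATE utf8mb4_bin        = _utf8mb4'A' AS bin;    -- 0

-- Accent: 'e' versus U+00E9 (bytes C3 A9 in UTF-8)
SET @e_acute = CONVERT(UNHEX('C3A9') USING utf8mb4);
SELECT
  _utf8mb4'e' COLLATE utf8mb4_0900_ai_ci = @e_acute AS ai_ci,  -- 1
  _utf8mb4'e' COLLATE utf8mb4_0900_as_ci = @e_acute AS as_ci,  -- 0
  _utf8mb4'e' COLLATE utf8mb4_0900_as_cs = @e_acute AS as_cs;  -- 0
```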

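As the query output above shows (Default = Yes), utf8mb4_0900_ai_ci is the default collation of the utf8mb4 character set. In everyday application development, however, you may prefer utf8mb4_0900_as_cs or utf8mb4_bin when comparisons need to be case-sensitive.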
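One way to apply such a choice is per column, leaving the table default alone. The sketch below uses a hypothetical demo_users table; the column names and sample values are made up for illustration, and it assumes MySQL 8.0 with a default database selected.

```sql
-- demo_users is a hypothetical table used only for this illustration.
CREATE TEMPORARY TABLE demo_users (
  id       INT AUTO_INCREMENT PRIMARY KEY,
  -- Case- and accent-sensitive, but still linguistically ordered.
  username VARCHAR(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_cs,
  -- Byte-for-byte comparison, a common choice for tokens and hashes.
  token    VARCHAR(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin
) DEFAULT CHARSET = utf8mb4 COLLATE = utf8mb4_0900_ai_ci;

INSERT INTO demo_users (username, token)
VALUES ('Alice', 'AbC'), ('alice', 'abc');

-- The column collation applies: only the exact-case row matches.
SELECT COUNT(*) FROM demo_users WHERE username = 'alice';  -- 1

-- A per-query COLLATE override makes the same lookup case-insensitive.
SELECT COUNT(*) FROM demo_users
WHERE username COLLATE utf8mb4_0900_ai_ci = 'alice';       -- 2
```

Note that a COLLATE override in the WHERE clause generally prevents an index on the column from being used for that comparison, so it is better suited to occasional queries than to hot paths.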

Appendix

The default collation of the utf8mb4 character set is utf8mb4_0900_ai_ci.
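The default depends on the server version, so it is worth confirming on your own instance. Any of the following statements should show it; they are alternatives, not steps.

```sql
-- The "Default collation" column shows the default for the character set.
SHOW CHARACTER SET LIKE 'utf8mb4';

-- The same information from information_schema.
SELECT DEFAULT_COLLATE_NAME
FROM information_schema.CHARACTER_SETS
WHERE CHARACTER_SET_NAME = 'utf8mb4';

-- Default is a reserved word, so it must be quoted with backticks here.
SHOW COLLATION WHERE Charset = 'utf8mb4' AND `Default` = 'Yes';
```

On MySQL 8.0 these report utf8mb4_0900_ai_ci, matching the output earlier in this article; on MySQL 5.7 the default is utf8mb4_general_ci instead.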

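Collation naming convention:

charset_[version|language][_accent][_case][_bin]
Character set: utf8mb4
Version: optional; identifies the Unicode Collation Algorithm version the collation is based on. Possible values are 0900, unicode, and unicode_520.
Language: the collation is tailored for a specific language; for example, _zh is for Chinese and _da is for Danish. A language code can appear together with a version, as in utf8mb4_zh_0900_as_cs.
Accent: ai means accent-insensitive, as means accent-sensitive.
Case: ci means case-insensitive, cs means case-sensitive.
bin: strings are compared as binary. Each character is treated as a sequence of bytes, and only that byte representation is compared, with no regard to the character's meaning, language, case, or accents.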
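The effect of binary comparison is easiest to see when sorting. In this small sketch, the derived table t and its column s are just scaffolding for the example.

```sql
-- The byte values that a binary collation compares.
SELECT HEX(_utf8mb4'A') AS upper_a, HEX(_utf8mb4'a') AS lower_a;  -- 41, 61

-- Binary order follows those byte values, so all uppercase letters sort
-- before all lowercase ones: A, B, a, b.
SELECT s
FROM (
  SELECT _utf8mb4'b' AS s
  UNION ALL SELECT _utf8mb4'A'
  UNION ALL SELECT _utf8mb4'a'
  UNION ALL SELECT _utf8mb4'B'
) AS t
ORDER BY s COLLATE utf8mb4_bin;
```

Under a case-insensitive collation such as utf8mb4_0900_ai_ci, the same rows would instead be grouped as a/A before b/B, with the order inside each group not guaranteed.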
