赞
踩
对于MySQL的主从复制,首先我们需要知道以下几点:
1)主库开启log-bin,binlog会记录主库所有变更操作,该日志是主从复制的核心日志。
2)主/从库server-id不一致,server-id是数据库的唯一标识,若主从数据库server-id一致将无法创建主从复制关系
1)当两个数据库构建成为主从复制关系时,从库会启动IO线程和SQL线程;主库启动相关的dump线程。
2)主库发生的所有变更操作都会记录到binlog日志中
3)当主库发生变更时,主库的dump线程会通知从库的IO线程,IO线程根据具体的binlog文件以及位点信息将对应的日志写入到从库的relay log中;
4)从库的SQL线程解析应用relay log日志进行回放,以保证从库数据与主库一致
mysql> show slave status\G
重点关注参数:
Master_Log_File: IO线程读取到的binlog file文件
Read_Master_Log_Pos: IO线程读取到的binlog file文件的位点信息
Relay_Log_File: SQL线程当前正在应用的relay log文件
Relay_Log_Pos: SQL线程当前应用relay log文件的位点信息
Relay_Master_Log_File: SQL线程当前应用记录对应的主库binlog file文件
Exec_Master_Log_Pos: SQL线程当前应用记录对应主库binlog file文件的位点信息
Slave_IO_Running: IO线程状态
Slave_SQL_Running: SQL线程状态
Last_SQL_Errno: 主从复制中断报错编码
Last_SQL_Error: 主从复制报错具体信息
Seconds_Behind_Master: 一定程度上反应了主从复制延迟
Master_UUID: 主库的uuid
Retrieved_Gtid_Set: 当前IO线程读取到的gtid集
Executed_Gtid_Set: 当前SQL线程应用到的gtid集
mysql> show variables like 'log_error';
+---------------+------------------------------+
| Variable_name | Value |
+---------------+------------------------------+
| log_error | /data/mysql57/logs/error.log | //在数据库中查看error log路径
+---------------+------------------------------+
1 row in set (0.00 sec)
# dmesg -T
1、具体报错
root@mysql57 17:17: [(none)]> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Log_File: binlog.000006 Read_Master_Log_Pos: 230 Relay_Log_File: relaylog.000002 Relay_Log_Pos: 2024 Relay_Master_Log_File: binlog.000005 Slave_IO_Running: Yes Slave_SQL_Running: No Exec_Master_Log_Pos: 1905 Relay_Log_Space: 3518 Last_SQL_Errno: 1032 Last_SQL_Error: Could not execute Update_rows event on table db1.t1; Unknown error 1032, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000005, end_log_pos 2126 Master_Server_Id: 33061 Master_UUID: 1d24c492-83eb-11ea-86cd-000c2913f5b2 Master_Info_File: mysql.slave_master_info Retrieved_Gtid_Set: 1d24c492-83eb-11ea-86cd-000c2913f5b2:5-12 Executed_Gtid_Set: 1d24c492-83eb-11ea-86cd-000c2913f5b2:1-11, 346cd00d-8ea9-41b7-8fea-67a6a483470f:1-24, 66c8918d-83eb-11ea-9d7d-000c29242ae2:1-2 Auto_Position: 1 1 row in set (0.00 sec)
2、问题排查
从Last_SQL_Errno、Last_SQL_Error中我们可以看到,导致主从复制中断是因为无法执行一个UPDATE操作,所以我们首先需要根据主从复制以及报错信息的BINLOG FILE以及POSITION位点信息,找到对应的UPDATE具体语句,并且在数据库中排查具体什么原因导致该语句执行报错。
1)定位问题SQL
根据主从复制报错提供的信息,我们有两种方法去定位具体的慢SQL。方法一是根据Relay_Master_Log_File、Exec_Master_Log_Pos、Last_SQL_Error提供出来的信息,在master的日志目录下通过位点找到对应SQL;方法二是根据Relay_Log_File、Relay_Log_Pos在slave的日志目录下通过位点找到对应SQL。
# 通master寻找问题SQL
# /usr/local/mysql57/bin/mysqlbinlog -vv binlog.000005 --start-position=1905 --stop-position=2126
### UPDATE `db1`.`t1`
### WHERE
### @1=1 /* INT meta=0 nullable=0 is_null=0 */
### @2=NULL /* VARSTRING(80) meta=80 nullable=1 is_null=1 */
### @3=NULL /* INT meta=0 nullable=1 is_null=1 */
### SET
### @1=1 /* INT meta=0 nullable=0 is_null=0 */
### @2='cc' /* VARSTRING(80) meta=80 nullable=1 is_null=0 */
### @3=23 /* INT meta=0 nullable=1 is_null=0 */
2)在复制报错的slave端查看具体表错的表数据,可以发现该update语句更新的记录在slave中根本不存在,从而导致update操作报错。slave若出现该类型相关的报错,一般都是由于主从数据不一致,导致主库binlog传输到从库,从库应用时无法匹配到对应记录而报错。
root@mysql57 23:00: [db1]> select * from `db1`.`t1`;
+----+------+------+
| id | name | age |
+----+------+------+
| 2 | aa | 12 |
| 3 | bb | 14 |
| 4 | aa | 12 |
+----+------+------+
3 rows in set (0.00 sec)
3、问题处理
对于主从数据不一致而产生的复制中断,我们可以根据具体报错场景判断是否可以直接跳过报错还是跳过报错后仍然需要手动订正相关记录。本次复制报错中我们需要做的就是跳过报错,手动订正该记录,保证主从数据一致性。
mysql> stop slave;
mysql> set gtid_next = '1d24c492-83eb-11ea-86cd-000c2913f5b2:12';
mysql> begin;commit;
mysql> set gtid_next = 'AUTOMATIC';
# 重新开启主从同步并检查复制状态
mysql> start slave;
mysql> show slave status\G
# 订正数据
mysql> set sql_log_bin=0;
mysql> insert into t1 values(1,'cc',23);
mysql> set sql_log_bin=1;
1、具体报错
mysql> show slave status \G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Log_File: mysql-bin.000055 Read_Master_Log_Pos: 541873929 Relay_Log_File: relaylog.000177 Relay_Log_Pos: 719899849 Relay_Master_Log_File: mysql-bin.000053 Slave_IO_Running: Yes Slave_SQL_Running: No Exec_Master_Log_Pos: 719899639 Relay_Log_Space: 45383924486 Last_SQL_Errno: 1396 Last_SQL_Error: Error 'Operation CREATE USER failed for 'brc_acrm'@'%'' on query. Default database: ''. Query: 'CREATE USER 'brc_acrm'@'%' IDENTIFIED BY PASSWORD '*A41CBDC67109B86378A51F6FC37A24333BAEA9E7'' Master_UUID: 42a444a1-c303-11e8-952b-005056901fca Master_Info_File: mysql.slave_master_info Retrieved_Gtid_Set: 42a444a1-c303-11e8-952b-005056901fca:1-177094 Executed_Gtid_Set: 42a444a1-c303-11e8-952b-005056901fca:1-172612, d8ea4112-c303-11e8-952f-00505690f262:1-12 Auto_Position: 1 1 row in set (0.00 sec)
2、问题排查
从Last_SQL_Errno以及Last_SQL_Error中可以看到具体执行报错操作是一个创建用户的相关动作,根据该报错我们推断去查看mysql.user表中是否可以排查到一些有用的线索。
从以下SQL结果中我们可以看到,主库binlog传输过来一个创建brc_acrm@%用户的操作,但是从库此时已经存在brc_acrm@%用户的信息,所以导致该create user操作执行报错,主从复制中断原因基本定位
mysql> select user,host from mysql.user where user='brc_acrm';
+----------+-----------+
| user | host |
+----------+-----------+
| brc_acrm | % |
+----------+-----------+
1 rows in set (0.00 sec)
3、问题处理
1)对比主从数据库下该用户的授权情况,从授权来看目前主从库是完全一致的,猜测可能是开发同学通过超级账号手动在主/从库创建用户导致。
主: mysql> show grants for 'brc_acrm'@'%'; +---------------------------------------------------------------------------------------------------------+ | Grants for brc_acrm@% | +---------------------------------------------------------------------------------------------------------+ | GRANT USAGE ON *.* TO 'brc_acrm'@'%' IDENTIFIED BY PASSWORD '*A41CBDC67109B86378A51F6FC37A24333BAEA9E7' | +---------------------------------------------------------------------------------------------------------+ 1 row in set (0.04 sec) 从: mysql> show grants for 'brc_acrm'@'%'; +---------------------------------------------------------------------------------------------------------+ | Grants for brc_acrm@% | +---------------------------------------------------------------------------------------------------------+ | GRANT USAGE ON *.* TO 'brc_acrm'@'%' IDENTIFIED BY PASSWORD '*A41CBDC67109B86378A51F6FC37A24333BAEA9E7' | +---------------------------------------------------------------------------------------------------------+ 1 row in set (0.00 sec)
2)针对以上情况,我们可以选择跳过当前GTID事务来恢复主从复制,具体操作如下:
# 关闭主从同步
mysql> stop slave;
# 跳过错误
mysql> set gtid_next = '42a444a1-c303-11e8-952b-005056901fca:172613';
mysql> begin;commit;
mysql> set gtid_next = 'AUTOMATIC';
# 重新开启主从同步并检查复制状态
mysql> start slave;
mysql> show slave status\G
1、具体报错
mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Log_File: mysql-bin.000068 Read_Master_Log_Pos: 230 Relay_Log_File: relay-log.000178 Relay_Log_Pos: 16477419 Relay_Master_Log_File: mysql-bin.000064 Slave_IO_Running: Yes Slave_SQL_Running: No Exec_Master_Log_Pos: 240368941 Relay_Log_Space: 18473673 Last_SQL_Errno: 1418 Last_SQL_Error: Error 'This function has none of DETERMINISTIC, NO SQL, or READS SQL DATA in its declaration and binary logging is enabled (you *might* want to use the less safe log_bin_trust_function_creators variable)' on query. Default database: 'electric'. Query: 'CREATE DEFINER=`ydan-user`@`%` FUNCTION `GetAccess`(id varchar(50)) RETURNS varchar(50) CHARSET utf8 BEGIN declare num int; declare result varchar(10); select count(*) into num from ou_userrole where userid=id and ROLEID like 'leader%' ; if num>0 then set result='access'; else set result ='no'; end if; return result; END' Master_Server_Id: 85857738 Master_UUID: d3c62cfa-7e24-11ea-8e2e-faf8ce492e00 Master_Info_File: mysql.slave_master_info Retrieved_Gtid_Set: d3c62cfa-7e24-11ea-8e2e-faf8ce492e00:1-1982 Executed_Gtid_Set: d3bf96b2-7e24-11ea-8fb1-fa38f0896800:1-10708449, d3c62cfa-7e24-11ea-8e2e-faf8ce492e00:1-688 Auto_Position: 1 Replicate_Rewrite_DB: Channel_Name: Master_TLS_Version: 1 row in set (0.00 sec)
2、问题排查
从Last_SQL_Errno、Last_SQL_Error中可以看到,本次主从同步中断具体报错的是一个函数创建的报错。由于master与slave都是开启了log-bin参数的,在该情况下master、slave中的函数对数据一致性存在一定的隐患,所以在不明确函数是否会影响数据一致性的情况下,MySQL默认会阻止这类操作。
mysql> show variables like 'log_bin_trust_function_creators';
+---------------------------------+-------+
| Variable_name | Value |
+---------------------------------+-------+
| log_bin_trust_function_creators | OFF | //该参数默认为OFF,表示开启log-bin的slave阻止创建未明确指定类型的函数
+---------------------------------+-------+
1 row in set (0.01 sec)
# 函数创建指定类型有
DETERMINISTIC //不确定是否会影响数据一致性,
NO SQL //没有SQl语句,保证不影响数据一致性
READS SQL DATA //只是读取数据,保证不影响数据一致性
3、问题处理
mysql> stop slave;
mysql> set sql_log_bin=0;
mysql> set global log_bin_trust_function_creators=TRUE;
mysql> 手动执行对应的存储过程
mysql> set global log_bin_trust_function_creators=0;
mysql> set sql_log_bin=1;
1、具体报错
1) MySQL 复制信息
mysql> show slave status\G *************************** 1. row *************************** Master_Log_File: mysql-bin.000400 Read_Master_Log_Pos: 363471322 Relay_Log_File: relay-log.000236 Relay_Log_Pos: 197156629 Relay_Master_Log_File: mysql-bin.000397 Slave_IO_Running: Yes Slave_SQL_Running: No Exec_Master_Log_Pos: 197156408 Relay_Log_Space: 1936343145 Last_SQL_Errno: 1594 Last_SQL_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Replicate_Ignore_Server_Ids: Master_Server_Id: 19670500 Master_UUID: 5d4178f0-d6eb-11e9-bf91-0242f977238f Master_Info_File: mysql.slave_master_info Retrieved_Gtid_Set: 5d4178f0-d6eb-11e9-bf91-0242f977238f:398658257-399871744:399871746-496665236, 5e07d619-d6eb-11e9-beff-0242e77c7e67:145857044-146322581 Executed_Gtid_Set: 012ef0fc-4ae3-11e9-8406-28a6db6245e4:1-31634397, 5d4178f0-d6eb-11e9-bf91-0242f977238f:1-491760413, 5e07d619-d6eb-11e9-beff-0242e77c7e67:1-146297997, e807c5a5-4ae2-11e9-845f-2c55d3e93d14:1-50267012 Auto_Position: 1 Replicate_Rewrite_DB: Channel_Name: Master_TLS_Version: 1 row in set (0.00 sec)
2)error log
2020-07-27T02:56:05.036042Z 2 [ERROR] Error in Log_event::read_log_event(): 'read error', data_len: 100, event_type: 31 2020-07-27T02:56:05.036069Z 2 [ERROR] Error reading relay log event for channel '': slave SQL thread aborted because of I/O error 2020-07-27T02:56:05.036096Z 2 [ERROR] Slave SQL for channel '': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (y ou can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error _code: 1594 2020-07-27T02:56:05.036109Z 2 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000397' position 197156408. 2020-07-29T15:43:29.114678Z 74970 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information. 2020-07-29T15:43:29.686895Z 74970 [ERROR] mysqld: Binary logging not possible. Message: Either disk is full or file system is read only while rotating the binlog. Aborting the server. 15:43:29 UTC - mysqld got signal 6 ; This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware. Attempting to collect some information that could help diagnose the problem. As this is a crash and something is definitely wrong, the information collection process might fail.
2、问题排查
1)从主从复制信息中我们可以发现复制中断是因为SQL线程无法解析relay log,relay可能由于某些原因导致损坏,对于这种情况我们可以手动通过mysqlbinlog工具及主从复制中具体的报错位点信息对relay log进行解析,确认该报错情况。
2)继续查看error log,从error log中我们可以发现本次导致relay log损坏的原因可能是由于I/O
3)一般可能会导致relay log损坏的情况有:数据库重启、网络抖动、空间不足日志无法写入等
3、问题处理
对于relay log无法正常解析的报错,我们需要重新指定主从复制来进行恢复。若主从复制使用的是非GTID模式可以根据error log提示的位点信息重新change master,若使用的是GTID模式,那我们只需要指定MASTER_AUTO_POSITION=1即可,具体操作如下:
mysql> stop slave;
mysql> CHANGE MASTER TO
MASTER_HOST='172.26.44.1',
MASTER_USER='rds_repl',
MASTER_PASSWORD='xxx',
MASTER_PORT=3044,
MASTER_AUTO_POSITION=1;
mysql> start slave;
1、具体报错
1) MySQL 主从复制报错
mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Master_Log_File: mysql-bin.000402 Read_Master_Log_Pos: 311570312 Relay_Log_File: relay-log.000124 Relay_Log_Pos: 311570536 Relay_Master_Log_File: mysql-bin.000402 Slave_IO_Running: No Slave_SQL_Running: No Exec_Master_Log_Pos: 311570251 Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 1872 Last_SQL_Error: Slave failed to initialize relay log info structure from the repository Master_UUID: 5d4178f0-d6eb-11e9-bf91-0242f977238f Master_Info_File: mysql.slave_master_info Retrieved_Gtid_Set: 5d4178f0-d6eb-11e9-bf91-0242f977238f:448797409-499521040, 5e07d619-d6eb-11e9-beff-0242e77c7e67:146080492-146337012 Executed_Gtid_Set: 012ef0fc-4ae3-11e9-8406-28a6db6245e4:1-31634397, 5d4178f0-d6eb-11e9-bf91-0242f977238f:1-499521040, 5e07d619-d6eb-11e9-beff-0242e77c7e67:1-146337012, d00323ab-8ce0-11e9-99bc-0242065ece34:1-32, e807c5a5-4ae2-11e9-845f-2c55d3e93d14:1-50267012 Auto_Position: 1 Replicate_Rewrite_DB: Channel_Name: Master_TLS_Version: 1 row in set (0.00 sec)
2)error log
2020-07-31T03:51:03.357467Z 0 [ERROR] Failed to open the relay log '/data/mysql/binlog/relay-log.000124' (relay_log_pos 311570536).
2020-07-31T03:51:03.357484Z 0 [ERROR] Could not find target log file mentioned in relay log info in the index file '/data/mysql/binlog/relay-log.index' during relay log initialization.
2020-07-31T03:51:03.409170Z 0 [ERROR] Slave: Failed to initialize the master info structure for channel ''; its record may still be present in 'mysql.slave_master_info' table, consider deleting it.
2020-07-31T03:51:03.409211Z 0 [ERROR] Failed to create or recover replication info repositories.
2020-07-31T03:51:03.409247Z 0 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2020-07-31T03:51:03.409259Z 0 [ERROR] mysqld: Slave failed to initialize relay log info structure from the repository
2020-07-31T03:51:03.409267Z 0 [ERROR] Failed to start slave threads for channel ''
2020-07-31T04:55:43.905528Z 198 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2020-07-31T04:55:50.286848Z 198 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2020-07-31T04:57:22.006869Z 198 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
3)系统日志
Jul 31 11:51:31 dbinstance3 dockerd: time="2020-07-31T11:51:31.757931233+08:00" level=info msg="Container 53feb8e40d8d4b7712675f1cef144c243a34b068e0193b568cb4e6de8cb7a681 failed to exit within 10 seconds of signal 15 - using the force"
Jul 31 11:51:34 dbinstance3 containerd: time="2020-07-31T11:51:34.110864629+08:00" level=info msg="shim reaped" id=53feb8e40d8d4b7712675f1cef144c243a34b068e0193b568cb4e6de8cb7a681
Jul 31 11:51:34 dbinstance3 dockerd: time="2020-07-31T11:51:34.120638146+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 31 11:51:34 dbinstance3 containerd: time="2020-07-31T11:51:34.464111797+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/53feb8e40d8d4b7712675f1cef144c243a34b068e0193b568cb4e6de8cb7a681/shim.sock" debug=false pid=23184
2、问题排查
1)根据Last_SQL_Error以及errorlog日志,我们可以看到主从复制中断是由于无法打开relay-log.000124,并且在后续准备通过relay-log.index重新初始化relay log时初始化失败。
2)根据以上报错,我们进入到数据库的datadir下,检查确认relay-log.index文件,发现该文件的relog编号开始从1开始重新计数,relay-log.000124丢失
3)查看系统日志我们发现一个比较可以的docker失败退出的信息(数据库是通过docker容器启动),基于此报错我们通过docker ps查看该数据库实例的容器启动时间,发现启动时间与故障发生时间刚好吻合。但是由于该docker并没有输出相关日志,所以不方便我们继续排查什么原因导致relay log丢失。
3、问题处理
对于整个relay log丢失,relay log初始化失败的情况,同样我们需要重新指定主从复制来进行恢复。若非GTID模式需要指定具体的binlog文件以及位点信息;若GTID模式指定MASTER_AUTO_POSITION=1即可。
mysql> reset slave;
mysql> CHANGE MASTER TO
MASTER_HOST='172.26.44.1',
MASTER_USER='rds_repl',
MASTER_PASSWORD='xxx',
MASTER_PORT=3044,
MASTER_AUTO_POSITION=1;
mysql> start slave;
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。