Mysql:主从不同步故障

Posted by zhangshun on October 29, 2019

前言

今早晨看告警信息时发现,两台数据库主从断开了,我差点人没了
登录服务器,show slave status\G;
DB01:

1
2
3
4
5
mysql> show slave status\G;
					...... 
                   Last_Errno: 1062
                   Last_Error: Could not execute Write_rows event on table zeus.xinyan_cache; Duplicate entry 'b98e4fab203607559fc5347b60cb64a41cf50a79650fbda3a1a4be5a1cfd16d9' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql2-bin.000006, end_log_pos 560082858
					......

DB02:

1
2
3
4
5
MySQL [zeus]> show slave status\G;
					...... 
                   Last_Errno: 1062
                   Last_Error: Could not execute Write_rows event on table zeus.xinyan_cache; Duplicate entry 'b98e4fab203607559fc5347b60cb64a41cf50a79650fbda3a1a4be5a1cfd16d9' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql1-bin.000001, end_log_pos 10867425
					......

两台数据库做的双主架构,可以看出:
zeus.xinyan_cache表中插入数据时有主键冲突
DB01在同步DB02二进制日志的560082858位置时发生了1062主键冲突的报错
DB02在同步DB01二进制日志的10867425位置时发生了1062主键冲突的报错

查看mysql的error.log

1
2
3
2019-10-29T03:29:51.885616+08:00 11 [ERROR] Slave SQL for channel '': Could not execute Write_rows event on table zeus.xinyan_cache; Duplicate entry 'b98e4fab203607559fc5347b60cb64a41cf50a79650fbda3a1a4be5a1cfd16d9' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql2-bin.000006, end_log_pos 560082858, Error_code: 1062
2019-10-29T03:29:51.885651+08:00 11 [Warning] Slave: Duplicate entry 'b98e4fab203607559fc5347b60cb64a41cf50a79650fbda3a1a4be5a1cfd16d9' for key 'PRIMARY' Error_code: 1062
2019-10-29T03:29:51.885662+08:00 11 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql2-bin.000006' position 560082351

通过mysqlbinlog分析

Mysql的主从状态跟error.log都不能定位问题,So…只能使用mysqlbinlog来确定具体出错的sql
于是在DB01上执行mysqlbinlog /data/mysql/mysql1-bin.000001,得到:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
......
# at 617747
#191026 14:34:01 server id 2  end_log_pos 617812 CRC32 0xeca51c85 	Table_map: `zeus`.`xinyan_cache` mapped to number 111
# at 617812
#191026 14:34:01 server id 2  end_log_pos 618042 CRC32 0xa20e87c9 	Write_rows: table id 111 flags: STMT_END_F

BINLOG '
2eizXRMCAAAAQQAAAFRtCQAAAG8AAAAAAAEABHpldXMADHhpbnlhbl9jYWNoZQAEDw8PEQcAAQAB
oA8ACIUcpew=
2eizXR4CAAAA5gAAADpuCQAAAG8AAAAAAAEAAgAE//AgAGI4NjA5NTI2MjIyOTkwNDhkZTYxNzkw
MzRiYmM4YzVkIAAzYjk1ZDM0ZDcyYjE5OWQyMTRjYzJlNTE2Mjg0NjRmYXgAeyJ6emNfY3VzdG9t
NSI6IjAiLCJ6emNfY3VzdG9tNiI6IjAuMDAiLCJ6emNfY3VzdG9tMyI6IjAuMDAiLCJ6emNfY3Vz
dG9tNCI6IjAiLCJ6emNfY3VzdG9tMSI6IjEiLCJ6emNfY3VzdG9tMiI6IjEwMC4wMCJ9XbPo2cmH
DqI=
'/*!*/;
......

查看mysql的binlog记录格式为mixed,sql全都被加密了,这完全看不了啊
执行mysqlbinlog --base64-output=decode-rows -v --start-datetime="2019-10-29 03:28:40" --stop-datetime="2019-10-29 03:30:19" /data/mysql/mysql1-bin.000001 > /home/zhangshun/binlog.sql,将mixed格式改为详细的raw格式输出,得到:
DB01中的binlog

1
2
3
4
5
6
7
8
9
10
11
# at 10867125	//事件的起点
#191029  3:29:51 server id 1  end_log_pos 10867425 CRC32 0x9ba82922     Write_rows: table id 111 flags: STMT_END_F
### INSERT INTO `zeus`.`xinyan_cache`
### SET
###   @1='b98e4fab203607559fc5347b60cb64a41cf50a79650fbda3a1a4be5a1cfd16d9'
###   @2='494d6793b9d3e54ecba8820a7db4c30c91bf9409b5420b1b7a54663743611fe3'
###   @3='{"zzc_custom5":"1","zzc_custom6":"115.18","zzc_custom3":"2847.77","zzc_custom4":"1","zzc_custom1":"4","zzc_custom2":"3600.00"}'
###   @4=1572290991		//sql语句
# at 10867425
#191029  3:29:51 server id 1  end_log_pos 10867456 CRC32 0x229114d9     Xid = 1834157	//执行时间,及位置戳,Xid:事件指示提交的XA事务
COMMIT/*!*/;

DB02中的binlog

1
2
3
4
5
6
7
8
9
10
11
# at 560082558	//事件的起点
#191029  3:29:51 server id 2  end_log_pos 560082858 CRC32 0x680e4503    Write_rows: table id 226 flags: STMT_END_F
### INSERT INTO `zeus`.`xinyan_cache`
### SET
###   @1='b98e4fab203607559fc5347b60cb64a41cf50a79650fbda3a1a4be5a1cfd16d9'
###   @2='494d6793b9d3e54ecba8820a7db4c30c91bf9409b5420b1b7a54663743611fe3'
###   @3='{"zzc_custom5":"1","zzc_custom6":"115.18","zzc_custom3":"2847.77","zzc_custom4":"1","zzc_custom1":"4","zzc_custom2":"3600.00"}'
###   @4=1572290991		//sql语句
# at 560082858
#191029  3:29:51 server id 2  end_log_pos 560082889 CRC32 0xc0a5040a    Xid = 17690148	//执行时间,及位置戳,Xid:事件指示提交的XA事务
COMMIT/*!*/;

在DB01跟DB02的binlog中,两条insert语句完全一致,在xinyan_cache表中:pid跟mobile是主键
也就是说DB01在同步DB02的binlog时,表里面已经存在相同的数据,反之DB02也一样

为什么两台数据库的binlog中会出现相同的insert呢?

去应用服务器上查日志发现,在凌晨3点的时候有客户连续发送了两条相同pid的请求,两台数据库做的jdbc:loadblance,恰巧两条相同的sql分别落到了DB01、DB02

由于极短时间内Mysql主从未来得及同步,所以才会出现主键冲突报错

解决方案

通过内部讨论,xinyan_cache表对数据要求没那么严格,出现这种情况的时候2条sql都是相同的,可能是客户端加了retry环节,所以只需插入一条即可。并且该库其他表都是自增id作为主键,不会出现主键冲突,初步确定的解决方案为:

修改my.cnf,当出现主键冲突1062错误码时自动跳过

1
2
3
4
vim /etc/my.cnf
...
slave_skip_errors=1062
...