mysql group replication

简介

GR是个mysql插件,通过原子广播协议、乐观事务冲突检测实现了高可用的多master集群
每个master都有全量数据,client side load balance write workload或者使用ProxySQL
读事务都是本地执行的
有2种模式

  • 单主,自动选主
  • 多主,active active master

与PXC是完全的竞争产品

Requirements and Limitations

  • InnoDB engine only, rollback uncommitted changes
  • turn on binlog RBR
  • GTID enabled
  • each table MUST have a primary key或者not null unique key
  • no concurrent DDL
  • 至少3台master,至多9台,不需要slave
  • auto_increment字段通过offset把各个master隔离开,避免冲突
  • cascading foreign key not supported
  • 只是校验write set,serializable isolation NOT supported
  • 存在stale read问题,如果write/read不在一台member
  • savepoints可能有问题

Performance

http://mysqlhighavailability.com/an-overview-of-the-group-replication-performance/

80% throughput of a standalone MySQL server

Internals

GR

XCOM

eXtended COMmunications,一个Paxos系统

  • 确保消息在所有member上相同顺序分发
  • 动态成员,成员失效检测

理论基础 Database State Machine

事务的Update操作都在一个成员上执行,在Commit时把write-set以total order发送消息给每个成员;
每个成员上的certification进程检查事务冲突(first commit wins),完成最终提交或回滚

Commit时的Paxos有2个作用

  • certification,检测事务冲突
  • propagate

Group Replication ensures that a transaction only commits after a majority of the members in a group have received it
and agreed on the relative order between all transactions that were sent concurrently.

与multi-paxos不同,XCOM是multi-leader/multi-proposer:每个member都是leader of its own slots

Certification

group_replication_group_name就是GTID里的UUID
GTID就是database version
mysql> select @@global.gtid_executed

transaction write set: [{updated_row_pk: GTID_EXECUTED}, {updated_row_pk: GTID_EXECUTED}, …]

GTID是由certification模块负责的,由它来负责GTID GNO的increment
所有member会定期交换GTID_EXECUTED,所有member已经committed事务的交集:Stable Set.

Transaction

async apply
async apply

Distributed Recovery

向group增加新成员的过程: 获取missing data,同时cache正在发生的新事务,最后catch up

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
// 从现有member里通过mysql backup工具(mysqldump等)搞个backup instance
// phase 0: join
Joiner.join(), 通过total order broadcast发给每个member
生成view change binlog event: $viewID
group里每个member(包括Joiner)都会收到该view event
每个online member会把该binlog event排队到现有transaction queue里
// phase 1: row copy
Joiner pick a live member from the group as Donor // Donor可能会有lag
Doner transmits all data up to the joining moment: master/slave connection
for {
if binlog.event.view_id == $viewID {
Joiner.SQLThread.stop()
break
}
if Doner.dies {
reselect Donner
goto restart
}
}
// phase 2: catch up
joining moment后发生的binlogDonor发给Joiner,Joiner apply
catch up同步完成后,declare Joiner online,开始对外服务
// Joiner.leave()类似的过程
// crash过程会被detector发现,自动执行Joiner.leave()

这个过程与mysql在线alter table设计原理类似

binlog view change markers

group里member变化,会产生一种新的binlog event: view change log event.
view id就是一种logicl clock,在member变化时inrement

1
2
3
+-----------------+
| epoch | counter |
+-----------------+

epoch在第一个加入group的member生成,作用是为了解决all members crash问题: avoid dup counter

certification based replication

通过group communication和total order transaction实现synchronous replication

事务在单节点乐观运行,在commit时,通过广播和冲突检测实现全局数据一致性
它需要

  • transactional database来rollback uncommitted changes
  • primary keys to generate broadcast write-set
  • atomic changes
  • global ordering replication events

certificationbasedreplication

Config

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
[mysqld]
log-bin
binlog-format=row
binlog-checksum=NONE
gtid-mode=ON
enforce-gtid-consistency
log-slave-updates
master-info-repository=TABLE
relay-log-info-repository=TABLE
transaction-write-set-extraction=MURMUR32
// GR
group_replication_group_name="da7bad5b-daed-da7a-ba44-da7aba5e7ab"
group_replication_local_address="host2:24901"
group_replication_group_seeds="host1:24901,host2:24901,host3:24901"

FAQ

GR是同步还是异步?

replication分为5步

1
2
3
4
5
master locally apply
master generate binlog event
master sending the event to slave(s)
slave IO thread add event to relay log
slave SQL thread apply the event from relay log

GR下,只有3是同步的: 把write set广播并得到majority certify confirm
广播时发送消息是同步的,但apply write set还是异步的:

1
2
3
4
5
6
7
8
9
10
member1: DELETE FROM a; // a table with many rows
member1: 产生一个非常大的binlog event
member1: group communicate the binlog event to all members(包括它自己)
其他member确认ok,那么member1就返回ok给client
client访问member1,那么数据是一致的
但其他member在异步apply binlog event,可能花很长时间,这时候client访问member2,可能不一致:
delete的数据仍然能读出来

async apply

References

http://lefred.be/content/mysql-group-replication-about-ack-from-majority/
http://lefred.be/content/mysql-group-replication-synchronous-or-asynchronous-replication/
http://lefred.be/content/galera-replication-demystified-how-does-it-work/
http://www.tocker.ca/2014/12/30/an-easy-way-to-describe-mysqls-binary-log-group-commit.html
http://mysqlhighavailability.com/tag/mysql-group-replication/
http://mysqlhighavailability.com/mysql-group-replication-transaction-life-cycle-explained/

Share Comments

mysql在线alter table设计

主要逻辑

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
// 合法性检查
// 包括:是否有外键、用户权限、表是否合法、是否有唯一键等
// 创建变更记录表
CREATE /* changelog table */ TABLE _tbs_c
// 创建影子表
CREATE /* shadow table */ TABLE _tbl_s LIKE tbl
// 在影子表上应用alter语句
ALTER TABLE _tbl_s STATEMENT
// 开始行拷贝线程 tbl -> _tbl_s
// 开始binlog接收和应用线程 binlog -> _tbl_s
// 等待行拷贝线程完成
// 通知binlog线程收工
// 等待binlog线程结束
// 开始切换
LOCK TABLES tbl WRITE
RENAME TABLE tbl TO _tbl_old, _tbl_s TO tbl
UNLOCK TABLES

确定行拷贝chunk范围

1
2
3
select id from
(select id from a where id>=0 and id<=3001 order by id asc limit 1000) select_osc_chunk
order by id desc limit 1;

行拷贝in chunk

1
2
3
4
5
6
7
begin;
insert ignore into `a`.`_a_gho` (`id`, `value`)
(select `id`, `value` from `a`.`a` force index (`PRIMARY`)
where (((`id` > ?) or ((`id` = ?))) and ((`id` < ?) or ((`id` = ?))))
lock in share mode
)
commit;

关键点

async binlog worker如何判断所有数据变更已经完成

binlog worker向changelog table发一行记录,在收到这个记录时,即表示完成

RENAME race condition with DML

mysql内部保证,LOCK TABLE后,如果有DML与RENAME并发操作,那么在UNLOCK TABLES时,RENAME
一定获取最高优先级,即:RENAME一定会先执行。
否则,会丢数据:

1
2
3
4
5
6
7
LOCK TABLE WRITE
INSERT // blocked
RENAME // blocked
UNLOCK TABLE
INERT // 如果INSERT先执行,那么它会插入原表
RENAME // 原表被rename to tbl_old,刚才INSERT的数据丢失: 存放在了tbl_old

LOCK, RENAME

如果在一个mysql连接内执行LOCK; RENAME,那么会失败
解决办法:创建2个mysql连接,分别执行LOCK和RENAME

Share Comments

两段锁 2PL

事务开始后就处于加锁阶段,一直到执行ROLLBACK和COMMIT之前都是加锁阶段。ROLLBACK和COMMIT使事务进入解锁阶段

事务遵守两段锁协议是可串行化调度的充分条件,而不是必要条件

MS SQL Server默认采用2PL实现isolation,Oralce/PostgreSQL/MySQL InnoDB默认使用MVCC

MySQL 2PL提供2种锁

  • shared(read)
  • exclusive(write)

读写互斥,但读读不互斥

MySQL serialized isolation
2pl

Share Comments