InnoDB MVCC

2017-05-16

database

Basic

InnoDB中通过Undo log实现了txn rollback和MVCC，而并发控制(isolation)通过锁来实现
Undo log分为insert undo和update undo(delete是一种特殊的update)，回滚时

insert只要把insert undo log丢弃即可
update需要通过DB_ROLL_PTR DB_TRX_ID找到事务修改前的版本并恢复

与redo log不同的是，磁盘上不存在单独的undo log文件，所有的undo log均存放在主ibd数据文件中（表空间），即使客户端设置了每表一个数据文件也是如此

内部存储

InnoDB为每行row都reserved了隐藏字段(system column)

DB_ROW_ID
DB_TRX_ID
DB_ROLL_PTR

typedef ib_uint64_t ib_id_t;
typedef ib_id_t row_id_t;
typedef ib_id_t trx_id_t;
typedef ib_id_t roll_ptr_t;

syscol

Undo log实现方式

事务以排他锁的形式修改原始数据
把修改前的数据存放于undo log，通过回滚指针与主数据关联
修改成功（commit）啥都不做，失败则恢复undo log中的数据（rollback）

Demo

demo
demo
demo

Innodb中存在purge线程，它会查询那些比现在最老的活动事务还早的undo log，并删除它们

Issues

回滚成本

当事务正常提交时Innbod只需要更改事务状态为COMMIT即可，不需做其他额外的工作
而Rollback如果事务影响的行非常多，回滚则可能成本很高

write skew

InnoDB通过Undo log实现的MVCC在修改单行记录是没有问题的，但多行时就可能出问题

begin;
update table set col1=2 where id=1; // 成功，创建了undo log
update table set col2=3 where id=2; // 失败
rollback;

回滚row(id=1)时，由于它没有被lock，此时可能已经被另外一个txn给修改了，这个回滚会破坏已经commit的事务

如果要解决这个问题，需要应用层进行控制

另外一个例子

国家规定一个家庭最多养3只宠物(constraint)，Alice和Bob是一家人，他们现在有dog(1)+cat(1)
如果并发地，Alice再买一只猫，而Bob再买一只狗，这2个事务就会write skew
因为repeatable read：一个事务提交之后才visible by other txn
TxnAlice      TxnBob
cat=cat+1     dog=dog+1
这2个事务都成功了，但却破坏了constraint

锁

RR隔离级别下，普通select不加锁，使用MVCC进行一致性读取，即snapshot read
update, insert, delete, select … for update, select … lock in share mode都会进行加锁，并且读取的是当前版本: READ COMMITTED读
除了lock in share mode是S锁，其他都是X锁

References

https://dev.mysql.com/doc/refman/5.7/en/innodb-locks-set.html
https://blog.jcole.us/innodb/
http://jimgray.azurewebsites.net/WICS_99_TP/

microservice transaction

2017-05-16

一致性

References

https://docs.microsoft.com/en-us/azure/architecture/patterns/compensating-transaction

zookeeper internals

2017-05-15

一致性

Basics

Model

File api without partial R/W
No rename operation

zab通过TCP+zxid实现事务的totally order
sequential consistency一致性模型，保证the real execution looks to clients like some sequential execution in which the operations of every client appear in the order they were submitted

Implementation

ZKDatabase

// zk的内存数据库
class ZKDatabase {
    DataTree dataTree
    LinkedList<Proposal> committedLog
    FileTxnSnapLog snapLog
    ConcurrentHashMap<Long, Integer> sessionsWithTimeouts // {sessionID: timeout}
}

DataTree

class DataTree {
    ConcurrentHashMap<String, DataNode> nodes // {path: znode}, flat
    ConcurrentHashMap<Long, HashSet<String>> ephemerals // {sessionID: [path, ]}
    WatchManager dataWatches, childWatches
    func createNode(path, data) {
        parent = nodes.get(parentPath)
        parent.Lock()
        // check NodeExistsException
        // set stat of the new znode and its parent
        child = new DataNode(parent, data, stat) // the new znode
        parent.addChild(child)
        nodes.put(path, child)
        if ephemeralOwner != nil {
            ephemerals.get(ephemeralOwner).add(path)
        }
        parent.Unlock()
        dataWatches.triggerWatch(path, NodeCreated)
        childWatches.triggerWatch(parentPath, NodeChildrenChanged)
    }
}

DataNode

class DataNode {
    DataNode parent
    Set<String> children
    StatPersisted stat
    []byte data
}

n(n-1)/2 conns

只允许id比较大的server发起主动连接：由于任意server在启动时都会主动向其他server发起连接，如果这样，任意两台server之间就拥有两条连接，这明显是没有必要的

======= ======= ======= ======= ======= ========
sid     1       2       3       4       5
======= ======= ======= ======= ======= ========
1       <>      <       <       <       <
2               <>      <       <       <
3                       <>      <       <
4                               <>      <
5                                       <>
======= ======= ======= ======= ======= ========

成为 leader 的条件

选epoch最大的
epoch相等，选 zxid 最大的
epoch和zxid都相等，选择server id最大的

(newEpoch > curEpoch) ||
	((newEpoch == curEpoch) &&
		((newZxid > curZxid) ||
			((newZxid == curZxid) && (newId > curId))))

何时选举

进入LOOKING状态

刚启动时
稳定运行中，任何的异常都会让本机进入LOOKING态
1
2
3
catch (Exception e) {
setPeerState(LOOKING)
}

2PC

是个简化的2PC，因为不存在abort/rollback，只有commit

传统2PC里coordinator crash的硬伤，zk是怎么解决的？
重新选举，recover from txnlog

class LearnerHandler extends Thread {
    queuedPackets = new LinkedBlockingQueue<QuorumPacket>()
}
class Leader {
    func processAck(long zxid) {
        Proposal p = outstandingProposals.get(zxid)
        p.ackSet.add(1)
        if p.ackSet.verifyQuorum() {
            // 大多数返回ack了
            outstandingProposals.remove(zxid)
            commit(zxid) {
                // 通知all followers
                QuorumPacket qp = new QuorumPacket(Leader.COMMIT, zxid, null, null)
                sendPacket(qp) // 异步
            }
            inform(p) {
                // 通知all observers
            }
        }
    }
    func propose(req Request) {
        Proposal p = new Proposal(req)
        outstandingProposals.put(zxid, p)
        sendPacket(p) {
            for LearnerHandler f = range followers {
                f.queuePacket(p)
            }
        }
    }
}
class Follower {
    func processPacket(QuorumPacket qp) {
        switch qp.type {
            case Leader.PROPOSAL:
                // proposal只是记录txnlog，不改变database
                FollowerZooKeeperServer.logRequest() {
                    pendingTxns.add(req)
                    txnLog.append(req)
                }
            case Leader.COMMIT:
                FollowerZooKeeperServer.commit(qp.zxid) {
                    commitProcessor.commit(pendingTxns.remove()) {
                        committedRequests.add(req) {
                            FinalRequestProcessor.processRequest(req) {
                                ZooKeeperServer.processTxn -> ZKDatabase().processTxn
                                根据req.type来创建response，并发送
                            }
                        }
                    }
                }
        }
    }
}

Constraints

Many ZooKeeper write requests are conditional in nature:

a znode can only be deleted if it does not have any children
a znode can be created with a name and a sequence number appended to it
a change to data will only be applied if it is at an expected version

Quorum

func isQuorum(type) {
    // zk的请求有2种 1. 事务请求 2. 只读请求
    switch (type) {
    case OpCode.exists, getACL, getChildren, getChildren2, getData:
        // 本地执行，不需要proposal
        return false
    case OpCode.error, closeSession, create, createSession, delete, setACL, setData, check, multi:
        return true
    default:
        return false
    }
}

注意：session部分，也会走txn

multi

是原子操作，multi里的每个op都使用相同的zxid

Watch

Watches are maintained locally at the ZooKeeper server to which the client is connected.
它是不走proposal quorum的

Watcher只会告诉客户端发生了什么类型的事件，而不会说明事件的具体内容
例如，NodeDataChanged，watcher只会通知client：
在你watch的path上，发生了NodeDataChanged这个事件
但最新的数据是什么，不在event里，而需要client主动重新去get

Watch的通知，由WatchManager完成，它先从内存里删除这个watcher，然后回调watcher.process
后者在NIOServerCnxn，即watches are sent asynchronously to watchers(client).
But ZooKeeper guarantees that a client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.

class WatchManager {
    HashMap<String, HashSet<Watcher>> watchTable
    func triggerWatch(path, event) {
        synchronized(this) { 
            watchers = watchTable.remove(path) // so one time trigger
        }
        // Watch机制本身是非常轻量级的，对服务器不会有多大开销：
        // 它都是local zk server在内存中处理
        // 但如果一个path的watcher很多，那么这个O(n)循环
        for watcher = range watchers {
            w.process(event)
        }
    }
}
func process(WatchedEvent event) {
    h = new ReplyHeader(-1, -1L, 0)
    sendResponse(h, event) // if IOException, close the client conn
                           // sock.write(非阻塞) async write
                           // sendResponse对同一个client是串行的，顺序的
}

顺序

client.get(path, watch=true)
// 此时数据发生变化
zk保证的顺序：
client先拿到watch event，之后才能看到最新的数据
client如果watch很多时间，那么得到这些event的顺序与server端发生的顺序是完全一致的

watch的保持

zk client连接zk1, 并get with watch，突然zk1 crash，client连接到zk2，那么watch
是如何保持的?
这是client端实现的，内存里记录watches，在pick new server后，sendSetWatches

watch的事件会丢吗

client2能获取到每次的set事件吗?

1 2	client1不停地set(node, newValue) client2 get with watch

不一定：因为是one time trigger
获取event后，要重新watch，在此过程中可能产生新的事件: 期间事件lost

此外，zk与client的conn断开后，client会连接下一个zk，在此期间的事件lost
例如，watch NodeCreated事件，在client重新连接期间，该node created，那么client将永远无法获取该事件

watch同一个znode多次，会收到几条event?

由于WatchManager的实现，相同类型的watch在一个path上被set多次，只会触发一次

1 2	create("/foo", EPHEMERAL_SEQUENTIAL) exists("/foo", watcher) // 那么这个watch事件是永远不会trigger的，因为path不同，见WatchManager的实现

Client

recvLoop里任意的错误，都会pick next server and authentication，进入新的循环

conn reset by peer
conn EOF
receive packet timeout
session expire

time

conn.recvTimeout = sessionTimeout * 2 / 3
ping interval = sessionTimeout / 3

例如，sessionTimeout=30s，那么client在等待20s还得不到response，就会try next server
恰好赶在ping的窗口期 10+20=30

Q/A

client ping是如何保持住session的?

client连接s1，定期ping，但s1 crash后client连接s2，为什么session能保持住?

connect(s2)
send(ConnectRequest{LastZxidSeen, SessionID}) // SessionID是s1当时分配的
var r ConnectResponse = recv()
if r.SessionID == 0 {
    // session expire
}

createSession会通过txn，因此client可以failover
server在sessionTimeout内没有收到ping，就会closeSession，它也通过txn

session id

zk session id assigned by server, global unique

func initializeNextSession(id=1) {
    long nextSid = 0;
    nextSid = (System.currentTimeMillis() << 24) >>> 8;
    nextSid =  nextSid | (id <<56);
    return nextSid;
}

后面的session id就是这个种子基础上 increment by 1

Snapshot

dataLogDir(txn log) and dataDir(snapshot) should be placed in 2 disk devices
如果txn log和snapshot处于同一块硬盘，异步的snapshot可能会block txn log，连锁反应就是把proposal阻塞，进而造成follower重新选举

when

System.getProperty(“zookeeper.snapCount”), 默认值100,000
takeSnapshot的时间在50,000 ~ 100,0000 之间的随机值
txn数量超过snapCount+随机数
- roll txn log
- 创建一个线程，异步执行takeSnapshot。但前面的takeSnapshot线程未完成，则放弃
  Too busy to snap, skipping

Request si = getRequest()
if (zks.getZKDatabase().append(si)) { // txn log ok
    logCount++; // logCount就是txn的数量
    if (logCount > (snapCount / 2 + randRoll)) {
        randRoll = r.nextInt(snapCount/2); // 为了防止集群内所有节点同时takeSnapshot加入随机
        zks.getZKDatabase().rollLog(); // txn log will roll
        if (snapInProcess != null && snapInProcess.isAlive()) {
            LOG.warn("Too busy to snap, skipping");
        } else {
            snapInProcess = new Thread("Snapshot Thread") {
                public void run() {
                    try {
                        zks.takeSnapshot();
                    } catch(Exception e) {
                        LOG.warn("Unexpected exception", e);
                    }
                }
            };
            snapInProcess.start();
        }
        logCount = 0;
    }
}

size

每个znode meta data至少76+path+data，如果1M znodes，平均size(path+data)=100，那么snapshot文件长度至少200MB
我的一个生产环境zk，znode 3万，snapshot文件15MB；即，如果300万个znodes，那么snapshot文件将是1.5GB

path(len4, path)
node
  data(len4, data)
  acl8
  meta60
    czxid8, mzxid8, ctime8, mtime8, version4, cversion4, aversion4, ephemeralOwner8, pzxid8

checksum

Adler32

Edge cases

leader election

LOOKING后，把自己的zxid广播，是得到大多数同意就成为leader?
是，不需要等全部ack

async commit

[S1(leader), S2, S3]
S1 Propose set(a)=5，在得到majority Ack(proposal)后，向所有机器发送Commit，
Q1. S1需要在得到majority Ack(commit)后才return OK to client?
Q2. 如果S1发送Commit给所有机器前恰好挂了，new leader会恢复这个事务吗？

leader在得到majority Ack(proposal)后，majority servers已经记录下了txnlog，leader发送Commit只是为了让servers
make the txn visibile to client，Commit消息是不会记录txnlog的
leader处理Commit是异步的，不需要等待Commit的ack，即Q1: no，Q2: yes

ZAB makes the guarantee that a proposal which has been logged by a quorum of followers will eventually be committed
any uncommited proposals from a previous epoch seen by a new leader will be committed by that leader before it becomes active.

换个角度看这个问题：
S1得到请求set(a)=5，commit locally，但commit还没有发送给S2，S3，crash!
这时候，一个client可能发请求 get(a)，如果它连接的是S1，在S1 crash前，get(a)=2
所以，这个commit必须让新leader知道

sync proposal

[S1(leader), S2, S3, S4, S5]
S1 Propose set(a)=2，发送给了S2 Proposal，但S3-5还没有收到Proposal，此时S1 crash，那么这个proposal在new leader
上会被恢复吗? 即重新选举后，get(a)=2?

不一定！

S1(leader), S2, S3, S4, S5
现在propose set(a)=b，S1确认了，但其他还没有确认，此时全部crash
然后启动S2-S5，等一会儿再启动S1，那么S2-S5他们的txid相同，会选出S5 leader
等S1启动时，它的txid是最大的，a=b可能会丢：
如果S1启动慢了200ms内，可能不会丢；否则，例如慢了1分钟，则丢了，S1变成follower后
会把该txid truncate: txnlog seek

FastLeaderElection.java

finalizeWait=200ms // 在得到majority确认后，但还没有得到全部确认，wait before make leader
sendNotifications()
for i.am.looking {
    n = recvqueue.poll()
    switch n.state {
        case LOOKING:
            compare my proposal with n and update my proposal
            if every node agrees {
                // got the leader!
                return
            }
            if quorum agrees {
                // Verify if there is any change in the proposed leader
                for {
                    n = recvqueue.poll(200ms)
                    if n == nil {
                        break
                    }
                }
            }
        case FOLLOWING, LEADING:
            if leader is not me {
                // confirm I have recv notification from leader
                // stating that he is leader
            }
    }
}

References

https://issues.apache.org/jira/browse/ZOOKEEPER-1813
https://issues.apache.org/jira/browse/ZOOKEEPER-417
https://issues.apache.org/jira/browse/ZOOKEEPER-1674
https://issues.apache.org/jira/browse/ZOOKEEPER-1642
http://blog.csdn.net/pwlazy/article/details/8080626

delay and schedule message delivery

2017-05-15

PubSub

使用场景

业务需要
通过它可以实现XA的prepare/commit/rollback，从而实现与其他系统的原子提交

实现

kateway

通过mysql作为WAL，并通过background worker(actor)来实现调度/commit/rollback

优先队列

以message due time作为优先级进行存储，配合worker
message rollback可以通过发送一个tombstone message实现
但由于worker的async，无法在rollback时判断是否真正rollback成功：
一条消息要5分钟后发送，在5分钟到达时，client可能恰好要取消，这时候，rollback与worker
之间存在race condition，需要正常处理这个一致性：
要么，取消失败，消息被发出
要么，取消成功，消息不发出
不能，取消成功，消息被发出

// worker
for {
    if msg := peek(queue); msg.due() {
        msg = pop(queue)
        if msg.isTombstone() {
            // the msg is cancelled
        } else {
            publish(msg)
        }
    }
}

mysql repeatable read write skew

2017-05-12

database

Isolation

教科书里的4种isolation

read uncommitted: 即dirty read，可能读到其他rollback的数据
read committed: 即non-repeatable read，同一个txn内读一条数据多次，结果可能不同
repeatable read: 一个txn内读一条数据多次结果相同，但不保证读多个数据的时候也相同(phantom read)
serialized

但它只是从锁的实现来描述的，不适用于MVCC。
各个数据库产品，虽然采用了这些isolation名字，但语义各不相同，很多与教科书里的定义不符

MySQL

不会出现phantom read。
MySQL里的RR其实是Snapshot Isolation，只有Serialized是完全基于锁

PostgreSQL

实际上只有2个隔离级别：Read Committed和Serialized
而Serialized是基于MVCC的无锁实现，即Serialized Snapshot

MVCC Snapshot

存在write skew问题

甲在一个银行有两张信用卡，分别是A和B。银行给这两张卡总的信用额度是2000，即A透支的额度和B透支的额度相加必须不大于2000：A+B<=2000。

A账号扣款

begin;
a = select credit from a
b = select credit from b
if (a + b) + amount <= 2000 {
    update a set credit = credit + amount
}
commit

B账号扣款

begin;
a = select credit from a
b = select credit from b
if (a + b) + amount <= 2000 {
    update b set credit = credit + amount
}
commit

假设现在credit(a)=1000, credit(b)=500, 1500<=2000
甲同时用a账号消费400，b账号消费300
在mysql RR下，2个事务都成功，但2个事务结束后
credit(a)=1400, credit(b)=700, 2100>2000

如果是serialized隔离级别，则没有问题：一个事务会失败

在mysql RR下，可以通过应用层加约束来避免write skew

结论for mysql

不能期望加了一个事务就万事大吉，而要了解每种隔离级别的语义。

涉及单行数据事务的话，只要 Read Committed ＋乐观锁就足够保证不丢写
涉及多行数据的事务的话，Serializable 隔离环境的话不会出错，但是你不会开
如果开 Repeatable Read （Snapshot）隔离级别，那么可能会因为 Write Skew 而丢掉写

如果是金融业务，尽量不要用MySQL

dual write conflict

2017-05-12

一致性

Issues

race condition
partial failure

conflicts

shard scales

2017-05-12

algorithm

sharding

mysql group replication

2017-05-11

database

简介

GR是个mysql插件，通过原子广播协议、乐观事务冲突检测实现了高可用的多master集群
每个master都有全量数据，client side load balance write workload或者使用ProxySQL
读事务都是本地执行的
有2种模式

单主，自动选主
多主，active active master

与PXC是完全的竞争产品

Requirements and Limitations

InnoDB engine only, rollback uncommitted changes
turn on binlog RBR
GTID enabled
each table MUST have a primary key或者not null unique key
no concurrent DDL
至少3台master，至多9台，不需要slave
auto_increment字段通过offset把各个master隔离开，避免冲突
cascading foreign key not supported
只是校验write set，serializable isolation NOT supported
存在stale read问题，如果write/read不在一台member
savepoints可能有问题

Performance

http://mysqlhighavailability.com/an-overview-of-the-group-replication-performance/

80% throughput of a standalone MySQL server

Internals

XCOM

eXtended COMmunications，一个Paxos系统

确保消息在所有member上相同顺序分发
动态成员，成员失效检测

理论基础 Database State Machine

事务的Update操作都在一个成员上执行，在Commit时把write-set以total order发送消息给每个成员；
每个成员上的certification进程检查事务冲突(first commit wins)，完成最终提交或回滚

Commit时的Paxos有2个作用

certification，检测事务冲突
propagate

Group Replication ensures that a transaction only commits after a majority of the members in a group have received it
and agreed on the relative order between all transactions that were sent concurrently.

与multi-paxos不同，XCOM是multi-leader/multi-proposer：每个member都是leader of its own slots

Certification

group_replication_group_name就是GTID里的UUID
GTID就是database version
mysql> select @@global.gtid_executed

transaction write set: [{updated_row_pk: GTID_EXECUTED}, {updated_row_pk: GTID_EXECUTED}, …]

GTID是由certification模块负责的，由它来负责GTID GNO的increment
所有member会定期交换GTID_EXECUTED，所有member已经committed事务的交集：Stable Set.

Transaction

async apply

Distributed Recovery

向group增加新成员的过程: 获取missing data，同时cache正在发生的新事务，最后catch up

// 从现有member里通过mysql backup工具(mysqldump等)搞个backup instance
// phase 0: join
Joiner.join(), 通过total order broadcast发给每个member
生成view change binlog event: $viewID
group里每个member(包括Joiner)都会收到该view event
每个online member会把该binlog event排队到现有transaction queue里
// phase 1: row copy
Joiner pick a live member from the group as Donor // Donor可能会有lag
Doner transmits all data up to the joining moment: master/slave connection
for {
    if binlog.event.view_id == $viewID {
        Joiner.SQLThread.stop()
        break
    }
    if Doner.dies {
        reselect Donner
        goto restart
    }
}
// phase 2: catch up
joining moment后发生的binlogDonor发给Joiner，Joiner apply
catch up同步完成后，declare Joiner online，开始对外服务
// Joiner.leave()类似的过程
// crash过程会被detector发现，自动执行Joiner.leave()

这个过程与mysql在线alter table设计原理类似

binlog view change markers

group里member变化，会产生一种新的binlog event: view change log event.
view id就是一种logicl clock，在member变化时inrement

1
2
3

+-----------------+
| epoch | counter |
+-----------------+

epoch在第一个加入group的member生成，作用是为了解决all members crash问题: avoid dup counter

certification based replication

通过group communication和total order transaction实现synchronous replication

事务在单节点乐观运行，在commit时，通过广播和冲突检测实现全局数据一致性
它需要

transactional database来rollback uncommitted changes
primary keys to generate broadcast write-set
atomic changes
global ordering replication events

certificationbasedreplication

Config

[mysqld]
log-bin
binlog-format=row
binlog-checksum=NONE
gtid-mode=ON
enforce-gtid-consistency
log-slave-updates
master-info-repository=TABLE
relay-log-info-repository=TABLE
transaction-write-set-extraction=MURMUR32
// GR
group_replication_group_name="da7bad5b-daed-da7a-ba44-da7aba5e7ab"
group_replication_local_address="host2:24901"
group_replication_group_seeds="host1:24901,host2:24901,host3:24901"

FAQ

GR是同步还是异步？

replication分为5步

master locally apply
master generate binlog event
master sending the event to slave(s)
slave IO thread add event to relay log
slave SQL thread apply the event from relay log

GR下，只有3是同步的: 把write set广播并得到majority certify confirm
广播时发送消息是同步的，但apply write set还是异步的:

member1: DELETE FROM a; // a table with many rows
member1: 产生一个非常大的binlog event
member1: group communicate the binlog event to all members(包括它自己)
其他member确认ok，那么member1就返回ok给client
client访问member1，那么数据是一致的
但其他member在异步apply binlog event，可能花很长时间，这时候client访问member2，可能不一致：
delete的数据仍然能读出来

async apply

References

http://lefred.be/content/mysql-group-replication-about-ack-from-majority/
http://lefred.be/content/mysql-group-replication-synchronous-or-asynchronous-replication/
http://lefred.be/content/galera-replication-demystified-how-does-it-work/
http://www.tocker.ca/2014/12/30/an-easy-way-to-describe-mysqls-binary-log-group-commit.html
http://mysqlhighavailability.com/tag/mysql-group-replication/
http://mysqlhighavailability.com/mysql-group-replication-transaction-life-cycle-explained/

mysql在线alter table设计

2017-05-11

database

主要逻辑

// 合法性检查
// 包括：是否有外键、用户权限、表是否合法、是否有唯一键等
// 创建变更记录表
CREATE /* changelog table */ TABLE _tbs_c
// 创建影子表
CREATE /* shadow table */ TABLE _tbl_s LIKE tbl
// 在影子表上应用alter语句
ALTER TABLE _tbl_s STATEMENT
// 开始行拷贝线程 tbl -> _tbl_s
// 开始binlog接收和应用线程 binlog -> _tbl_s
// 等待行拷贝线程完成
// 通知binlog线程收工
// 等待binlog线程结束
// 开始切换
LOCK TABLES tbl WRITE
RENAME TABLE tbl TO _tbl_old, _tbl_s TO tbl 
UNLOCK TABLES

确定行拷贝chunk范围

1
2
3

select id from 
   (select id from a where id>=0 and id<=3001 order by id asc limit 1000) select_osc_chunk 
order by id desc limit 1;

行拷贝in chunk

begin;
insert ignore into `a`.`_a_gho` (`id`, `value`)
  (select `id`, `value` from `a`.`a` force index (`PRIMARY`)
      where (((`id` > ?) or ((`id` = ?))) and ((`id` < ?) or ((`id` = ?)))) 
          lock in share mode
  )
commit;

关键点

async binlog worker如何判断所有数据变更已经完成

binlog worker向changelog table发一行记录，在收到这个记录时，即表示完成

RENAME race condition with DML

mysql内部保证，LOCK TABLE后，如果有DML与RENAME并发操作，那么在UNLOCK TABLES时，RENAME
一定获取最高优先级，即：RENAME一定会先执行。
否则，会丢数据:

LOCK TABLE WRITE
INSERT // blocked
RENAME // blocked
UNLOCK TABLE
INERT // 如果INSERT先执行，那么它会插入原表
RENAME // 原表被rename to tbl_old，刚才INSERT的数据丢失: 存放在了tbl_old

LOCK, RENAME

如果在一个mysql连接内执行LOCK; RENAME，那么会失败
解决办法：创建2个mysql连接，分别执行LOCK和RENAME

两段锁 2PL

2017-05-11

database

事务开始后就处于加锁阶段，一直到执行ROLLBACK和COMMIT之前都是加锁阶段。ROLLBACK和COMMIT使事务进入解锁阶段

事务遵守两段锁协议是可串行化调度的充分条件，而不是必要条件

MS SQL Server默认采用2PL实现isolation，Oralce/PostgreSQL/MySQL InnoDB默认使用MVCC

MySQL 2PL提供2种锁

shared(read)
exclusive(write)

读写互斥，但读读不互斥

MySQL serialized isolation
2pl