cannot have exactly-once delivery

http://bravenewgeek.com/you-cannot-have-exactly-once-delivery/


RocketMQ Explained

Features

  • Producer Group
    When sending transactional messages it acts as the TC (transaction coordinator); it must run on multiple machines and persist the transaction state table {offset: P/C/R}
  • Broker tag-based message filter
  • Scheduled messages: arbitrary precision is not supported, only fixed delay levels such as 5s, 10s, 1m; queueID = delayLevel - 1
    Consequently, message revoke is presumably not supported
  • Commit log and consume log are separated, much like the WAL vs. table relationship
    They can be placed on different filesystems, but there is no finer-grained placement control
    The benefit of the extra dispatch step: dispatching can be skipped

Commit Log

${rocketmq.home}\store\commitlog\${fileName}
fileName[n] = fileName[n-1] + mappedFileSize
To keep mappedFileSize identical, padding is appended at each file's tail; the default size is 1GB
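The naming rule above can be sketched in Go. The 20-digit zero-padded name equal to the segment's starting offset matches RocketMQ's convention; `fileNameOf` itself is a made-up helper for illustration:

```go
package main

import "fmt"

const mappedFileSize = 1 << 30 // 1GB, the default commit log segment size

// fileNameOf returns the 20-digit, zero-padded name of segment n, which
// equals the starting physical offset of that segment, so
// fileName[n] = fileName[n-1] + mappedFileSize holds by construction.
func fileNameOf(n int64) string {
	return fmt.Sprintf("%020d", n*mappedFileSize)
}

func main() {
	fmt.Println(fileNameOf(0)) // 00000000000000000000
	fmt.Println(fileNameOf(1)) // 00000000001073741824
}
```

Because the file name is the starting offset, locating the segment that holds a given physical offset is a simple division.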

Each message

QueueOffset: for an ordinary message, it stores the offset within the consume log; for a transactional message, the offset within the transaction state table
+---------+-------+-----+---------+------+-------------+----------------+----------------+
| MsgSize | Magic | CRC | QueueID | Flag | QueueOffset | PhysicalOffset | SysFlag(P/C/R) |
+---------+-------+-----+---------+------+-------------+----------------+----------------+
+--------------+------------------+-----------+---------------+----+------+-------+------+
| ProducedTime | ProducerHostPort | StoreTime | StoreHostPort | .. | Body | Topic | Prop |
+--------------+------------------+-----------+---------------+----+------+-------+------+

Every append to the commit log synchronously calls dispatch, which distributes the entry to the consume queue and the index service

new DispatchRequest(topic, queueId,
result.getWroteOffset(), result.getWroteBytes(),
tagsCode, msg.getStoreTimestamp(),
result.getLogicsOffset(), msg.getKeys(),
// Transaction
msg.getSysFlag(),
msg.getPreparedTransactionOffset());

queue

A queue is purely a logical concept; producers use it for load balancing, similar to virtual nodes in consistent hashing
On each broker, the commit log is shared by all queues on that machine, with no distinction between them

broker1: queue0, queue2
broker2: queue0,
then, topicA has 3 queues:
broker1_queue0, broker1_queue2, broker2_queue0
producer.selectOneMessageQueue("topicA", "broker1", "queue0")

Partial (per-queue) message ordering is guaranteed by the producer client
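A minimal Go sketch of how a producer might balance over the flat queue list above (the round-robin index mimics `selectOneMessageQueue`; the types and field names are illustrative, not the real RocketMQ classes):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// MessageQueue is a simplified stand-in: a topic's queues form one flat
// (broker, queueId) list, e.g. broker1_queue0, broker1_queue2, broker2_queue0.
type MessageQueue struct {
	Broker  string
	QueueID int
}

type TopicPublishInfo struct {
	queues []MessageQueue
	index  int64
}

// SelectOneMessageQueue rotates an index over the queue list, spreading
// sends across brokers, similar to virtual nodes in consistent hashing.
func (t *TopicPublishInfo) SelectOneMessageQueue() MessageQueue {
	i := atomic.AddInt64(&t.index, 1)
	return t.queues[int(i)%len(t.queues)]
}

func main() {
	info := &TopicPublishInfo{queues: []MessageQueue{
		{"broker1", 0}, {"broker1", 2}, {"broker2", 0},
	}}
	for i := 0; i < 4; i++ {
		q := info.SelectOneMessageQueue()
		fmt.Println(q.Broker, q.QueueID)
	}
}
```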

Question

  • How to implement retention by topic: not implemented
    Expiry is judged solely by the commit log file's mtime, even though the file mixes multiple topics
  • How to do I/O balancing
  • How to do compression
  • If a CRC check fails, are all topics affected?
  • Why store StoreHostPort? How to migrate a topic: migration is impossible
  • Writing the commit log requires a lock whose granularity is too coarse: effectively a db-level lock rather than table-level
  • Broker split-brain
  • Failover
  • A topic's commit log is scattered across all brokers

Consume Queue

${rocketmq.home}/store/consumequeue/${topicName}/${queueId}/${fileName}

Reading a message first consults the consume queue (like a MySQL secondary index), then the commit log (the clustered index)

sendfile is not used; mmap is used instead, because the reads are random

+---------------------+-----------------+------------------------+
| CommitLogOffset(8B) | MessageSize(4B) | MessageTagHashcode(8B) |
+---------------------+-----------------+------------------------+

Although the consume queue itself is read sequentially during consumption, the subsequent commit log reads are almost all random. Besides,
how should compression be optimized? Page cache + readahead alone is far from enough

Producer

TopicPublishInfo topicPublishInfo = this.tryToFindTopicPublishInfo(msg.getTopic()); // from local cache or name server
MessageQueue mq = topicPublishInfo.selectOneMessageQueue(lastBrokerName);
sendResult = this.sendKernelImpl(msg, mq, communicationMode, sendCallback, timeout);

Transaction

// 2PC, 2 messages
// Phase 1
producer group writes redolog
producer group sends a message (type=TransactionPreparedType) to broker
broker appends it to CommitLog and returns the MessageId
broker does not append it to the consume queue
// Phase 2
producer group writes redolog
producer group sends a message (type=TransactionCommitType, msgId=$msgId) to broker
broker finds the message with msgId in CommitLog, clones it, and appends the clone to CommitLog (type=TransactionCommitType|TransactionRollbackType)
if type == TransactionCommitType {
    broker appends the commit log offset to the consume queue
}

State Table

Stored on the broker; scanned once per minute by default

24B, mmap
+-----------------+------+-----------+-----------------------+--------------+
| CommitLogOffset | Size | Timestamp | ProducerGroupHashcode | State(P/C/R) |
+-----------------+------+-----------+-----------------------+--------------+
a prepare message inserts a row into the table
a commit/rollback message updates the row

For in-doubt transactions, the broker sends a CHECK_TRANSACTION_STATE request to a randomly picked machine in the producer group
The producer group locates the transaction state via its redolog (mmap)
Producer group information is stored in the name server
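The state-table lifecycle above (prepare inserts, commit/rollback updates, periodic sweep for in-doubt rows) can be sketched in Go; all type and method names here are illustrative, not RocketMQ's:

```go
package main

import "fmt"

// TxState models the P/C/R column of the state table.
type TxState byte

const (
	Prepared TxState = iota
	Committed
	Rolledback
)

type StateRow struct {
	CommitLogOffset int64
	ProducerGroup   uint32
	State           TxState
}

type StateTable struct {
	rows map[int64]*StateRow
}

// Prepare inserts a row for a TransactionPreparedType message.
func (t *StateTable) Prepare(offset int64, group uint32) {
	t.rows[offset] = &StateRow{offset, group, Prepared}
}

// Resolve updates the row on a commit or rollback message.
func (t *StateTable) Resolve(offset int64, commit bool) {
	if r, ok := t.rows[offset]; ok {
		if commit {
			r.State = Committed
		} else {
			r.State = Rolledback
		}
	}
}

// Pending returns the offsets of unresolved transactions: the candidates
// for a CHECK_TRANSACTION_STATE callback to the producer group.
func (t *StateTable) Pending() []int64 {
	var out []int64
	for off, r := range t.rows {
		if r.State == Prepared {
			out = append(out, off)
		}
	}
	return out
}

func main() {
	st := &StateTable{rows: map[int64]*StateRow{}}
	st.Prepare(100, 1)
	st.Prepare(200, 1)
	st.Resolve(100, true)
	fmt.Println(len(st.Pending())) // 1
}
```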

Problems

  • The producer is no longer an ordinary client; it has become a server (the TC) and must not be shut down casually
  • What if the machine in the producer group that wrote the redolog dies?

HA

Granularity is only at the broker level, whereas Kafka's HA is per partition


architecture design checklist

archeck

demo


kateway replay messages

Issue

Consumers need to replay/fast-forward messages, and kateway currently supports this:
the user sets the offset to a specified position in the web console.

But because kateway on that machine keeps consuming the incoming stream, its checkpoint will overwrite the manually-set offset.
So the user must first shut down the consuming process, operate in the web console, then restart it: not user friendly.
Improve this without hurting performance.

Solution

_, stat, err := cg.kz.conn.Get(path)
switch err {
case zk.ErrNoNode:
	return cg.kz.create(path, data, false)
case nil:
	if cg.lastVer == -1 {
		// first offset commit in this process
		cg.lastVer = stat.Version
	} else if cg.lastVer != stat.Version {
		// user manually reset the offset checkpoint
		return ErrRestartConsumerGroup
	}
	// The user might hit "replay" right after our Get; the CAS on
	// stat.Version below catches that race.
	newStat, err := cg.kz.conn.Set(path, data, stat.Version)
	if err == nil {
		cg.lastVer = newStat.Version
	}
	return err
default:
	return err
}

how DBMS works

Basics

dbms

Undo log

  • Oracle and MySQL use similar mechanisms
  • In MS SQL Server it is called the transaction log
  • PostgreSQL has no undo log; it relies on MVCC, storing multiple versions of each row in the table itself

Redo log

  • Oracle and MySQL use similar mechanisms
  • In MS SQL Server it is called the transaction log
  • In PostgreSQL it is called the WAL

Query Optimization

Mostly based on Selinger's paper: a dynamic programming algorithm that decomposes the problem into 3 subproblems

  • cost estimation
    measured in terms of I/O and CPU cost
  • relational equivalences that define a search space
  • cost-based search

Concurrency Control

Gray's paper

  • Distinguishes fine-grained and coarse-grained locks
    A database is a hierarchical structure
  • Introduced multiple isolation levels
    Originally, serializable isolation was implemented with 2PL

Database Recovery

IBM's ARIES algorithm (1992), Algorithms for Recovery and Isolation Exploiting Semantics
ARIES can only update the data in-place after the log reaches storage
It guarantees that during recovery, committed transactions are redone and uncommitted transactions are undone
The redo log is physical; the undo log is logical

No Force, Steal

  • database need not write dirty pages to disk at commit time
    Thanks to the redo log, updated pages are written to disk lazily after commit
    No Force
  • database can flush dirty pages to disk at any time
    Thanks to the undo log, uncommitted (dirty) pages can be written to disk by the buffer manager
    Steal

ARIES keeps an LSN on each page; the disk page is the basic unit of data management and recovery, and page writes are atomic

ARIES crash recovery has 3 phases

  • analysis phase
    scans forward to determine the winners & losers
  • redo phase
    with Force (flushing dirty pages before commit), the redo phase would be unnecessary
    repeat history
  • undo phase
    scans backward to undo the losers

ARIES data structures

  • xaction table
  • dirty page table
  • checkpoint

Example

After a crash, we find the following log:
0 BEGIN CHECKPOINT
5 END CHECKPOINT (EMPTY XACT TABLE AND DPT)
10 T1: UPDATE P1 (OLD: YYY NEW: ZZZ)
15 T1: UPDATE P2 (OLD: WWW NEW: XXX)
20 T2: UPDATE P3 (OLD: UUU NEW: VVV)
25 T1: COMMIT
30 T2: UPDATE P1 (OLD: ZZZ NEW: TTT)
Analysis phase:
Scan forward through the log starting at LSN 0.
LSN 5: Initialize XACT table and DPT to empty.
LSN 10: Add (T1, LSN 10) to XACT table. Add (P1, LSN 10) to DPT.
LSN 15: Set LastLSN=15 for T1 in XACT table. Add (P2, LSN 15) to DPT.
LSN 20: Add (T2, LSN 20) to XACT table. Add (P3, LSN 20) to DPT.
LSN 25: Change T1 status to "Commit" in XACT table
LSN 30: Set LastLSN=30 for T2 in XACT table.
Redo phase:
Scan forward through the log starting at LSN 10.
LSN 10: Read page P1, check PageLSN stored in the page. If PageLSN<10, redo LSN 10 (set value to ZZZ) and set the page's PageLSN=10.
LSN 15: Read page P2, check PageLSN stored in the page. If PageLSN<15, redo LSN 15 (set value to XXX) and set the page's PageLSN=15.
LSN 20: Read page P3, check PageLSN stored in the page. If PageLSN<20, redo LSN 20 (set value to VVV) and set the page's PageLSN=20.
LSN 30: Read page P1 if it has been flushed, check PageLSN stored in the page. It will be 10. Redo LSN 30 (set value to TTT) and set the page's PageLSN=30.
Undo phase:
T2 must be undone. Put LSN 30 in ToUndo.
Write Abort record to log for T2
LSN 30: Undo LSN 30 - write a CLR for P1 with "set P1=ZZZ" and undonextLSN=20. Write ZZZ into P1. Put LSN 20 in ToUndo.
LSN 20: Undo LSN 20 - write a CLR for P3 with "set P3=UUU" and undonextLSN=NULL. Write UUU into P3.
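The analysis phase of the example above can be sketched in Go. This is a simplified model under stated assumptions: only update and commit records, no checkpoint or end records, and `recLSN` is the first LSN that touched a page:

```go
package main

import "fmt"

// Rec is one log record; Page is empty for commit records.
type Rec struct {
	LSN    int
	Txn    string
	Page   string
	Commit bool
}

// analysis scans the log forward, rebuilding the transaction table
// (last LSN and status per txn) and the dirty page table (recLSN per page).
func analysis(records []Rec) (lastLSN map[string]int, status map[string]string, dpt map[string]int) {
	lastLSN, status, dpt = map[string]int{}, map[string]string{}, map[string]int{}
	for _, r := range records {
		if r.Commit {
			status[r.Txn] = "commit"
			continue
		}
		lastLSN[r.Txn] = r.LSN
		status[r.Txn] = "active"
		if _, seen := dpt[r.Page]; !seen {
			dpt[r.Page] = r.LSN // recLSN: first LSN that may have dirtied the page
		}
	}
	return
}

func main() {
	records := []Rec{
		{10, "T1", "P1", false},
		{15, "T1", "P2", false},
		{20, "T2", "P3", false},
		{25, "T1", "", true},
		{30, "T2", "P1", false},
	}
	lastLSN, status, dpt := analysis(records)
	fmt.Println(lastLSN["T2"], status["T1"], dpt["P1"]) // 30 commit 10
}
```

Matching the walkthrough above: T1 ends committed (a winner), T2 is still active with LastLSN=30 (a loser to undo), and P1's recLSN stays 10 even after LSN 30 touches it again.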

ARIES was designed for spinning disks (sequential writes), but the cost is obvious: modifying 1 byte requires writing 1B of redo + 1B of undo + 1B of page = 3B
What if we did in-place updates on SSD?

Distributed

mid-1970s: 2PC, where any single participant can veto the commit

References

https://blog.acolyer.org/2016/01/08/aries/
http://cseweb.ucsd.edu/~swanson/papers/SOSP2013-MARS.pdf
https://www.cs.berkeley.edu/~brewer/cs262/Aries.pdf


scalability papers

http://www.perfdynamics.com/Manifesto/USLscalability.html


PostgreSQL MVCC

Internals

Unlike MySQL, which records uncommitted changes in the undo log, PostgreSQL stores all row versions in the table data structure.

Each row has 2 hidden fields

  • Tmin: id of the txn that inserted the row
  • Tmax: id of the txn that deleted the row
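A minimal Go sketch of how Tmin/Tmax drive visibility, under simplifying assumptions (a committed-txn set and a snapshot txn id; real PostgreSQL visibility rules with hint bits and in-progress lists are more involved):

```go
package main

import "fmt"

// RowVersion carries the two hidden fields; Tmax == 0 means "not deleted".
type RowVersion struct {
	Tmin, Tmax int
}

// visible reports whether a row version should be seen by a snapshot:
// inserted by an earlier committed txn, and not deleted by one.
func visible(r RowVersion, snapshotTxn int, committed map[int]bool) bool {
	if !committed[r.Tmin] || r.Tmin >= snapshotTxn {
		return false // inserted by an uncommitted or later txn
	}
	if r.Tmax != 0 && committed[r.Tmax] && r.Tmax < snapshotTxn {
		return false // deleted before our snapshot
	}
	return true
}

func main() {
	committed := map[int]bool{100: true, 105: true}
	fmt.Println(visible(RowVersion{100, 0}, 110, committed))   // true
	fmt.Println(visible(RowVersion{100, 105}, 110, committed)) // false: deleted
	fmt.Println(visible(RowVersion{120, 0}, 110, committed))   // false: future txn
}
```

This also shows why deleted versions can't be reclaimed immediately: an older snapshot may still find them visible, which is VACUUM's job to decide.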

INSERT

insert

DELETE

delete

A DELETE does not physically remove the row right away; the VACUUM process purges it later

UPDATE

update


VNC protocol

WebRTC

WebRTC provides direct data and media stream transfer between two browsers without an external server involved: P2P

After clicking the "Screen share" button in the browser

// sender
take screenshots via the OS API and send them at a given FPS
// optimization: split the screen into chunks and only build frames from chunks that changed between timer ticks
before sending, frames are encoded as H.264 or VP8
sent over HTTPS
// receiver
decode the received frames and display them

What WebRTC implements here is read-only screen sharing: the receiver cannot control the sender's screen

Implementation

<body>
<p><input type="button" id="share" value="Screen share" /></p>
<p><video id="video" autoplay /></p>
</body>
<script>
navigator.getUserMedia = navigator.webkitGetUserMedia || navigator.getUserMedia;
$('#share').click(function() {
navigator.getUserMedia({
audio: false ,
video: {
mandatory: {
chromeMediaSource: 'screen' ,
maxWidth: 1280 ,
maxHeight: 720
} ,
optional: [ ]
}
}, function(stream) {
// we've got media stream
// so the received stream can be transmitted via WebRTC the same way as web camera and easily played in <video> component on the other side
document.getElementById('video').src = window.URL.createObjectURL(stream);
} , function() {
alert('Error. Try in latest Chrome with Screen sharing enabled in about:flags.');
})
})
</script>

VNC

Remote Frame Buffer; works with X11, Windows, Mac
The endpoint the remote user sits at (display, keyboard, mouse) is called the client; the endpoint supplying framebuffer updates is called the server

显示协议

pixel(x, y) => encoded pixel data

C->S message types

SetEncodings

Raw, CopyRect, RRE, Hextile, TRLE, ZRLE

FramebufferUpdateRequest

The most important display message: it tells the server which regions of the screen to send back

client.send(messageType, incremental, x, y, width, height) => server
// incremental>0: send the region back only if its content changed; if unchanged, send nothing
server.reply(messageType, rectangleN, [{x, y, width, height, color}, ...]) => client

KeyEvent

Keyboard actions on the client side

PointerEvent

Mouse/pointer actions on the client side

vnc browser

http://guacamole.incubator.apache.org/
https://github.com/novnc/noVNC

References

http://www.tuicool.com/articles/Rzqumu
https://github.com/macton/hterm
chrome://flags/#enable-usermedia-screen-capture


LSM for SSD

Basics

leveldb

  • (immutable) memtable: sorted skiplist
  • SSTable: sorted string table
  • SSTables in L0 may have overlapping keys because they are flushed directly from immutable memtables

SSTable file

+-------------------+-------------------------+------------+
| index block(16KB) | bloom-filter block(4KB) | data block |
+-------------------+-------------------------+------------+

Get(key)

locate key in memtable
if found then return
locate key in immutable memtable
if found then return
for level := 0; level <= 6; level++ {
    // for L0, every SSTable must be checked, because their keys overlap
    // to contain this, besides the Bloom filter, leveldb caps the number of L0 files:
    // once it exceeds 8, a compaction (L0 -> L1) is triggered
    // for the other levels, files are sorted and disjoint, so the right SSTable can be located directly
    //
    // worst case, Get(key) reads 8 L0 files plus one file for each of L1-L6: 14 files
    locate key in SSTable in $level
    if found then return
}

SSD

A mainstream SSD such as the Samsung 960 Pro delivers ~440K random reads/s
with block size = 4KB

LSM was designed for spinning disks; on SSD there is room for optimization, since random reads need not be feared as much

Optimization

The bulk of an LSM-Tree's cost is compaction (merge sort), which causes I/O amplification (~50x):

  • read many files into memory
  • sort them
  • write them back to disk

io

To optimize compaction, make the LSM Tree smaller; RocksDB does this via compression.

On SSD, consider separating keys from values: keep only the sorted keys and value pointers in the LSM Tree, and store the values directly in the WAL.

key: 16B
pointer(value): 16B

2M k/v pairs need 64MB
2B k/v pairs need 64GB
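The key/value separation above (the WiscKey idea from the referenced paper) can be sketched in Go; the `Store` type and a map standing in for the LSM tree are simplifications for illustration:

```go
package main

import "fmt"

// ValuePointer is what the LSM tree stores instead of the value itself.
type ValuePointer struct {
	Offset int64 // position in the value log
	Size   int32
}

type Store struct {
	vlog  []byte                  // append-only value log (the WAL can double as it)
	index map[string]ValuePointer // stands in for the LSM tree of sorted keys
}

// Put appends the value to the log and records only a small pointer in the
// index, so compaction never has to rewrite the (large) values.
func (s *Store) Put(key string, value []byte) {
	s.index[key] = ValuePointer{int64(len(s.vlog)), int32(len(value))}
	s.vlog = append(s.vlog, value...)
}

// Get does one index lookup plus one random read into the value log;
// on SSD that extra random read is cheap.
func (s *Store) Get(key string) ([]byte, bool) {
	p, ok := s.index[key]
	if !ok {
		return nil, false
	}
	return s.vlog[p.Offset : p.Offset+int64(p.Size)], true
}

func main() {
	s := &Store{index: map[string]ValuePointer{}}
	s.Put("k1", []byte("hello"))
	v, _ := s.Get("k1")
	fmt.Println(string(v)) // hello
}
```

With 16B keys and 16B pointers, each entry costs 32B in the tree, which is where the 2M entries = 64MB and 2B entries = 64GB figures come from.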

Reference

https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf


InnoDB MVCC

Basic

InnoDB implements txn rollback and MVCC through the undo log, while concurrency control (isolation) is implemented with locks.
The undo log is split into insert undo and update undo (delete being a special kind of update). On rollback:

  • an insert just discards its insert undo log
  • an update follows DB_ROLL_PTR / DB_TRX_ID to locate the pre-modification version and restores it

Unlike the redo log, there is no standalone undo log file on disk: all undo logs live in the main ibd data file (the tablespace), even when the client enables file-per-table.

Internal storage

InnoDB reserves hidden fields (system columns) for every row

  • DB_ROW_ID
  • DB_TRX_ID
  • DB_ROLL_PTR
typedef ib_uint64_t ib_id_t;
typedef ib_id_t row_id_t;
typedef ib_id_t trx_id_t;
typedef ib_id_t roll_ptr_t;

syscol

How the undo log works

  • the transaction modifies the original data under an exclusive lock
  • the pre-modification data is stored in the undo log, linked to the main record via the roll pointer
  • on success (commit) nothing else is done; on failure (rollback) the data in the undo log is restored

Demo

demo
demo
demo

InnoDB runs a purge thread that finds undo logs older than the oldest active transaction and deletes them

Issues

Rollback cost

When a transaction commits normally, InnoDB only needs to set its state to COMMIT, with no extra work.
But rolling back a transaction that touched many rows can be very expensive.

write skew

MVCC implemented via the undo log works fine when modifying a single row, but problems can arise across multiple rows

begin;
update table set col1=2 where id=1; // succeeds; an undo log record is created
update table set col2=3 where id=2; // fails
rollback;

When rolling back row(id=1), since it is not locked, another txn may already have modified it; the rollback would then clobber an already-committed transaction

Solving this requires control at the application layer

Another example

Regulations allow a family at most 3 pets (the constraint). Alice and Bob are one family and currently have dog(1) + cat(1).
If, concurrently, Alice buys another cat while Bob buys another dog, the two transactions exhibit write skew,
because under repeatable read a transaction becomes visible to other txns only after it commits.
TxnAlice    TxnBob
cat=cat+1   dog=dog+1
Both transactions succeed, yet together they violate the constraint
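The anomaly above can be sketched as plain Go: both transactions check the constraint against the same stale snapshot, write disjoint rows, and each commit is individually "legal" (the pet counts and `tryBuy` helper are of course illustrative):

```go
package main

import "fmt"

// tryBuy applies the constraint check (at most 3 pets) against a
// possibly stale snapshot total.
func tryBuy(snapshotTotal int) bool {
	return snapshotTotal+1 <= 3
}

func main() {
	dogs, cats := 1, 1      // committed state: dog(1) + cat(1)
	snapshot := dogs + cats // both txns read this same snapshot

	if tryBuy(snapshot) { // TxnAlice buys a cat
		cats++
	}
	if tryBuy(snapshot) { // TxnBob buys a dog, unaware of Alice's txn
		dogs++
	}
	fmt.Println(dogs+cats, dogs+cats <= 3) // 4 false: constraint violated
}
```

A serializable scheduler would have aborted one of the two; under snapshot-style isolation the application must enforce such cross-row constraints itself (e.g. with select ... for update).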

Under the RR isolation level, plain selects take no locks and use MVCC for consistent reads, i.e. snapshot reads.
update, insert, delete, select ... for update, and select ... lock in share mode all take locks and read the current version: a READ COMMITTED-style read.
Except for lock in share mode, which takes an S lock, the rest take X locks.

References

https://dev.mysql.com/doc/refman/5.7/en/innodb-locks-set.html
https://blog.jcole.us/innodb/
http://jimgray.azurewebsites.net/WICS_99_TP/
