Bloom Filter

by Burton Bloom in 1970

false positive possible, false negative impossible

Dynamo, Postgresql, HBase, Bitcoin都广泛使用

BloomFilter

Share Comments

Why Do Computers Stop

Jim Gray, June 1985, Tandem Technical report 85.7

Terms

Reliability != Availability

  • availability is doing the right thing within the specified response time
    • Availability = MTBF/(MTBF + MTTR)
    • 分布式系统下,整体的可用性=各个子系统可用性的乘积
    • 模块化使得局部failure不会影响全部,冗余减少MTTR
    • 磁盘的MTBF是1万小时,即1年;如果两张盘完全独立冗余,假设MTBR是24h,那么整体的MTBF是1000年
  • reliability is not doing the wrong thing

Report

MTBF

设计容错系统的方法

process pair

  • Lockstep
    一个执行失败,就启用另外一个
    容忍了硬件故障,但没有解决Heisenbugs(难以重现的bug)
  • State Checkpointing
    primary通过消息同步把请求发送给backup,如果primary挂了,切到backup
    通过序列号来排重和发现消息丢失
    实现起来比较困难
  • Automatic Checkpointing
    kernal自动管理checkpoint,而不是让上层应用管理
    发送的消息很多
  • Delta Checkpointing
    发给backup的是logical updates,而不是physical updates
    性能更好,消息更少,但也更难实现
  • Persistence
    失败的时候可能丢失状态
    需要加入事务来提高可靠性

Fault-tolerent Communication

硬件,通过multiple data paths with independent failure modes
软件,引入session概念(类似tcp)

Fault-tolerent Storage

2份复制正好,3份不一定提高MTBF,因为其他失败因素会变为主导
分布式复制
把数据分片,会限制scope of failure

References

http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf

Share Comments

Amazon Aurora

Background

Launched in 2014 for MySQL, and in 2016 for PostgreSQL.

Aurora

基于shared disk的架构,storage共享来解决一致性问题,把计算节点与存储节点解耦,MySQL本身无状态,一写多读,S3做备份,本质上还是单机数据库

  • 无法访问其binlog
  • automatic storage scaling up to 64 TB, SSD
  • 数据传输通过SSL(AES-256)
  • 支持 100,000 writes/s, 500,000 read/s
  • 费用
    • 8 vCPU/61GB $1.16/h
    • 16vCPU/122GB $2.32/h
    • 32vCPU/244GB $4.62/h
      每个月相当于2万多人民币

Architecture

Aurora Overview(HM is RDS agent)
Aurora IO
IO
Group Commit
ThreadPool

Why not R(2)+W(2)>N(3) quorum?

Aurora采用的是R(3)+W(4)>N(6) 3个AZ(但必须在同一个region),每个AZ上复制2份
它保证

  • read与write集合是相交的
  • W>N/2,防止写冲突

原因

    • 2+2>3
      只能容忍一个AZ crash
    • 3+4>6
      只能容忍一个AZ crash
    • 2+2>3
      只能容忍一个AZ crash
    • 3+4>6
      能容忍一个AZ crash,此外允许另外一个node crash,即AZ+1
      为什么这个重要?因为data durability是指写进去的数据能读出来,它提高了durability

Why segmented storage

如果一个AZ crash了,就会破坏write quorum,降低availability,为了提高availability(99.99%),他们采用的方法是降低MTTR

类似ES,数据存储(ibdata)被segment化,each 10GB,total max 64TB,每个segment复制6份(3 AZ),10GB是为了能控制MTTR在10s
segment就成为了independent background noise failure and repair,后台有应用不停地检查、修复segment错误,如果不segment,那么修复成本很高
同时,是考虑到底层存储机制,做线性扩容方便

Scale

  • scale write
    只能把master机器升级到更高的ec2: scale up, not scale out
  • scale read
    add more read replicas

References

http://www.allthingsdistributed.com/files/p1041-verbitski.pdf
https://www.percona.com/blog/2016/05/26/aws-aurora-benchmarking-part-2/
http://www.tusacentral.net/joomla/index.php/mysql-blogs/175-aws-aurora-benchmarking-blast-or-splash.html

Share Comments

nagles and delayed ack

Case

同时开启情况下

1
2
3
4
5
6
7
8
9
10
client.send(1600B) // 1600>1460,defragment into Packet(1460)+Packet(140)
client.sendPacket(1460)
server.recv(1460) // no push, server awaiting the next 140
// delayed ack works, so no ack sent s->c
client.sendPacket(140) // because of nagles and has unacked data, wait till 1) data>=1460 or 2) get ack
// i,e. will not send packet(140)
... // server ack delay timeout
server.ack(1460)
client.recv(ack)
client.sendPacket(140)

delayed ack

Linux最小值20ms,它是根据RTO、RTT动态计算出来的

Nagles

  • 第一次发包,无论多大,立即发送
  • 只要发出的包都被对端ack了就可以发送了,无需等待
  • 如果没有ack,就等buffer里的包凑足MSS一起发,即它只允许1个未ack的包存在于网络,基于字节的“停-等”
1
2
3
4
5
6
7
8
9
10
11
if there is new data to send
if the window size >= MSS and available data is >= MSS
send complete MSS segment now
else
if there is unacked data still in the buffer
enqueue data in the buffer until an ack is received
else
send data immediately
end if
end if
end if

TCP_CORK vs nagles

cork:塞子

cork是一种加强的nagles算法,但它ignore ack,即使所有ack都已经收到,只要数据包不够大而且时间没到,依然不发送
cork是为了提高网络利用率,nagles是为了避免因为过多小包(payload占header比例过小)引起的网络拥堵

Share Comments

DB Storage Structures

KV

任何storage structure,数据都可以用k=>v来表示,不仅NoSQL,RDBMS也一样
例如InnoDB的primary key就是

1
primary_key => [column1, column2, ...]

secondary index也一样,因此在insert/update时有index maintaenance overhead,保持各个index的一致性

B-Tree

Designed for optimal data retrieval performance, not data storage.

RDBMS, LMDB, MongoDB

充分利用了read ahead技术

btree throughput

LSM Hash Table

bitcask

不支持range,所有index在内存的hash table里
算是一个简化版的LSM-Tree

LSM-Tree

每个SSTable的bloom filter只能帮助individual key lookup,对range query没用

Fractal-Tree

与B-Tree类似,但通过buffer changes/data compression大大降低了disk random IO
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.76.762&rep=rep1&type=pdf

Share Comments

Kafka Streams

Kafka 0.10提供,可以部分替代SparkStreaming/Storm/Samza/Flink,好处是仅仅依赖kafka,全部通过SDK实现流式处理

通过KeyValueStore interface实现stateful processor,目前有2个实现

  • in memory
  • RocksDB

其实跟dbus的设计差不多

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
JsonDeserializer<Purchase> purchaseJsonDeserializer = new JsonDeserializer<>(Purchase.class);
JsonSerializer<Purchase> purchaseJsonSerializer = new JsonSerializer<>();
JsonSerializer<RewardAccumulator> rewardAccumulatorJsonSerializer = new JsonSerializer<>();
JsonSerializer<PurchasePattern> purchasePatternJsonSerializer = new JsonSerializer<>();
StringDeserializer stringDeserializer = new StringDeserializer();
StringSerializer stringSerializer = new StringSerializer();
TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.addSource("SOURCE", stringDeserializer, purchaseJsonDeserializer, "src-topic")
.addProcessor("PROCESS", CreditCardAnonymizer::new, "SOURCE")
.addProcessor("PROCESS2", PurchasePatterns::new, "PROCESS")
.addProcessor("PROCESS3", CustomerRewards::new, "PROCESS")
// kafka(src-topic) -> SOURCE -> PROCESS -+-> PROCESS2
// +-> PROCESS3
.addSink("SINK", "patterns", stringSerializer, purchasePatternJsonSerializer, "PROCESS2")
.addSink("SINK2", "rewards", stringSerializer, rewardAccumulatorJsonSerializer, "PROCESS3")
.addSink("SINK3", "purchases", stringSerializer, purchaseJsonSerializer, "PROCESS");
KafkaStreams streaming = new KafkaStreams(topologyBuilder, streamingConfig);
streaming.start();

References

https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client

Share Comments

2 phase commit failures

Node Failure Models

  • fail-stop
    crash and never recover
  • fail-recover
    crash and later recover
  • byzantine failure

Cases

2 phase commit,n个节点,那么需要3n个消息交换

  • coordinator发送proposal后crash

    • 有的node收到,有的没收到
    • 收到Proposal的node被block forever,它可能已经vote commit了
      不能简单地timeout/abort,因为coordinator可能随时recover并启动phase2 commit
      这个txn就只能blocked by coordinator,cannot make any progress
    • 解决办法
      引入coordinator的watchdog机制,它发现coordinator crash后,接管
      Phase1. 先询问每个participants,已经vote commit还是vote abort还是没有vote
      Phase2. 通知每个participant Commit/Abort
      但仍有局限,如果有个participant crash了,那么Phase1无法确认
  • worse case
    coordinator本身也是participant

Share Comments

graph database

graph index
graph operations

Share Comments

HDD Stats Q1 2017

HDD Fail Rate

Backblaze Storage Pod

把45块SATA盘存放到一台4U机器里,其中15块搞RAID6,ext4,可以空间是裸盘空间的87%,所有访问通过tomcat HTTPS
没有iSCSI,没有NFS,没有Fibre Channel
180TB成本$9,305,即每GB $0.0517

storage pod

Backblze Vaults cloud backup service

99.999999% annual durability
已经存储150PB,由1000个Storage Pod组成,40,000块盘,每天有10块盘损坏

每个Vault由20个Storage Pod组成,每个Pod有45块盘,即每个Vault有900块盘,一块盘如果是4TB,那么每个Vault可以存3.6PB
每个磁盘使用ext4文件系统,每个Vault有个7位数字id,例如555-1021,前3位代表data center id,后4位是vault id
有个类似name service的服务,client备份前先request name service获取vault id,之后client直接与相应的vault进行backup IO(https)
pod

每个文件被分成20 shards = 17 data shars + 3 parity shards,存放在ext4
每个shard有checksum,如果损坏,可以从其他17个shards恢复
如果某个Vault有1个pod crash了,backup write的parity会变成2,如果3个pod坏了,那么也可以写,但parity就不存在了,如果此时再坏一个pod,数据无法恢复了

References

https://www.backblaze.com/blog/vault-cloud-storage-architecture/
https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/

Share Comments

Reed-Solomon erasure code

Intro

纠错码在RAID、备份、冷数据、历史数据存储方面使用广泛
也有人利用它把一份数据分散到多个cloud provider(e,g. S3,Azure,Rackspace),消除某个供应商的依赖: cloud of cloud

Usage

一份文件,大小x,分成n个数据块+k个校验块,能容忍任意k个数据块或者校验块错误,即至少要n个块是有效的

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
$encode -data 4 -par 2 README.md # n=4 k=2
Opening README.md
File split into 6 data+parity shards with 358 bytes/shard.
Writing to README.md.0
Writing to README.md.1
Writing to README.md.2
Writing to README.md.3
Writing to README.md.4 # README.md.4 and README.md5 are parity shards
Writing to README.md.5
$rm -f README.md.4 README.md.5 # remove the 2 parity shards
$decode -out x README.md
Opening README.md.0
Opening README.md.1
Opening README.md.2
Opening README.md.3
Opening README.md.4
Error reading file open README.md.4: no such file or directory
Opening README.md.5
Error reading file open README.md.5: no such file or directory
Verification failed. Reconstructing data
Writing data to x
$diff x README.md
$ # all the same

Algorithm

文件内容‘ABCDEFGHIJKLMNOP’,4+2
给定n和k,encoding矩阵是不变的

coding 4=>4+2
deocde after lost 2 shards

References

http://pages.cs.wisc.edu/~yadi/papers/yadi-infocom2013-paper.pdf

Share Comments