CryptDB

2017-06-26

database

一个DB Proxy，对字段名称、记录都加密，Google根据CryptDB的设计开发了Encrypted BigQuery client
仍然存在数据泄露问题

deployment

CryptDB enables most DBMS functionality with a performance overhead of under 30%
Arx is built on top of MongoDB and reports a performance overhead of approximately 10%
OSPIR-EXT/SisoSPIR support MySQL
BlindSeer

db crypt

References

https://github.com/CryptDB/cryptdb
http://people.csail.mit.edu/nickolai/papers/raluca-cryptdb.pdf
https://people.csail.mit.edu/nickolai/papers/popa-cryptdb-tr.pdf

Bloom Filter

2017-06-26

algorithm

by Burton Bloom in 1970

false positive possible, false negative impossible

Dynamo, Postgresql, HBase, Bitcoin都广泛使用

BloomFilter

Why Do Computers Stop

2017-06-26

algorithm

Jim Gray, June 1985, Tandem Technical report 85.7

Terms

Reliability != Availability

availability is doing the right thing within the specified response time
- Availability = MTBF/(MTBF + MTTR)
- 分布式系统下，整体的可用性=各个子系统可用性的乘积
- 模块化使得局部failure不会影响全部，冗余减少MTTR
- 磁盘的MTBF是1万小时，即1年；如果两张盘完全独立冗余，假设MTBR是24h，那么整体的MTBF是1000年
reliability is not doing the wrong thing

Report

MTBF

设计容错系统的方法

process pair

Lockstep
一个执行失败，就启用另外一个
容忍了硬件故障，但没有解决Heisenbugs(难以重现的bug)
State Checkpointing
primary通过消息同步把请求发送给backup，如果primary挂了，切到backup
通过序列号来排重和发现消息丢失
实现起来比较困难
Automatic Checkpointing
kernal自动管理checkpoint，而不是让上层应用管理
发送的消息很多
Delta Checkpointing
发给backup的是logical updates，而不是physical updates
性能更好，消息更少，但也更难实现
Persistence
失败的时候可能丢失状态
需要加入事务来提高可靠性

Fault-tolerent Communication

硬件，通过multiple data paths with independent failure modes
软件，引入session概念(类似tcp)

Fault-tolerent Storage

2份复制正好，3份不一定提高MTBF，因为其他失败因素会变为主导
分布式复制
把数据分片，会限制scope of failure

References

http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf

Amazon Aurora

2017-06-23

cloud, database

Background

Launched in 2014 for MySQL, and in 2016 for PostgreSQL.

Aurora

基于shared disk的架构，storage共享来解决一致性问题，把计算节点与存储节点解耦，MySQL本身无状态，一写多读，S3做备份，本质上还是单机数据库

无法访问其binlog
automatic storage scaling up to 64 TB, SSD
数据传输通过SSL(AES-256)
支持 100,000 writes/s, 500,000 read/s
费用
- 8 vCPU/61GB $1.16/h
- 16vCPU/122GB $2.32/h
- 32vCPU/244GB $4.62/h
  每个月相当于2万多人民币

Architecture

Aurora Overview(HM is RDS agent)
Aurora IO

Group Commit
ThreadPool

Why not R(2)+W(2)>N(3) quorum?

Aurora采用的是R(3)+W(4)>N(6) 3个AZ(但必须在同一个region)，每个AZ上复制2份
它保证

read与write集合是相交的
W>N/2，防止写冲突

原因

写
- 2+2>3
  只能容忍一个AZ crash
- 3+4>6
  只能容忍一个AZ crash
读
- 2+2>3
  只能容忍一个AZ crash
- 3+4>6
  能容忍一个AZ crash，此外允许另外一个node crash，即AZ+1
  为什么这个重要？因为data durability是指写进去的数据能读出来，它提高了durability

Why segmented storage

如果一个AZ crash了，就会破坏write quorum，降低availability，为了提高availability(99.99%)，他们采用的方法是降低MTTR

类似ES，数据存储(ibdata)被segment化，each 10GB，total max 64TB，每个segment复制6份(3 AZ)，10GB是为了能控制MTTR在10s
segment就成为了independent background noise failure and repair，后台有应用不停地检查、修复segment错误，如果不segment，那么修复成本很高
同时，是考虑到底层存储机制，做线性扩容方便

Scale

scale write
只能把master机器升级到更高的ec2: scale up, not scale out
scale read
add more read replicas

References

http://www.allthingsdistributed.com/files/p1041-verbitski.pdf
https://www.percona.com/blog/2016/05/26/aws-aurora-benchmarking-part-2/
http://www.tusacentral.net/joomla/index.php/mysql-blogs/175-aws-aurora-benchmarking-blast-or-splash.html

nagles and delayed ack

2017-06-23

algorithm

Case

同时开启情况下

client.send(1600B) // 1600>1460，defragment into Packet(1460)+Packet(140)
client.sendPacket(1460)
server.recv(1460) // no push, server awaiting the next 140
// delayed ack works, so no ack sent s->c
client.sendPacket(140) // because of nagles and has unacked data, wait till 1) data>=1460 or 2) get ack
                       // i,e. will not send packet(140)
... // server ack delay timeout
server.ack(1460)
client.recv(ack)
client.sendPacket(140)

delayed ack

Linux最小值20ms，它是根据RTO、RTT动态计算出来的

Nagles

第一次发包，无论多大，立即发送
只要发出的包都被对端ack了就可以发送了，无需等待
如果没有ack，就等buffer里的包凑足MSS一起发，即它只允许1个未ack的包存在于网络，基于字节的“停-等”

if there is new data to send
  if the window size >= MSS and available data is >= MSS
        send complete MSS segment now
  else
    if there is unacked data still in the buffer
          enqueue data in the buffer until an ack is received
    else
          send data immediately
    end if
  end if
end if

TCP_CORK vs nagles

cork：塞子

cork是一种加强的nagles算法，但它ignore ack，即使所有ack都已经收到，只要数据包不够大而且时间没到，依然不发送
cork是为了提高网络利用率，nagles是为了避免因为过多小包(payload占header比例过小)引起的网络拥堵

DB Storage Structures

2017-06-22

storage

KV

任何storage structure，数据都可以用k=>v来表示，不仅NoSQL，RDBMS也一样
例如InnoDB的primary key就是

1	primary_key => [column1, column2, ...]

secondary index也一样，因此在insert/update时有index maintaenance overhead，保持各个index的一致性

B-Tree

Designed for optimal data retrieval performance, not data storage.

RDBMS, LMDB, MongoDB

充分利用了read ahead技术

btree throughput

LSM Hash Table

bitcask

不支持range，所有index在内存的hash table里
算是一个简化版的LSM-Tree

LSM-Tree

每个SSTable的bloom filter只能帮助individual key lookup，对range query没用

Fractal-Tree

与B-Tree类似，但通过buffer changes/data compression大大降低了disk random IO
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.76.762&rep=rep1&type=pdf

Kafka Streams

2017-06-22

PubSub

Kafka 0.10提供，可以部分替代SparkStreaming/Storm/Samza/Flink，好处是仅仅依赖kafka，全部通过SDK实现流式处理

通过KeyValueStore interface实现stateful processor，目前有2个实现

in memory
RocksDB

其实跟dbus的设计差不多

JsonDeserializer<Purchase> purchaseJsonDeserializer = new JsonDeserializer<>(Purchase.class);
JsonSerializer<Purchase> purchaseJsonSerializer = new JsonSerializer<>();
JsonSerializer<RewardAccumulator> rewardAccumulatorJsonSerializer = new JsonSerializer<>();
JsonSerializer<PurchasePattern> purchasePatternJsonSerializer = new JsonSerializer<>();
StringDeserializer stringDeserializer = new StringDeserializer();
StringSerializer stringSerializer = new StringSerializer();
TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.addSource("SOURCE", stringDeserializer, purchaseJsonDeserializer, "src-topic")
    .addProcessor("PROCESS", CreditCardAnonymizer::new, "SOURCE")
    .addProcessor("PROCESS2", PurchasePatterns::new, "PROCESS")
    .addProcessor("PROCESS3", CustomerRewards::new, "PROCESS")
    // kafka(src-topic) -> SOURCE -> PROCESS -+-> PROCESS2 
    //                                        +-> PROCESS3
    .addSink("SINK", "patterns", stringSerializer, purchasePatternJsonSerializer, "PROCESS2")
    .addSink("SINK2", "rewards", stringSerializer, rewardAccumulatorJsonSerializer, "PROCESS3")
    .addSink("SINK3", "purchases", stringSerializer, purchaseJsonSerializer, "PROCESS");
KafkaStreams streaming = new KafkaStreams(topologyBuilder, streamingConfig);
streaming.start();

References

https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client

2 phase commit failures

2017-06-14

一致性

Node Failure Models

fail-stop
crash and never recover
fail-recover
crash and later recover
byzantine failure

Cases

2 phase commit，n个节点，那么需要3n个消息交换

coordinator发送proposal后crash
- 有的node收到，有的没收到
- 收到Proposal的node被block forever，它可能已经vote commit了
  不能简单地timeout/abort，因为coordinator可能随时recover并启动phase2 commit
  这个txn就只能blocked by coordinator，cannot make any progress
- 解决办法
  引入coordinator的watchdog机制，它发现coordinator crash后，接管
  Phase1. 先询问每个participants，已经vote commit还是vote abort还是没有vote
  Phase2. 通知每个participant Commit/Abort
  但仍有局限，如果有个participant crash了，那么Phase1无法确认
worse case
coordinator本身也是participant

3PC

在propose和commit这2个phase中间，加了个prepare to commit

如果coordinator在prepare to commit或者proposal阶段crash
trx aborted
如果coordinator在commit阶段crash
nodes will timeout waiting for the commit phase and commit the trx

graph database

2017-06-14

database

graph index
graph operations

HDD Stats Q1 2017

2017-06-14

storage

HDD Fail Rate

Backblaze Storage Pod

把45块SATA盘存放到一台4U机器里，其中15块搞RAID6，ext4，可以空间是裸盘空间的87%，所有访问通过tomcat HTTPS
没有iSCSI，没有NFS，没有Fibre Channel
180TB成本$9,305，即每GB $0.0517

storage pod

Backblze Vaults cloud backup service

99.999999% annual durability
已经存储150PB，由1000个Storage Pod组成，40,000块盘，每天有10块盘损坏

每个Vault由20个Storage Pod组成，每个Pod有45块盘，即每个Vault有900块盘，一块盘如果是4TB，那么每个Vault可以存3.6PB
每个磁盘使用ext4文件系统，每个Vault有个7位数字id，例如555-1021，前3位代表data center id，后4位是vault id
有个类似name service的服务，client备份前先request name service获取vault id，之后client直接与相应的vault进行backup IO(https)
pod

每个文件被分成20 shards = 17 data shars + 3 parity shards，存放在ext4
每个shard有checksum，如果损坏，可以从其他17个shards恢复
如果某个Vault有1个pod crash了，backup write的parity会变成2，如果3个pod坏了，那么也可以写，但parity就不存在了，如果此时再坏一个pod，数据无法恢复了

References

https://www.backblaze.com/blog/vault-cloud-storage-architecture/
https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/

Related

References

Terms

Report

设计容错系统的方法

Fault-tolerent Communication

Fault-tolerent Storage

References

Background

Aurora

Architecture

Why not R(2)+W(2)>N(3) quorum?

Why segmented storage

Scale

References

Case

delayed ack

Nagles

TCP_CORK vs nagles

KV

B-Tree

LSM Hash Table

LSM-Tree

Fractal-Tree

References

Node Failure Models

Cases

3PC

Backblaze Storage Pod

Backblze Vaults cloud backup service

References