geo-distributed storage system

Primer

Network of networks.

Why

  • scale
    one datacenter too small
  • disaster tolerance
    protect data against catastrophes
  • availability
    keep working despite intermittent access problems
  • locality
    serve users from a nearby datacenter

Challenges

  • high latency
  • low bandwidth
  • congestion
    overprovisioning prohibitive
  • network partitions

Replication

      | read-write    | state machine
------+---------------+--------------
sync  | ABD           | Paxos
async | logical clock | CRDT

ABD = Attiya, Bar-Noy, Dolev (the classic quorum-based read/write register algorithm)
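
To make the sync read-write cell concrete, here is a minimal in-process sketch of an ABD-style register in Python (classes and helper names are made up for illustration): both write and read contact a majority, so every read quorum intersects every write quorum and sees the latest timestamp.

class Replica:
    def __init__(self):
        self.value, self.ts = None, (0, 0)          # ts = (counter, writer_id)

    def get(self):
        return self.value, self.ts

    def put(self, value, ts):
        if ts > self.ts:                            # keep only the newest version
            self.value, self.ts = value, ts
        return "ack"

def majority(replicas):
    return replicas[: len(replicas) // 2 + 1]       # sketch: any majority will do

def abd_write(replicas, writer_id, value):
    # phase 1: learn the highest timestamp seen by a majority
    counter = max(r.get()[1][0] for r in majority(replicas))
    ts = (counter + 1, writer_id)
    # phase 2: store (value, ts) on a majority
    for r in majority(replicas):
        r.put(value, ts)

def abd_read(replicas):
    # phase 1: collect (value, ts) from a majority, keep the newest
    value, ts = max((r.get() for r in majority(replicas)), key=lambda vt: vt[1])
    # phase 2: write the newest version back so a later read cannot go backwards
    for r in majority(replicas):
        r.put(value, ts)
    return value

replicas = [Replica() for _ in range(3)]
abd_write(replicas, writer_id=1, value="hello")
print(abd_read(replicas))                           # 'hello'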

Congestion

Motivation

  • bandwidth across datacenters is expensive
  • storage usage subject to spikes

How to provision bandwidth?

  • for peak load
    high cost, low utilization
  • for average load
    network congestion

Solution

  • weak consistency
    async replication
  • priority messages
    should be small

Vivace algorithm

read-write algorithm based on ABD

Algorithm

  • separate the data from the timestamp,
    i.e. separate data from metadata, and give the metadata higher priority
  • replicate data locally first
  • replicate the timestamp remotely with a prioritized msg
  • replicate data remotely in the background

write

1. obtain timestamp ts
2. write (data, ts) to f+1 temporary local replicas (big msg)
3. write only ts to f+1 real replicas (small msg, cross datacenter)
4. in background, send data to real replicas (big msg)

read

1. read ts from f+1 replicas (small msg)
2. read the data associated with the highest ts from 1 replica (big msg, often local)
3. write back ts to f+1 replicas (small msg, cross datacenter)
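
A hedged Python sketch of the write and read procedures above (class and helper names are made up; the actual protocol is in the Vivace paper referenced below). It illustrates the point of the design: only the small timestamp message crosses datacenters on the critical path, while the bulky data is written to nearby temporary replicas first and shipped to the real replicas in the background.

import threading

class Replica:
    def __init__(self):
        self.ts = 0
        self.data = {}                       # versions indexed by timestamp

    def store_ts(self, ts):                  # small, prioritized message
        self.ts = max(self.ts, ts)

    def store_data(self, ts, value):         # big message
        self.data[ts] = value
        self.ts = max(self.ts, ts)

def vivace_write(real, temp_local, f, ts, value):
    for r in temp_local[: f + 1]:            # step 2: data+ts to f+1 temporary local replicas
        r.store_data(ts, value)
    for r in real[: f + 1]:                  # step 3: only ts crosses datacenters (small msg)
        r.store_ts(ts)
    def background():                        # step 4: ship the data itself off the critical path
        for r in real:
            r.store_data(ts, value)
    threading.Thread(target=background).start()

def vivace_read(real, temp_local, f):
    ts = max(r.ts for r in real[: f + 1])    # step 1: read ts from f+1 replicas (small msg)
    holder = next(r for r in real + temp_local if ts in r.data)
    value = holder.data[ts]                  # step 2: data for that ts from one replica (often local)
    for r in real[: f + 1]:                  # step 3: write back ts (small msg)
        r.store_ts(ts)
    return value

real = [Replica() for _ in range(3)]         # replicas spread across datacenters
temp_local = [Replica() for _ in range(2)]   # replicas in the writer's datacenter
vivace_write(real, temp_local, f=1, ts=1, value="hello")   # write step 1 (obtain ts) done elsewhere
print(vivace_read(real, temp_local, f=1))    # 'hello'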

Transaction

snapshot isolation (SI)

total ordering of update transactions

Problems

  • it orders the commit times of all transactions,
    even those that do not conflict with each other
  • it forbids some scenarios we want to allow for efficiency
    (e.g. two non-conflicting transactions committed at different sites must still agree on one global commit order, which forces cross-site coordination)

parallel snapshot isolation (PSI)

causality: if T1 commits at site S before T2 starts at site S then T2 does not commit before T1 at any site

SI                  | PSI
--------------------+----------------------------
1 commit time       | 1 commit time per site
1 timeline          | 1 timeline per site
read from snapshot  | read from snapshot at site
no w-w conflict     | no w-w conflict
                    | causality property

idea1: preferred sites

Each key is assigned a unique preferred site (e.g. for a user in Changsha, the preferred site is the Changsha IDC); at that site, transactions on the key can use fast commit (without cross-site communication).
This is similar to primary/backup, except that preferred sites are not enforced: a key can still be modified at any site.

But this raises problems:

  • what if many sites often modify a key?
  • there is no good way to assign a preferred site to such a key

idea2: CRDT

serializable

Both SI and PSI allow write skew.
Transaction chains [SOSP 2013] can provide serializable isolation efficiently.

References

http://dprg.cs.uiuc.edu/docs/vivace-atc/vivace-atc12.pdf


chain replication

Unlike primary/backup replication, chain replication is a ROWAA (read one, write all available) approach, which gives higher availability than quorum-based replication.
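
A minimal sketch of the idea in Python (the classes are hypothetical, not the paper's pseudocode): writes enter at the head and are forwarded down the chain, the tail acknowledges and serves reads, so a read only ever sees fully replicated updates.

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.successor = None

    def write(self, key, value):
        self.store[key] = value
        if self.successor:                       # forward down the chain
            return self.successor.write(key, value)
        return "ack"                             # tail reached: update is on every node

class Chain:
    def __init__(self, nodes):
        for a, b in zip(nodes, nodes[1:]):
            a.successor = b
        self.head, self.tail = nodes[0], nodes[-1]

    def write(self, key, value):
        return self.head.write(key, value)       # clients send writes to the head

    def read(self, key):
        return self.tail.store.get(key)          # clients read from the tail

chain = Chain([Node("n1"), Node("n2"), Node("n3")])
chain.write("k", "v")
print(chain.read("k"))                           # 'v'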

(figure: chain replication)

References

https://www.usenix.org/legacy/event/osdi04/tech/full_papers/renesse/renesse.pdf
http://snookles.com/scott/publications/erlang2010-slf.pdf
https://github.com/CorfuDB/CorfuDB/wiki/White-papers


HTTP Preconnect

https://www.igvita.com/2015/08/17/eliminating-roundtrips-with-preconnect/


TLS Session Resumption


There are two mechanisms that can be used to eliminate a round trip for subsequent TLS connections (discussed below):

  • TLS session IDs
    • in the ServerHello, the server generates a 32-byte session ID for the client; in a later handshake the client can send this ID in its ClientHello, and the server will restore the cached TLS context and avoid the 2nd round trip of the TLS handshake
    • nginx supports this via
      • ssl_session_cache
      • ssl_session_timeout
  • TLS session tickets
    similar to session IDs, except the session state is stored on the client (see the client-side sketch below)
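
A client-side sketch using Python's ssl module (the host name is a placeholder, and real use needs network access): the first connection does a full handshake and receives a session (ID or ticket); passing that session into the next connection lets the server skip a round trip, and session_reused reports whether resumption actually happened.

import socket
import ssl

ctx = ssl.create_default_context()

def connect(host, session=None):
    sock = socket.create_connection((host, 443))
    ssock = ctx.wrap_socket(sock, server_hostname=host, session=session)
    try:
        return ssock.session, ssock.session_reused   # cached session + whether it was reused
    finally:
        ssock.close()

host = "example.com"                 # placeholder host
session, reused = connect(host)      # full handshake; server hands back an ID or ticket
_, reused2 = connect(host, session)  # abbreviated handshake if the server still has the state
print(reused, reused2)               # typically: False True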

(figure: https dialog)


AWS Marketplace

Why shop here?

  • cloud experience
    install/deploy that software on your own EC2 with 1-Click

Why sell here?

  • AWS has already built a strong ecosystem for ISVs
  • gain new customers
  • enable usage-based billing

NILFS

Intro

New Implementation of a Log-Structured File System, included in the Linux kernel since 2.6.30

  • taking a snapshot is very simple: just record a version
  • it is especially efficient for random small-file reads and writes
  • on SSDs, NILFS2 has a clear performance advantage
insmod nilfs2.ko                      # load the nilfs2 kernel module
mkfs -t nilfs2 /dev/sda8              # format the partition
mount -t nilfs2 /dev/sda8 /nilfs      # mount it at /nilfs

Benchmark

(benchmark figures: small files, large files)

vs Journal File System

A journaling file system keeps only metadata in its journal, whereas a log-structured file system uses the log to record everything.

References

http://www.linux-mag.com/id/7345/


CryptDB

A DB proxy that encrypts both column names and records; Google built its Encrypted BigQuery client based on CryptDB's design.
Data leakage problems still remain.

deployment

  • CryptDB enables most DBMS functionality with a performance overhead of under 30%
  • Arx is built on top of MongoDB and reports a performance overhead of approximately 10%
  • OSPIR-EXT/SisoSPIR support MySQL
  • BlindSeer

(figure: db crypt)

References

https://github.com/CryptDB/cryptdb
http://people.csail.mit.edu/nickolai/papers/raluca-cryptdb.pdf
https://people.csail.mit.edu/nickolai/papers/popa-cryptdb-tr.pdf


Bloom Filter

by Burton Bloom in 1970

false positive possible, false negative impossible

Widely used in Dynamo, PostgreSQL, HBase, and Bitcoin.
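
A small Python sketch of the structure (sizes and hash count are arbitrary illustration values): add sets k bit positions, and a lookup answers "maybe present" only if all k bits are set, which is exactly why false positives are possible but false negatives are not.

import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k=3):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        # derive k bit positions from k independent hashes of the item
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))   # True: never a false negative
print(bf.might_contain("bob"))     # usually False; True here would be a false positive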

(figure: Bloom filter)


Why Do Computers Stop

Jim Gray, June 1985, Tandem Technical report 85.7

Terms

Reliability != Availability

  • availability is doing the right thing within the specified response time
    • Availability = MTBF/(MTBF + MTTR) (see the small worked example after this list)
    • in a distributed system, overall availability is the product of the subsystems' availabilities
    • modularity keeps a local failure from taking down the whole system; redundancy reduces MTTR
    • a disk's MTBF is about 10,000 hours, i.e. roughly 1 year; with two fully independent mirrored disks and an MTTR of 24h, the pair's MTBF is on the order of 1000 years
  • reliability is not doing the wrong thing
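
A small worked example of the formulas in the list above (the 10,000-hour MTBF and 24-hour MTTR are the figures already quoted; everything else is plain arithmetic, not numbers from the report):

mtbf, mttr = 10_000.0, 24.0          # hours: ~1-year disk MTBF, 24h repair time
a = mtbf / (mtbf + mttr)             # availability of one disk, about 0.9976

# subsystems that must all work: availabilities multiply
a_two_in_series = a * a              # about 0.9952

# a mirrored pair where either disk suffices: 1 - P(both down at once)
a_mirrored = 1 - (1 - a) ** 2        # about 0.999994
print(round(a, 6), round(a_two_in_series, 6), round(a_mirrored, 6))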

Report

(figure: MTBF)

Approaches to designing a fault-tolerant system

process pair

  • Lockstep
    if one execution fails, the other takes over
    tolerates hardware faults, but does not help with Heisenbugs (bugs that are hard to reproduce)
  • State Checkpointing
    the primary forwards each request to the backup via messages; if the primary dies, fail over to the backup
    sequence numbers are used to deduplicate and to detect lost messages (a toy sketch follows this list)
    fairly hard to implement
  • Automatic Checkpointing
    the kernel manages checkpoints automatically instead of leaving it to the application
    sends a lot of messages
  • Delta Checkpointing
    logical updates, rather than physical updates, are sent to the backup
    better performance and fewer messages, but even harder to implement
  • Persistence
    state may be lost on failure
    transactions are needed to improve reliability
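
A toy sketch of the state-checkpointing idea (hypothetical classes, far simpler than the real Tandem mechanism): the primary checkpoints each update to the backup tagged with a sequence number, which lets the backup drop retransmitted duplicates and detect lost messages.

class Backup:
    def __init__(self):
        self.next_seq = 0
        self.state = {}

    def checkpoint(self, seq, key, value):
        if seq < self.next_seq:        # retransmitted duplicate: ignore
            return "duplicate"
        if seq > self.next_seq:        # a checkpoint message was lost: ask for a resend
            return "gap"
        self.state[key] = value
        self.next_seq += 1
        return "applied"

class Primary:
    def __init__(self, backup):
        self.backup = backup
        self.next_seq = 0
        self.state = {}

    def put(self, key, value):
        self.state[key] = value
        # checkpoint the update to the backup before acknowledging the client
        status = self.backup.checkpoint(self.next_seq, key, value)
        assert status in ("applied", "duplicate")   # "gap" would trigger a retransmission
        self.next_seq += 1

backup = Backup()
primary = Primary(backup)
primary.put("x", 1)
primary.put("y", 2)
print(backup.state)    # {'x': 1, 'y': 2}: the backup can take over if the primary fails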

Fault-tolerant Communication

Hardware: use multiple data paths with independent failure modes.
Software: introduce a session concept (similar to TCP).

Fault-tolerant Storage

Two replicas are about right; a third copy does not necessarily improve MTBF, because other failure causes start to dominate.
Replicate across sites.
Sharding the data limits the scope of a failure.

References

http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf


Amazon Aurora

Background

Launched in 2014 for MySQL, and in 2016 for PostgreSQL.

Aurora

A shared-disk architecture: consistency is handled by the shared storage layer, compute nodes are decoupled from storage nodes, the MySQL layer itself is stateless, one writer with multiple readers, and backups go to S3. Essentially it is still a single-node database.

  • its binlog cannot be accessed
  • automatic storage scaling up to 64 TB, SSD
  • data in transit goes over SSL (AES-256)
  • supports 100,000 writes/s, 500,000 reads/s
  • pricing
    • 8 vCPU/61GB $1.16/h
    • 16vCPU/122GB $2.32/h
    • 32vCPU/244GB $4.62/h
      the largest instance works out to more than 20,000 RMB per month

Architecture

(figures: Aurora overview (HM is the RDS agent), Aurora I/O path, group commit, thread pool)

Why not R(2)+W(2)>N(3) quorum?

Aurora uses R(3)+W(4)>N(6): 3 AZs (which must be in the same region), with 2 copies per AZ.
This guarantees:

  • the read and write quorums intersect
  • W > N/2, which prevents conflicting writes

Reasons (see the quorum check sketch after this list):

    • 2+2>3
      tolerates only one AZ crash; one more node failure on top of that loses both quorums
    • 3+4>6
      tolerates an AZ crash plus one additional node crash, i.e. "AZ+1"
      why does this matter? data durability means that data already written can still be read back, and surviving AZ+1 improves durability
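
A quick check of the AZ+1 argument in plain Python (replica names are made up for illustration; this is not Aurora code):

replicas = {"az1-a", "az1-b", "az2-a", "az2-b", "az3-a", "az3-b"}   # 2 copies per AZ
W, R = 4, 3                                           # the 3+4>6 scheme

def quorums_after(failed):
    alive = len(replicas) - len(failed)
    return {"write": alive >= W, "read": alive >= R}

print(quorums_after({"az1-a", "az1-b"}))              # AZ down: write True, read True
print(quorums_after({"az1-a", "az1-b", "az2-a"}))     # AZ+1: write False, read True
# Reads surviving AZ+1 is what preserves durability: the data can still be read
# and the write quorum rebuilt. With 2-of-3 quorums (one copy per AZ), the same
# AZ+1 failure leaves a single copy and both quorums are lost.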

Why segmented storage

If an AZ crashes, the write quorum is at risk and availability drops; to reach the 99.99% availability target, their approach is to reduce MTTR.

Similar to ES, the data store (ibdata) is split into segments, each 10GB, up to 64TB in total, and each segment is replicated 6 ways (across 3 AZs); 10GB is chosen so that MTTR stays around 10s (roughly the time to copy 10GB over a 10Gbps link).
Segments become the unit of independent background-noise failure and repair: background jobs continuously check and repair broken segments; without segmenting, each repair would be far more expensive.
It also fits the underlying storage layer and makes linear capacity scaling easy.

Scale

  • scale write
    the only option is to upgrade the master to a bigger EC2 instance: scale up, not scale out
  • scale read
    add more read replicas

References

http://www.allthingsdistributed.com/files/p1041-verbitski.pdf
https://www.percona.com/blog/2016/05/26/aws-aurora-benchmarking-part-2/
http://www.tusacentral.net/joomla/index.php/mysql-blogs/175-aws-aurora-benchmarking-blast-or-splash.html
