HTTP Preconnect

https://www.igvita.com/2015/08/17/eliminating-roundtrips-with-preconnect/


TLS Session Resumption

TLS

There are two mechanisms that can be used to eliminate a round trip for subsequent TLS connections (discussed below):

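  • TLS session IDs
    • In its ServerHello, the server generates a 32-byte session ID and sends it to the client. In later TLS handshakes the client can send this ID in its ClientHello, and the server will restore the cached TLS context and skip the second round trip of the handshake.
    • nginx supports this mechanism through two directives:
      • ssl_session_cache
      • ssl_session_timeout
  • TLS session tickets
    Similar to session IDs, except that the session state is stored on the client.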

[Figure: HTTPS dialog]


AWS Marketplace

Why shop here?

  • cloud experience
    install/deploy that software on your own EC2 with 1-Click

Why sell here?

  • AWS has already built a strong ecosystem for ISVs
  • gain new customers
  • enable usage-based billing

NILFS

Intro

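New Implementation of a Log-Structured File System, included in the Linux kernel since 2.6.30

  • Taking a snapshot is very cheap: it only needs to record a version
  • It is especially efficient for random reads and writes of small files
  • On SSDs, NILFS2 has a clear performance advantage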
insmod nilfs2.ko
mkfs -t nilfs2 /dev/sda8
mount -t nilfs2 /dev/sda8 /nilfs

Benchmark

[Figure: small-file benchmark]
[Figure: large-file benchmark]

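vs Journaling File System

A journaling file system typically writes only metadata to its journal, whereas a log-structured file system uses the log to record everything.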

References

http://www.linux-mag.com/id/7345/


CryptDB

A DB proxy that encrypts both column names and records. Google built its Encrypted BigQuery client based on CryptDB's design.
Data leakage is still possible, however.

deployment

  • CryptDB enables most DBMS functionality with a performance overhead of under 30%
  • Arx is built on top of MongoDB and reports a performance overhead of approximately 10%
  • OSPIR-EXT/SisoSPIR support MySQL
  • BlindSeer

[Figure: DB encryption]

References

https://github.com/CryptDB/cryptdb
http://people.csail.mit.edu/nickolai/papers/raluca-cryptdb.pdf
https://people.csail.mit.edu/nickolai/papers/popa-cryptdb-tr.pdf


Bloom Filter

Invented by Burton Bloom in 1970.

False positives are possible; false negatives are impossible.

Widely used in Dynamo, PostgreSQL, HBase, and Bitcoin.
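
To make the false-positive / false-negative property concrete, here is a minimal sketch of a Bloom filter. The class name, the default sizes, and the use of salted SHA-256 to derive bit positions are all illustrative assumptions, not how any of the systems listed above implement it.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array."""

    def __init__(self, m_bits: int = 1024, k_hashes: int = 3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions by salting SHA-256 with the hash index
        # (an illustrative choice, not a tuned hash family).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct,
        # because add() never clears a bit.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter()
for key in ("dynamo", "hbase", "bitcoin"):
    bf.add(key)
print(all(bf.might_contain(k) for k in ("dynamo", "hbase", "bitcoin")))
```

Every added key is always reported as present, which is the "no false negatives" half; a key that was never added can still hit bits set by other keys, which is the "false positives possible" half.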

[Figure: Bloom filter]


Why Do Computers Stop

Jim Gray, June 1985, Tandem Technical report 85.7

Terms

Reliability != Availability

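  • availability is doing the right thing within the specified response time
    • Availability = MTBF/(MTBF + MTTR)
    • In a distributed system, overall availability is the product of the availabilities of the subsystems it depends on
    • Modularity keeps a local failure from taking down the whole system; redundancy reduces MTTR
    • A disk's MTBF is about 10,000 hours, i.e. roughly one year; if two fully independent disks form a redundant pair and MTTR is assumed to be 24 h, the MTBF of the pair is on the order of 1,000 years
  • reliability is not doing the wrong thing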
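
The availability formula and the product rule above can be tried out with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and the three-subsystem chain is a hypothetical example rather than a figure from the report.

```python
from math import prod


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def serial_availability(parts) -> float:
    """Availability of a system that needs every subsystem to be up."""
    return prod(parts)


# One disk with the numbers from the notes: MTBF 10,000 h, MTTR 24 h.
single = availability(10_000, 24)
print(f"{single:.4f}")  # 0.9976

# Hypothetical chain of three such subsystems, all required.
chain = serial_availability([single] * 3)
print(chain < single)  # True: the product is lower than any single part
```

The second print shows why long dependency chains hurt: every extra required subsystem multiplies in another factor below 1.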

Report

[Figure: MTBF]

Approaches to designing fault-tolerant systems

process pair

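  • Lockstep
    If one process fails, the other takes over.
    Tolerates hardware faults, but does not address Heisenbugs (bugs that are hard to reproduce).
  • State Checkpointing
    The primary forwards requests to the backup through synchronized messages; if the primary dies, the system switches to the backup.
    Sequence numbers are used to discard duplicates and to detect lost messages.
    Fairly hard to implement.
  • Automatic Checkpointing
    The kernel manages checkpoints automatically instead of leaving it to the application.
    Sends a large number of messages.
  • Delta Checkpointing
    What is sent to the backup is logical updates rather than physical updates.
    Better performance and fewer messages, but even harder to implement.
  • Persistence
    State may be lost on failure.
    Transactions need to be added to improve reliability.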
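
The sequence-number idea under State Checkpointing can be sketched as follows. The `Backup` class, its method names, and the string return values are hypothetical and chosen for illustration; the report does not prescribe this interface.

```python
class Backup:
    """Backup side of a process pair that applies checkpoint messages in order."""

    def __init__(self):
        self.next_seq = 1   # sequence number expected next
        self.state = {}     # replicated state, key -> value

    def receive(self, seq: int, key: str, value) -> str:
        if seq < self.next_seq:
            # Already applied: a retransmission, safe to ignore.
            return "duplicate"
        if seq > self.next_seq:
            # A message in between never arrived; do not apply out of order.
            return "gap"
        self.state[key] = value
        self.next_seq += 1
        return "applied"


backup = Backup()
print(backup.receive(1, "balance", 100))  # applied
print(backup.receive(1, "balance", 100))  # duplicate
print(backup.receive(3, "balance", 300))  # gap
```

On a "gap" a real primary would have to resend the missing message; that recovery path is left out of this sketch.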

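Fault-tolerant Communication

Hardware: multiple data paths with independent failure modes.
Software: introduce the concept of a session (similar to TCP).

Fault-tolerant Storage

Two replicas are just right; a third does not necessarily improve MTBF, because other failure factors become dominant.
Distributed replication.
Partitioning the data limits the scope of a failure.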

References

http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf


Amazon Aurora

Background

Launched in 2014 for MySQL, and in 2016 for PostgreSQL.

Aurora

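A shared-disk architecture: storage is shared to solve the consistency problem, and compute nodes are decoupled from storage nodes. MySQL itself is stateless, there is one writer and many readers, and S3 is used for backups. In essence it is still a single-node database.

  • The binlog is not accessible
  • automatic storage scaling up to 64 TB, SSD
  • Data in transit is protected with SSL (AES-256)
  • Supports 100,000 writes/s and 500,000 reads/s
  • Pricing
    • 8 vCPU/61 GB: $1.16/h
    • 16 vCPU/122 GB: $2.32/h
    • 32 vCPU/244 GB: $4.62/h
      That works out to more than 20,000 RMB per month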

Architecture

[Figure: Aurora overview (HM is the RDS agent)]
[Figure: Aurora IO]
[Figure: IO]
[Figure: Group Commit]
[Figure: ThreadPool]

Why not R(2)+W(2)>N(3) quorum?

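Aurora uses R(3)+W(4)>N(6): 3 AZs (which must be in the same region), with 2 copies in each AZ.
This guarantees that

  • the read set and the write set intersect
  • W>N/2, which prevents conflicting writes

Reason

    • 2+2>3
      Tolerates the crash of only one AZ
    • 3+4>6
      Tolerates the crash of one AZ plus one additional node, i.e. AZ+1
      Why does this matter? Data durability means that data which was written can be read back, and AZ+1 tolerance improves durability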
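
The comparison of the two quorum schemes above comes down to counting surviving copies. The following sketch does that counting; the function and parameter names are invented for illustration, and it assumes copies are spread evenly across AZs as described in the notes.

```python
def quorum_is_valid(n: int, w: int, r: int) -> bool:
    """Read and write sets intersect, and two writes cannot both succeed disjointly."""
    return r + w > n and w > n / 2


def survives(n: int, quorum: int, copies_per_az: int,
             az_down: int = 0, extra_nodes_down: int = 0) -> bool:
    """True if enough copies remain to form the given quorum."""
    alive = n - az_down * copies_per_az - extra_nodes_down
    return alive >= quorum


for name, n, w, r, per_az in (("2/3", 3, 2, 2, 1), ("4/6", 6, 4, 3, 2)):
    print(
        name,
        "valid:", quorum_is_valid(n, w, r),
        "write after AZ loss:", survives(n, w, per_az, az_down=1),
        "read after AZ+1:", survives(n, r, per_az, az_down=1, extra_nodes_down=1),
    )
```

With 3 copies, losing an AZ plus one more node leaves a single copy, so neither quorum can be formed. With 6 copies the same failure leaves 3, which still satisfies the read quorum, so written data remains readable and can be used to rebuild the lost copies.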

Why segmented storage

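With one AZ already down, a single additional node failure breaks the write quorum and lowers availability. To raise availability (99.99%), their approach is to reduce MTTR.

Similar to ES, the data store (ibdata) is split into segments of 10 GB each, up to 64 TB in total, and every segment is replicated 6 ways (across 3 AZs). The 10 GB size is chosen so that MTTR can be kept around 10 s.
Segments thus become the unit of independent background-noise failure and repair: background processes continuously check for and repair segment errors. Without segmentation, repair would be very expensive.
Segmentation also fits the underlying storage mechanism and makes linear capacity expansion easy.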
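
A back-of-the-envelope check of the 10 GB / 10 s figures above. The 10 Gbps link speed is an assumption brought in for this sketch (the notes do not state it), decimal units are used throughout, and the function names are illustrative.

```python
def repair_seconds(segment_bytes: int, link_bits_per_second: int) -> float:
    """Time to copy one segment over the network, ignoring protocol overhead."""
    return segment_bytes * 8 / link_bits_per_second


def segment_count(volume_bytes: int, segment_bytes: int) -> int:
    """Number of segments needed to cover a volume (rounded up)."""
    return -(-volume_bytes // segment_bytes)


GB = 10**9
TB = 10**12

# Assumed link speed: 10 Gbps.
print(repair_seconds(10 * GB, 10 * 10**9))  # 8.0
print(segment_count(64 * TB, 10 * GB))      # 6400
```

Under these assumptions a single segment can be re-replicated in about 8 seconds, consistent with the roughly 10 s MTTR target, while a full 64 TB volume consists of 6,400 such independently repairable pieces.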

Scale

  • scale write
    The only option is to move the master to a larger EC2 instance: scale up, not scale out
  • scale read
    add more read replicas

References

http://www.allthingsdistributed.com/files/p1041-verbitski.pdf
https://www.percona.com/blog/2016/05/26/aws-aurora-benchmarking-part-2/
http://www.tusacentral.net/joomla/index.php/mysql-blogs/175-aws-aurora-benchmarking-blast-or-splash.html


Nagle's Algorithm and Delayed ACK

Case

When both are enabled:

client.send(1600B)       // 1600 > 1460 (MSS): split into Packet(1460) + Packet(140)
client.sendPacket(1460)
server.recv(1460)        // no PSH flag, so the server waits for the remaining 140 bytes
                         // delayed ACK is active, so no ACK goes server -> client yet
client.sendPacket(140)   // held back by Nagle: there is unacked data, so wait until
                         // 1) buffered data >= 1460 or 2) an ACK arrives
...                      // the server's delayed-ACK timer expires
server.ack(1460)
client.recv(ack)
client.sendPacket(140)   // now actually sent

delayed ack

On Linux the minimum delay is 40 ms; the actual value is computed dynamically from the RTO and RTT.

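Nagle's Algorithm

  • The first packet is sent immediately, whatever its size
  • As long as everything sent so far has been ACKed by the peer, new data can be sent without waiting
  • If there is unACKed data, wait until the buffered data fills a full MSS and send it together. In other words, only one unACKed small packet may be in flight at a time: a byte-based "stop-and-wait"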
if there is new data to send
    if the window size >= MSS and available data is >= MSS
        send complete MSS segment now
    else
        if there is unacked data still in the buffer
            enqueue data in the buffer until an ack is received
        else
            send data immediately
        end if
    end if
end if
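
The pseudocode above maps directly onto a small decision function. This is a sketch for experimenting with the rule, not kernel code: the function name, parameters, and string results are all made up for illustration.

```python
def nagle_decision(pending_bytes: int, window_bytes: int, mss: int,
                   unacked_in_flight: bool) -> str:
    """Decide what a sender following Nagle's rule does with buffered data."""
    if pending_bytes == 0:
        return "idle"                  # nothing new to send
    if window_bytes >= mss and pending_bytes >= mss:
        return "send_full_segment"     # a complete MSS can go out now
    if unacked_in_flight:
        return "buffer_until_ack"      # small data waits for the outstanding ACK
    return "send_now"                  # nothing in flight: send immediately


MSS = 1460
# The 140-byte tail from the case above, while the 1460-byte packet is unacked:
print(nagle_decision(140, 65535, MSS, unacked_in_flight=True))   # buffer_until_ack
# The same tail once the ACK has arrived:
print(nagle_decision(140, 65535, MSS, unacked_in_flight=False))  # send_now
```

The two calls reproduce the stall in the case above: the small tail is held until the delayed ACK finally arrives.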

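TCP_CORK vs Nagle

cork: a stopper (plug)

Cork behaves like a stronger Nagle's algorithm, but it ignores ACKs: even when every ACK has been received, data is still held back as long as the packet is not full-sized and the timeout has not expired.
Cork aims to improve network utilization, whereas Nagle aims to avoid the congestion caused by too many small packets (whose payload is tiny relative to the header).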
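
Both behaviours are controlled per socket. The sketch below uses Python's standard `socket` module; the helper names are hypothetical. `TCP_NODELAY` turns Nagle's algorithm off, and `TCP_CORK` is Linux-specific, so the code checks whether the constant is available before using it.

```python
import socket


def disable_nagle(sock: socket.socket) -> None:
    """Send small writes immediately instead of waiting for ACKs."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


def set_cork(sock: socket.socket, enabled: bool) -> bool:
    """Cork or uncork the socket; returns False where TCP_CORK is unavailable."""
    cork = getattr(socket, "TCP_CORK", None)  # only defined on Linux
    if cork is None:
        return False
    sock.setsockopt(socket.IPPROTO_TCP, cork, 1 if enabled else 0)
    return True


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    disable_nagle(s)
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)  # True
```

A typical corking pattern is to cork, write a header and a body with separate calls, then uncork so that the kernel flushes them as full-sized packets.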


DB Storage Structures

KV

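Whatever the storage structure, data can be expressed as k=>v. This holds for RDBMS as well as NoSQL.
For example, an InnoDB primary key is

primary_key => [column1, column2, ...]

A secondary index works the same way, so insert/update carries index maintenance overhead to keep all the indexes consistent.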
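
The k=>v view and the index maintenance cost can be sketched with two dictionaries. The `Table` class and the `email` column are hypothetical, and the sketch assumes the indexed column is unique; it illustrates the idea only, not how InnoDB lays out its indexes.

```python
class Table:
    """Rows keyed by primary key, plus one secondary index on 'email'."""

    def __init__(self):
        self.primary = {}    # primary_key -> row (dict of columns)
        self.by_email = {}   # secondary index: email -> primary_key

    def upsert(self, pk: int, row: dict) -> None:
        old = self.primary.get(pk)
        if old is not None:
            # Index maintenance: drop the stale secondary entry first.
            self.by_email.pop(old["email"], None)
        self.primary[pk] = row
        self.by_email[row["email"]] = pk

    def find_by_email(self, email: str):
        # Two lookups: secondary index -> primary key -> row.
        pk = self.by_email.get(email)
        return None if pk is None else self.primary[pk]


t = Table()
t.upsert(1, {"name": "ann", "email": "ann@example.com"})
t.upsert(1, {"name": "ann", "email": "ann@example.org"})
print(t.find_by_email("ann@example.com"))  # None: the stale entry was removed
```

Every write touches both maps, which is the maintenance overhead mentioned above; with more secondary indexes the cost per write grows accordingly.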

B-Tree

Designed for optimal data retrieval performance, not data storage.

Used by RDBMS, LMDB, MongoDB

Makes full use of read-ahead.

[Figure: B-Tree throughput]

LSM Hash Table

bitcask

No range queries; the entire index lives in an in-memory hash table.
It can be seen as a simplified LSM-Tree.
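
The structure described above (an append-only log plus an in-memory hash index) can be sketched in a few lines. `MiniBitcask` is a hypothetical toy: an in-memory `io.BytesIO` stands in for the data file, and the record format, compaction, and crash recovery of the real Bitcask are all omitted.

```python
import io


class MiniBitcask:
    """Append-only value log with an in-memory hash index of the latest values."""

    def __init__(self):
        self.log = io.BytesIO()   # stands in for the append-only data file
        self.keydir = {}          # key -> (offset, length) of the latest value

    def put(self, key: str, value: bytes) -> None:
        self.log.seek(0, io.SEEK_END)      # always append, never overwrite
        offset = self.log.tell()
        self.log.write(value)
        self.keydir[key] = (offset, len(value))

    def get(self, key: str):
        entry = self.keydir.get(key)
        if entry is None:
            return None
        offset, length = entry
        self.log.seek(offset)              # one seek, one read
        return self.log.read(length)


db = MiniBitcask()
db.put("k", b"old")
db.put("k", b"newer")
print(db.get("k"))  # b'newer'
```

Because the index is a hash table, point lookups cost one seek, but there is no key ordering to walk, which is why range queries are not supported.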

LSM-Tree

The bloom filter on each SSTable only helps individual key lookups; it is of no use for range queries.

Fractal-Tree

Similar to a B-Tree, but it greatly reduces random disk IO by buffering changes and compressing data.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.76.762&rep=rep1&type=pdf
