2 phase commit failures

Node Failure Models

  • fail-stop
    crash and never recover
  • fail-recover
    crash and later recover
  • byzantine failure
    arbitrary behavior: the node may keep running but send wrong or conflicting messages

Cases

With n participant nodes, 2-phase commit needs 3n message exchanges (n proposals + n votes + n commit/abort notifications)

  • coordinator crashes after sending the proposal

    • some nodes receive it, some do not
    • a node that received the proposal is blocked forever, even though it may already have voted commit
      it cannot simply time out and abort, because the coordinator may recover at any time and start the phase-2 commit
      the txn is blocked by the coordinator and cannot make any progress
    • solution (see the sketch after this list)
      introduce a watchdog for the coordinator; once it detects the coordinator has crashed, it takes over:
      Phase 1. ask each participant whether it voted commit, voted abort, or has not voted yet
      Phase 2. tell every participant to Commit or Abort
      this still has a limitation: if some participant has crashed, Phase 1 cannot get a definite answer
  • worst case
    the coordinator is itself also a participant
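
A minimal Go sketch of the watchdog takeover described above; the Participant interface and its QueryVote/Decide methods are hypothetical, not taken from any particular 2PC implementation:

package twopc

// Vote is what a participant reports to the recovery watchdog.
type Vote int

const (
    VoteUnknown Vote = iota // has not voted yet
    VoteCommit
    VoteAbort
)

// Participant is the (hypothetical) RPC surface each node exposes.
type Participant interface {
    QueryVote() (Vote, error) // recovery Phase 1: what did you vote?
    Decide(commit bool) error // recovery Phase 2: final Commit/Abort
}

// RecoverAfterCoordinatorCrash is run by the watchdog after it takes over.
// It may only decide commit if every participant is reachable and voted commit.
func RecoverAfterCoordinatorCrash(parts []Participant) error {
    commit := true
    for _, p := range parts {
        v, err := p.QueryVote()
        if err != nil {
            // A crashed participant may already have voted commit,
            // so neither commit nor abort is safe: the txn stays blocked.
            return err
        }
        if v != VoteCommit {
            commit = false // any abort or missing vote forces abort
        }
    }
    for _, p := range parts {
        if err := p.Decide(commit); err != nil {
            return err
        }
    }
    return nil
}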


graph database

graph index
graph operations


HDD Stats Q1 2017

(HDD failure-rate chart by drive model; see the Backblaze Q1 2017 report in References)

Backblaze Storage Pod

45 SATA drives are packed into one 4U chassis; the drives are grouped into 15-drive RAID6 arrays with ext4 on top, giving usable space of 87% of raw capacity; all access goes through Tomcat over HTTPS
no iSCSI, no NFS, no Fibre Channel
180TB costs $9,305, i.e. $0.0517 per GB

storage pod

Backblaze Vaults cloud backup service

99.999999% annual durability
150PB stored already, across 1,000 Storage Pods with 40,000 drives; about 10 drives fail every day

Each Vault consists of 20 Storage Pods with 45 drives each, i.e. 900 drives per Vault; with 4TB drives one Vault can store 3.6PB
Each drive uses the ext4 filesystem. Each Vault has a 7-digit id such as 555-1021: the first 3 digits are the data center id and the last 4 are the vault id
A name-service-like component sits in front: before a backup, the client asks the name service for a vault id, then does the backup IO directly against that vault (HTTPS)

Each file is split into 20 shards = 17 data shards + 3 parity shards, stored on ext4
Each shard carries a checksum; if a shard is corrupted, it can be rebuilt from any 17 of the other shards
If 1 pod in a Vault crashes, backup writes continue with parity reduced to 2; with 3 pods down writes still succeed but with no parity left, and if one more pod fails at that point the data can no longer be recovered

References

https://www.backblaze.com/blog/vault-cloud-storage-architecture/
https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/


OpenStack storage

  • Object storage: Swift
  • Block storage (EBS/RBD style): Cinder, Ceph, Sheepdog
  • RDS (managed databases): Trove

Reed-Solomon erasure code

Intro

Erasure codes are widely used for RAID, backups, cold data, and historical-data storage
Some people also use them to spread one piece of data across multiple cloud providers (e.g. S3, Azure, Rackspace) to remove the dependency on any single vendor: a cloud of clouds

Usage

A file of size x is split into n data shards + k parity shards; it tolerates the loss of any k shards (data or parity), i.e. any n valid shards are enough to recover the file

$encode -data 4 -par 2 README.md # n=4 k=2
Opening README.md
File split into 6 data+parity shards with 358 bytes/shard.
Writing to README.md.0
Writing to README.md.1
Writing to README.md.2
Writing to README.md.3
Writing to README.md.4 # README.md.4 and README.md.5 are parity shards
Writing to README.md.5
$rm -f README.md.4 README.md.5 # remove the 2 parity shards
$decode -out x README.md
Opening README.md.0
Opening README.md.1
Opening README.md.2
Opening README.md.3
Opening README.md.4
Error reading file open README.md.4: no such file or directory
Opening README.md.5
Error reading file open README.md.5: no such file or directory
Verification failed. Reconstructing data
Writing data to x
$diff x README.md
$ # all the same

Algorithm

File content ‘ABCDEFGHIJKLMNOP’, encoded as 4+2
For a given n and k, the encoding matrix is fixed

coding 4=>4+2
decode after losing 2 shards
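
A minimal Go sketch of the same 4+2 scheme using the github.com/klauspost/reedsolomon library (the encode/decode commands above look like that library's example tools, but that attribution is my assumption):

package main

import (
    "fmt"
    "log"

    "github.com/klauspost/reedsolomon"
)

func main() {
    // n=4 data shards, k=2 parity shards; for a given (n, k) the encoding matrix is fixed.
    enc, err := reedsolomon.New(4, 2)
    if err != nil {
        log.Fatal(err)
    }

    data := []byte("ABCDEFGHIJKLMNOP")
    shards, err := enc.Split(data) // 6 shards: 4 data shards of 4 bytes + 2 empty parity shards
    if err != nil {
        log.Fatal(err)
    }
    if err := enc.Encode(shards); err != nil { // fill in the 2 parity shards
        log.Fatal(err)
    }
    for i, s := range shards {
        fmt.Printf("shard %d: %v\n", i, s)
    }

    // Lose any 2 shards (here one data shard and one parity shard) and rebuild them.
    shards[1], shards[5] = nil, nil
    if err := enc.Reconstruct(shards); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("recovered shard 1: %q\n", shards[1]) // "EFGH" again
}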

References

http://pages.cs.wisc.edu/~yadi/papers/yadi-infocom2013-paper.pdf


kafka ext4

big file

I currently use ext4 as Kafka's storage, with 1GB per segment

Ext4's main optimizations for sequential access to large files

  • replace indirect blocks with extents

    • ext3 uses multi-level indirect block mapping, which turns large-file operations into many random accesses
      • one extra block read (seek) every 1024 blocks (4MB)
      • ext2/3
    • an ext4 extent is a run of contiguous data blocks

      +---------+--------+----------+
      | logical | length | physical |
      +---------+--------+----------+
      |       0 |   1000 |      200 |
      +---------+--------+----------+

      type ext4_extent struct {
          block    uint32 // first logical block extent covers
          len      uint16 // number of blocks covered by extent
          start_hi uint16 // high 16 bits of physical block
          start    uint32 // low 32 bits of physical block
      }


  • inode prealloc
    inodes have good locality: inodes of files in the same directory are kept close together, which speeds up directory lookups
  • Multiblock Allocator (MBAlloc)
    • ext3 can only allocate one 4KB (block-size) block per call
    • ext4 can allocate many blocks at once (see the fallocate sketch after this list)
  • delayed allocation
    defer block allocation to writeback time
  • online defragmentation with e4defrag
    1. allocate more contiguous blocks in a temporary inode
    2. read a data block from the original inode
    3. move the corresponding block number from the temporary inode to the original inode
    4. write out the page
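
A minimal Go sketch of segment preallocation, which is roughly what Kafka's log.preallocate option enables and a direct way to exercise ext4's multiblock allocator; it assumes Linux and golang.org/x/sys/unix, and the segment file name is only illustrative:

//go:build linux

package main

import (
    "log"
    "os"

    "golang.org/x/sys/unix"
)

const segmentSize = 1 << 30 // 1GB Kafka segment

func main() {
    f, err := os.Create("00000000000000000000.log")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // fallocate() lets ext4 reserve the whole segment as a few large extents
    // up front, instead of growing the file block by block on every append.
    if err := unix.Fallocate(int(f.Fd()), 0, 0, segmentSize); err != nil {
        log.Fatal(err)
    }
    // ... sequential appends into the preallocated segment follow here
}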

Benchmark

benchmark

Summary

ext3 vs ext4

http://www.linux-mag.com/id/7271/


fs atomic and ordering

Applications access the filesystem through the POSIX syscall interface, but POSIX only specifies results; it does not spell out the underlying guarantees, so developers are left to guess.
Different filesystems implement these "invisible" parts differently, which leads to large behavioral differences; on top of that, each filesystem has its own mount options and configuration, which makes the guarantees even harder to reason about.

Example

write(f1, "pp")
write(f2, "qq")

If write is not atomic, the file size may change while the content does not (State #A)
Reordering of the two calls produces State #C

  • size atomicity
  • content atomicity
  • calls out of order

Facts

persistence

atomicity

  • atomic single-sector overwrite
    most filesystems provide it today, mainly because the underlying disk already guarantees single-sector writes
  • atomic append
    • requires updating the inode + data block atomically
    • not guaranteed by ext3/ext4/reiserfs in writeback mode
  • multi-block append
    not supported by most filesystems today
  • rename/link
    directory operations like these are basically atomic (see the sketch after this list)
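
Since rename is one of the few operations that is reliably atomic, the usual crash-safe update pattern builds on it. A minimal Go sketch; whether the final directory fsync is strictly required depends on the filesystem and journaling mode:

package fsutil

import (
    "os"
    "path/filepath"
)

// WriteFileAtomic replaces path with data so that readers observe either the
// old content or the new content, never a partially written file.
func WriteFileAtomic(path string, data []byte) error {
    dir := filepath.Dir(path)
    tmp, err := os.CreateTemp(dir, ".tmp-*")
    if err != nil {
        return err
    }
    defer os.Remove(tmp.Name()) // harmless after a successful rename

    if _, err := tmp.Write(data); err != nil {
        tmp.Close()
        return err
    }
    if err := tmp.Sync(); err != nil { // data must be durable before the rename
        tmp.Close()
        return err
    }
    if err := tmp.Close(); err != nil {
        return err
    }
    if err := os.Rename(tmp.Name(), path); err != nil { // the atomic directory operation
        return err
    }

    // fsync the directory so the rename itself survives a crash.
    d, err := os.Open(dir)
    if err != nil {
        return err
    }
    defer d.Close()
    return d.Sync()
}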

ordering


Wake on LAN

WOL is a link-layer protocol: a machine is powered up remotely by sending it a magic packet, usually as a UDP broadcast
The feature can be enabled/disabled in the BIOS

type MACAddress [6]byte

type MagicPacket struct {
    header  [6]byte        // 6 bytes of 0xFF
    payload [16]MACAddress // the target MAC address repeated 16 times
}
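
A minimal Go sketch of sending the packet, assuming the common convention of a UDP broadcast to port 9; the MAC address below is a placeholder:

package main

import (
    "bytes"
    "log"
    "net"
)

// wake broadcasts a WOL magic packet for the given MAC address.
func wake(mac net.HardwareAddr) error {
    // Magic packet = 6 bytes of 0xFF followed by the MAC repeated 16 times.
    pkt := bytes.Repeat([]byte{0xFF}, 6)
    for i := 0; i < 16; i++ {
        pkt = append(pkt, mac...)
    }

    conn, err := net.Dial("udp", "255.255.255.255:9") // discard port, by convention
    if err != nil {
        return err
    }
    defer conn.Close()
    _, err = conn.Write(pkt)
    return err
}

func main() {
    mac, err := net.ParseMAC("00:11:22:33:44:55") // placeholder MAC
    if err != nil {
        log.Fatal(err)
    }
    if err := wake(mac); err != nil {
        log.Fatal(err)
    }
}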

Swinging Door Trending

SDT (swinging door trending) is a lossy streaming compression algorithm; for time series that change infrequently the compression ratio is very high, but peaks and troughs may be lost
A run of consecutive data points is replaced by a straight line determined by its start and end points
Data is "restored" by linear interpolation: prev sample <–interpolate–> next sample

The straight line from point a to point e replaces the data points (a, b, c, d, e)
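
A minimal Go sketch of the idea, assuming a fixed compression deviation e: track the interval of slopes (the two "doors") from the last archived point that can still represent every point seen so far within ±e; when the interval closes, archive the previous point and start a new segment:

package main

import (
    "fmt"
    "math"
)

// Point is one sample of a time series.
type Point struct {
    T float64 // timestamp
    V float64 // value
}

// CompressSDT keeps a subset of pts such that every dropped point can be
// recovered within ±e by linear interpolation between the kept points.
func CompressSDT(pts []Point, e float64) []Point {
    if len(pts) < 2 {
        return pts
    }
    kept := []Point{pts[0]}
    anchor := pts[0]                              // last archived point
    loSlope, hiSlope := math.Inf(-1), math.Inf(1) // the two "doors"
    for i := 1; i < len(pts); i++ {
        p := pts[i]
        dt := p.T - anchor.T
        lo := (p.V - e - anchor.V) / dt // lowest slope that still covers p
        hi := (p.V + e - anchor.V) / dt // highest slope that still covers p
        if lo > loSlope {
            loSlope = lo // lower door swings up
        }
        if hi < hiSlope {
            hiSlope = hi // upper door swings down
        }
        if loSlope > hiSlope { // doors closed: no single line fits any more
            anchor = pts[i-1] // archive the previous point
            kept = append(kept, anchor)
            // reopen the doors against the current point
            dt = p.T - anchor.T
            loSlope = (p.V - e - anchor.V) / dt
            hiSlope = (p.V + e - anchor.V) / dt
        }
    }
    return append(kept, pts[len(pts)-1]) // the last point is always kept
}

func main() {
    pts := []Point{{0, 1.0}, {1, 1.1}, {2, 0.9}, {3, 1.0}, {4, 5.0}, {5, 5.1}}
    fmt.Println(CompressSDT(pts, 0.5)) // points within ±0.5 of the fitted lines are dropped
}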

vs facebook gorilla
http://www.vldb.org/pvldb/vol8/p1816-teller.pdf


socket recv buffer

SO_RCVBUF

SO_RCVBUF ≈ BDP * 4 / 3
(the kernel reserves part of the buffer for its own bookkeeping, per tcp_adv_win_scale, so only about 3/4 of SO_RCVBUF is advertised as the receive window; sizing the buffer at 4/3 of the BDP keeps the usable window at roughly one BDP)
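
A minimal Go sketch of applying the rule, with placeholder bandwidth and RTT figures; note that explicitly setting SO_RCVBUF disables the kernel's receive-buffer auto-tuning for that socket:

package main

import (
    "log"
    "net"
    "time"
)

func main() {
    // Assumed link: 1 Gbit/s with a 20ms RTT (placeholders).
    bandwidthBits := 1_000_000_000.0
    rtt := 20 * time.Millisecond

    bdp := int(bandwidthBits / 8 * rtt.Seconds()) // bytes in flight ≈ 2.5MB
    rcvbuf := bdp * 4 / 3                         // leave room for kernel overhead

    conn, err := net.Dial("tcp", "example.com:443")
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    // SetReadBuffer sets SO_RCVBUF (the kernel may clamp it to net.core.rmem_max).
    if err := conn.(*net.TCPConn).SetReadBuffer(rcvbuf); err != nil {
        log.Fatal(err)
    }
    log.Printf("BDP=%d bytes, SO_RCVBUF requested=%d bytes", bdp, rcvbuf)
}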

use iperf for testing

iperf -s -w 200K # server
iperf -c localhost # client


ethtool -k eth0 # segment offload makes tcpdump see a packet bigger than MTU
net.ipv4.tcp_available_congestion_control # congestion control algorithms the kernel can offer
net.ipv4.tcp_congestion_control # the algorithm currently in use (e.g. cubic)
net.ipv4.tcp_moderate_rcvbuf # let the kernel auto-tune the receive buffer size
net.ipv4.tcp_slow_start_after_idle # drop back to slow start after the connection sits idle