2 phase commit failures

Node Failure Models

  • fail-stop
    crash and never recover
  • fail-recover
    crash and later recover
  • byzantine failure
    node misbehaves arbitrarily, e.g. sends wrong or conflicting messages

Cases

Two-phase commit with n nodes needs 3n message exchanges (proposal, vote, decision).

  • coordinator crashes after sending the proposal

    • some nodes received it, others did not
    • a node that received the proposal is blocked forever; it may already have voted commit
      it cannot simply time out and abort, because the coordinator may recover at any time and start the phase-2 commit
      the txn stays blocked by the coordinator and cannot make any progress
    • workaround (see the sketch after this list)
      add a watchdog for the coordinator; when it detects the coordinator crash, it takes over:
      Phase 1: ask every participant whether it voted commit, voted abort, or has not voted yet
      Phase 2: tell every participant to Commit/Abort
      this is still limited: if a participant has crashed, Phase 1 cannot get an answer from it
  • worst case
    the coordinator itself is also a participant
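
A minimal sketch, in Go, of the watchdog takeover described above; the Participant interface, the Vote type and all names are assumptions for illustration, not a real API.

package main

import "fmt"

type Vote int

const (
    VoteUnknown Vote = iota // never received the proposal, or has not voted yet
    VoteCommit
    VoteAbort
)

// Participant stands in for the real RPC interface to each node.
type Participant interface {
    QueryVote() (Vote, error) // recovery phase 1: ask what it voted
    Decide(commit bool) error // recovery phase 2: push the final decision
}

// recoverTxn re-drives the decision after the original coordinator crashed.
func recoverTxn(participants []Participant) error {
    commit := true
    for _, p := range participants {
        v, err := p.QueryVote()
        if err != nil {
            // The limitation noted above: if a participant is unreachable we cannot
            // tell whether it voted commit, so no safe decision can be made yet.
            return fmt.Errorf("participant unreachable, txn still blocked: %w", err)
        }
        if v != VoteCommit {
            commit = false // any abort or missing vote forces abort
        }
    }
    for _, p := range participants {
        if err := p.Decide(commit); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    _ = recoverTxn // sketch only; real participants would be RPC clients
}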

3PC

A prepare-to-commit phase is inserted between the propose and commit phases.

  • if the coordinator crashes during the proposal or prepare-to-commit phase
    the trx is aborted
  • if the coordinator crashes during the commit phase
    nodes will time out waiting for the commit message and commit the trx (see the sketch below)
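
A minimal sketch of that participant-side timeout rule; the State type and names are made up, and real 3PC also needs recovery and coordinator election.

package main

import "fmt"

type State int

const (
    StateVoted            State = iota // sent its vote, waiting for prepare-to-commit
    StatePreparedToCommit              // received prepare-to-commit, waiting for commit
)

// onCoordinatorTimeout is what a participant does when the coordinator goes silent.
func onCoordinatorTimeout(s State) string {
    if s == StatePreparedToCommit {
        // Prepare-to-commit is only sent after every participant voted commit,
        // so (absent network partitions) committing on timeout is safe.
        return "commit"
    }
    return "abort" // coordinator died during the proposal or prepare-to-commit phase
}

func main() {
    fmt.Println(onCoordinatorTimeout(StateVoted))            // abort
    fmt.Println(onCoordinatorTimeout(StatePreparedToCommit)) // commit
}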

graph database

graph index
graph operations


HDD Stats Q1 2017

HDD Fail Rate

Backblaze Storage Pod

45 SATA drives packed into one 4U chassis, arranged as RAID6 groups of 15 drives with ext4 on top; usable space is 87% of raw capacity, and all access goes through Tomcat over HTTPS.
No iSCSI, no NFS, no Fibre Channel.
180 TB costs $9,305, i.e. $0.0517 per GB.

storage pod

Backblaze Vaults cloud backup service

99.999999% annual durability
Already storing 150 PB across 1,000 Storage Pods and 40,000 drives; about 10 drives fail every day.

Each Vault consists of 20 Storage Pods; each Pod has 45 drives, so a Vault has 900 drives. With 4 TB drives, one Vault stores 3.6 PB.
Each drive uses ext4. Every Vault has a 7-digit id, e.g. 555-1021: the first 3 digits are the data center id, the last 4 the vault id.
A name-service-like component sits in front: before backing up, the client asks the name service for a vault id, then does backup IO directly against that vault over HTTPS.
pod

Each file is split into 20 shards = 17 data shards + 3 parity shards, stored on ext4.
Every shard carries a checksum; a corrupted shard can be rebuilt from any 17 of the remaining shards.
If one pod in a Vault crashes, backup writes drop to 2 parity shards; with 3 pods down writes still succeed but with no parity left, and if one more pod then fails the data cannot be recovered.
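
A minimal sketch of the layout just described: parsing the 7-digit vault id and spreading the 20 shards of one file across the 20 pods of a vault. All names and the one-shard-per-pod placement rule are assumptions for illustration, not Backblaze's actual code.

package main

import (
    "fmt"
    "strings"
)

const (
    podsPerVault = 20
    dataShards   = 17
    parityShards = 3
)

// parseVaultID splits an id like "555-1021" into data-center id and vault id.
func parseVaultID(id string) (dc, vault string) {
    parts := strings.SplitN(id, "-", 2)
    return parts[0], parts[1]
}

// podForShard places each of the 20 shards of a file on a different pod.
func podForShard(shard int) int { return shard % podsPerVault }

func main() {
    dc, vault := parseVaultID("555-1021")
    fmt.Println("data center:", dc, "vault:", vault)
    for shard := 0; shard < dataShards+parityShards; shard++ {
        fmt.Printf("shard %d -> pod %d\n", shard, podForShard(shard))
    }
}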

References

https://www.backblaze.com/blog/vault-cloud-storage-architecture/
https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/


Reed-Solomon erasure code

Intro

Erasure codes are widely used for RAID, backups, cold data, and archival storage.
Some also use them to spread one piece of data across multiple cloud providers (e.g. S3, Azure, Rackspace) to remove the dependence on any single vendor: a cloud of clouds.

Usage

A file of size x is split into n data shards + k parity shards; it tolerates the loss of any k shards (data or parity), i.e. at least n shards must remain intact.

$encode -data 4 -par 2 README.md # n=4 k=2
Opening README.md
File split into 6 data+parity shards with 358 bytes/shard.
Writing to README.md.0
Writing to README.md.1
Writing to README.md.2
Writing to README.md.3
Writing to README.md.4 # README.md.4 and README.md.5 are parity shards
Writing to README.md.5
$rm -f README.md.4 README.md.5 # remove the 2 parity shards
$decode -out x README.md
Opening README.md.0
Opening README.md.1
Opening README.md.2
Opening README.md.3
Opening README.md.4
Error reading file open README.md.4: no such file or directory
Opening README.md.5
Error reading file open README.md.5: no such file or directory
Verification failed. Reconstructing data
Writing data to x
$diff x README.md
$ # all the same

Algorithm

File content 'ABCDEFGHIJKLMNOP', split 4+2.
For a given n and k, the encoding matrix is fixed.

encoding: 4 data shards => 4 + 2 shards
decoding after losing 2 shards
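
The encode/decode commands above look like the example tools of a Go Reed-Solomon library; a minimal sketch of the same 4+2 flow, assuming github.com/klauspost/reedsolomon and its Split/Encode/Reconstruct API:

package main

import (
    "fmt"

    "github.com/klauspost/reedsolomon"
)

func main() {
    const dataShards, parityShards = 4, 2

    enc, err := reedsolomon.New(dataShards, parityShards)
    if err != nil {
        panic(err)
    }

    // Split the payload into 4 data shards, then compute the 2 parity shards.
    shards, err := enc.Split([]byte("ABCDEFGHIJKLMNOP"))
    if err != nil {
        panic(err)
    }
    if err = enc.Encode(shards); err != nil {
        panic(err)
    }

    // Simulate losing any 2 shards (here one data shard and one parity shard).
    shards[1] = nil
    shards[5] = nil

    // Rebuild the missing shards from the 4 survivors, then verify the parity.
    if err = enc.Reconstruct(shards); err != nil {
        panic(err)
    }
    ok, _ := enc.Verify(shards)
    fmt.Println("verified after reconstruction:", ok)
}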

References

http://pages.cs.wisc.edu/~yadi/papers/yadi-infocom2013-paper.pdf


kafka ext4

big file

I currently use ext4 for kafka storage, with 1 GB per segment.

Ext4's main optimizations for sequential access to large files

  • replace indirect blocks with extents

    • ext3 uses multi-level indirect block mapping, which turns large-file access into many random reads
      • one extra block read (seek) every 1024 blocks (4 MB)
      • ext2/3
    • an ext4 extent is a run of contiguous data blocks

      +---------+--------+----------+
      | logical | length | physical |
      +---------+--------+----------+
      |    0    |  1000  |   200    |
      +---------+--------+----------+

      type ext4_extent struct {
          block    uint32 // first logical block extent covers
          len      uint16 // number of blocks covered by extent
          start_hi uint16 // high 16 bits of physical block
          start    uint32 // low 32 bits of physical block
      }

      ext4

  • inode prealloc
    inodes have good locality: inodes of files in the same directory are kept close together, which speeds up directory lookups
  • Multiblock Allocator (mballoc)
    • ext3 allocates only one block (4 KB block size) per call
    • ext4 can allocate many blocks in a single call (see the sketch after this list)
  • delayed allocation
    defer block allocation to writeback time
  • online defragmentation with e4defrag
    1. allocate more contiguous blocks in a temporary inode
    2. read a data block from the original inode
    3. move the corresponding block number from the temporary inode to the original inode
    4. write out the page
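
A minimal sketch of nudging ext4 toward large contiguous extents for a 1 GB kafka segment by preallocating it up front; the path is a placeholder and the fallocate(2) wrapper comes from golang.org/x/sys/unix.

package main

import (
    "os"

    "golang.org/x/sys/unix"
)

func main() {
    const segmentSize = 1 << 30 // 1 GB, matching the segment size mentioned above

    f, err := os.Create("/tmp/kafka-segment.log") // hypothetical segment path
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // mode 0 allocates the blocks and extends the file size in one call, letting
    // the multiblock allocator hand out contiguous extents instead of growing
    // the file 4 KB at a time.
    if err := unix.Fallocate(int(f.Fd()), 0, 0, segmentSize); err != nil {
        panic(err)
    }
}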

Benchmark

benchmark

Summary

ext3 vs ext4

http://www.linux-mag.com/id/7271/


fs atomic and ordering

Applications reach the file system through POSIX system calls, but POSIX only specifies the results; the underlying guarantees are left unstated, so developers are reduced to guessing.
Different file systems implement these invisible parts differently, which produces large behavioral differences; on top of that, each file system has its own mount options, which makes the behavior even harder to reason about.

Example

write(f1, "pp")
write(f2, "qq")

If the write is not atomic, the file size may change while the content does not (State #A).
Reordering of the two writes produces State #C.
fs

  • size atomicity
  • content atomicity
  • calls out of order

Facts

persistence

atomicity

  • atomic single-sector overwrite
    most file systems provide atomic single-sector overwrite today, mainly because the underlying disk already guarantees it
  • atomic append
    • requires updating the inode + data block atomically
    • not guaranteed by ext3/ext4/reiserfs in writeback mode
  • multi-block append
    not supported by most file systems today
  • rename/link
    directory operations like these are essentially atomic (see the sketch below)
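
A minimal sketch of leaning on that essentially-atomic rename to get an all-or-nothing file update: write a temp file, fsync it, then rename it over the target. The paths are made up and error handling is kept minimal.

package main

import "os"

func writeFileAtomic(path string, data []byte) error {
    tmp := path + ".tmp"
    f, err := os.Create(tmp)
    if err != nil {
        return err
    }
    if _, err := f.Write(data); err != nil {
        f.Close()
        return err
    }
    if err := f.Sync(); err != nil { // make the content durable before the rename
        f.Close()
        return err
    }
    if err := f.Close(); err != nil {
        return err
    }
    return os.Rename(tmp, path) // readers see either the old or the new content, never a mix
}

func main() {
    if err := writeFileAtomic("/tmp/config.json", []byte(`{"v":2}`)); err != nil {
        panic(err)
    }
}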

ordering


Wake on LAN

WOL is a link-layer protocol: a machine is powered up remotely by broadcasting a magic packet (commonly carried over UDP).
The feature can be enabled or disabled in the BIOS.

type MagicPacket struct {
    header  [6]byte        // 6 bytes of 0xFF
    payload [16]MACAddress // the target MAC address repeated 16 times
}
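
A minimal sketch of actually sending such a magic packet from Go; the MAC address is a placeholder, and port 9 (discard) is only the conventional choice.

package main

import (
    "bytes"
    "net"
)

// sendMagicPacket broadcasts a WOL magic packet: 6 bytes of 0xFF followed by the
// target MAC address repeated 16 times.
func sendMagicPacket(mac net.HardwareAddr) error {
    pkt := bytes.Repeat([]byte{0xFF}, 6)
    for i := 0; i < 16; i++ {
        pkt = append(pkt, mac...)
    }

    // Some networks need a subnet-directed broadcast address instead of the
    // limited broadcast used here.
    conn, err := net.Dial("udp", "255.255.255.255:9")
    if err != nil {
        return err
    }
    defer conn.Close()
    _, err = conn.Write(pkt)
    return err
}

func main() {
    mac, err := net.ParseMAC("00:11:22:33:44:55") // placeholder target MAC
    if err != nil {
        panic(err)
    }
    if err := sendMagicPacket(mac); err != nil {
        panic(err)
    }
}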

Facebook data centers carve out a separate "cold storage" tier just for photos and videos nobody looks at anymore (typically 10 years old); because metadata is kept separate from the data, the storage devices are woken up only when a user actually wants to view something.


Swinging Door Trending

SDT (swinging door trending) is a lossy streaming compression algorithm: for time series that change slowly it achieves very high compression ratios, but it may drop peaks and troughs.
A straight line determined by a start point and an end point replaces a run of consecutive data points.
Data is "recovered" by linear interpolation: prev sample <-- interpolate --> next sample.

The straight line from point a to point e replaces the data points (a, b, c, d, e).
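
A minimal sketch of the swinging-door idea: keep an anchor sample and drop points as long as one straight line from the anchor can pass within a deviation e of all of them; when that is no longer possible, keep the previous point and make it the new anchor. The Point type, the batch interface, and e are assumptions; real SDT implementations are streaming and more refined.

package main

import (
    "fmt"
    "math"
)

// Point is one time-series sample; assumes strictly increasing timestamps.
type Point struct {
    T, V float64
}

func compress(points []Point, e float64) []Point {
    if len(points) == 0 {
        return nil
    }
    kept := []Point{points[0]} // always keep the first sample
    if len(points) == 1 {
        return kept
    }

    anchor := points[0]
    sLow, sHigh := math.Inf(-1), math.Inf(1) // feasible slope range of the replacing line
    prev := points[0]

    for _, p := range points[1:] {
        dt := p.T - anchor.T
        low := (p.V - e - anchor.V) / dt
        high := (p.V + e - anchor.V) / dt
        if low > sLow {
            sLow = low
        }
        if high < sHigh {
            sHigh = high
        }
        if sLow > sHigh { // the doors closed: no single line covers the run any more
            kept = append(kept, prev) // archive the last point the old line still covered
            anchor = prev
            dt = p.T - anchor.T
            sLow = (p.V - e - anchor.V) / dt
            sHigh = (p.V + e - anchor.V) / dt
        }
        prev = p
    }
    return append(kept, prev) // keep the last sample so interpolation can recover the tail
}

func main() {
    pts := []Point{{0, 1}, {1, 1.1}, {2, 0.9}, {3, 1.0}, {4, 5.0}, {5, 5.1}}
    fmt.Println(compress(pts, 0.5)) // the flat run collapses, the jump is kept
}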

vs facebook gorilla
http://www.vldb.org/pvldb/vol8/p1816-teller.pdf


socket recv buffer

SO_RCVBUF

SO_RCVBUF ≈ BDP * 4 / 3 (with tcp_adv_win_scale=2 the kernel reserves a quarter of the buffer for bookkeeping, so the buffer needs to be 4/3 of the bandwidth-delay product)
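
A minimal sketch of applying that sizing from Go, assuming a made-up 100 Mbit/s link with 40 ms RTT and the iperf2 default port 5001.

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    // Assumed link characteristics, purely for illustration.
    const (
        bandwidthBits = 100 * 1000 * 1000     // 100 Mbit/s
        rtt           = 40 * time.Millisecond // round-trip time
    )
    bdp := int64(bandwidthBits/8) * int64(rtt) / int64(time.Second) // bytes in flight
    rcvbuf := int(bdp * 4 / 3)                                      // leave room for kernel bookkeeping

    // Talk to the iperf server started above (iperf2 listens on port 5001 by default).
    conn, err := net.Dial("tcp", "localhost:5001")
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Ideally SO_RCVBUF is set before the handshake (e.g. via net.Dialer.Control)
    // so the TCP window scale is negotiated to match; this late call is a sketch.
    if err := conn.(*net.TCPConn).SetReadBuffer(rcvbuf); err != nil {
        panic(err)
    }
    fmt.Println("requested SO_RCVBUF:", rcvbuf, "bytes")
}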

use iperf for testing

iperf -s -w 200K # server
iperf -c localhost # client

iperf

ethtool -k eth0                           # segment offload makes tcpdump see a packet bigger than the MTU
net.ipv4.tcp_available_congestion_control # congestion control algorithms available to the kernel
net.ipv4.tcp_congestion_control           # algorithm currently in use, e.g. cubic
net.ipv4.tcp_moderate_rcvbuf              # auto-tune the receive buffer size
net.ipv4.tcp_slow_start_after_idle        # fall back to slow start after an idle period

golang pkg github pull request

git remote add upstream https://github.com/funkygao/myrepo.git  # add the remote the branch will be pushed to
git fetch upstream                                              # fetch its branches
git checkout -b mybranch                                        # work on a topic branch
git commit -a                                                   # commit the changes
git push upstream mybranch                                      # push the branch, then open the pull request on GitHub