Database as a Service

Trove

https://wiki.openstack.org/wiki/Trove


DB Schema Management

https://github.com/skeema/skeema

Manage the schema of dev/test/staging/prod environments from a single place.

$ skeema
Skeema is a MySQL schema management tool. It allows you to export a database
schema to the filesystem, and apply online schema changes by modifying files.

Usage:
      skeema [<options>] <command>

Commands:
      add-environment  Add a new named environment to an existing host directory
      diff             Compare a DB instance's schemas and tables to the filesystem
      help             Display usage information
      init             Save a DB instance's schemas and tables to the filesystem
      lint             Verify table files and reformat them in a standardized way
      pull             Update the filesystem representation of schemas and tables
      push             Alter tables on DBs to reflect the filesystem representation
      version          Display program version

Its design for online ALTER TABLE on MySQL is different: Skeema works at a higher level, and the actual migration still relies on an external online schema change (OSC) tool underneath:

alter-wrapper="/usr/local/bin/pt-online-schema-change --execute --alter {CLAUSES} D={SCHEMA},t={TABLE},h={HOST},P={PORT},u={USER},p={PASSWORDX}"

References

https://www.percona.com/live/17/sessions/automatic-mysql-schema-management-skeema


Cassandra vs ScyllaDB

Cassandra

The Cassandra project was born at Facebook. It was originally developed by two Facebook engineers, one of whom had been directly involved in building Amazon's Dynamo; people from the team later moved to Amazon and worked on another NoSQL database, DynamoDB.

The Dynamo paper was published in 2007; Dynamo backed Amazon's shopping cart.
Cassandra was open-sourced by Facebook in 2008; it backed inbox search.
Uber now runs the world's largest Cassandra data center.

Features

  • CQL(Cassandra Query Language)
  • 1,000 node clusters
  • multi-data center
  • out-of-the-box replication
  • ecosystem with Spark

Internals

  • LSM Tree
  • Gossip P2P
  • DHT
  • tunable consistency, same model as DynamoDB (see the gocql sketch after this list)
    • ONE
    • QUORUM
    • ALL
    • read repair
  • Thrift
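To make the tunable consistency levels above concrete, here is a minimal sketch using the gocql Go driver (the driver choice, contact point, keyspace, and table are my own assumptions, not something the post specifies): write at QUORUM, read at ONE, and let read repair reconcile stale replicas in the background.

package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Hypothetical contact point and keyspace.
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "demo"
	cluster.Consistency = gocql.Quorum // default consistency for this session

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Write at QUORUM: a majority of replicas must ack the write.
	if err := session.Query(
		`INSERT INTO users (id, name) VALUES (?, ?)`, 1, "alice",
	).Exec(); err != nil {
		log.Fatal(err)
	}

	// Read at ONE: any single replica may answer; read repair fixes stale replicas later.
	var name string
	if err := session.Query(`SELECT name FROM users WHERE id = ?`, 1).
		Consistency(gocql.One).Scan(&name); err != nil {
		log.Fatal(err)
	}
	fmt.Println(name)
}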

ScyllaDB

A Cassandra (Java) clone written in C++ by core KVM developers, with roughly 10x higher single-node performance. The main reasons:

  • DPDK, bypassing the kernel
  • O_DIRECT I/O, bypassing the page cache; the cache is managed by Scylla itself
    • the page cache can only hold data in on-disk (SSTable) format, whereas an application-level cache is more effective and more compact
    • during compaction the page cache becomes a liability, since it can evict a lot of hot data
  • treat a single node as a share-nothing cluster of CPU cores
  • shard at the CPU core level instead of the node level (see the sketch after this list)
    makes fuller use of multiple cores, reduces contention, exploits the CPU caches, NUMA friendly
  • when data has to cross cores, use explicit message passing
  • avoid JVM GC
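A toy sketch of the shard-per-core idea in Go (purely illustrative; Scylla itself is built in C++ on the Seastar framework): each shard exclusively owns a slice of the key space, and other code reaches it only through explicit message passing, never shared memory or locks.

package main

import (
	"fmt"
	"hash/fnv"
	"runtime"
)

// request is the only way to touch a shard's data: no shared memory, no locks.
type request struct {
	op    string // "put" or "get"
	key   string
	value string
	reply chan string
}

// shard owns its partition of the key space; only its own goroutine reads or writes data.
type shard struct {
	data  map[string]string
	inbox chan request
}

func (s *shard) run() {
	for req := range s.inbox {
		switch req.op {
		case "put":
			s.data[req.key] = req.value
			req.reply <- "ok"
		case "get":
			req.reply <- s.data[req.key]
		}
	}
}

// shardFor hashes a key to its owning shard, the same idea as sharding by core.
func shardFor(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % n
}

func main() {
	n := runtime.NumCPU() // one shard per core
	shards := make([]*shard, n)
	for i := range shards {
		shards[i] = &shard{data: map[string]string{}, inbox: make(chan request)}
		go shards[i].run()
	}

	// Both operations are explicit messages to the owning shard.
	put := request{op: "put", key: "foo", value: "bar", reply: make(chan string)}
	shards[shardFor("foo", n)].inbox <- put
	<-put.reply

	get := request{op: "get", key: "foo", reply: make(chan string)}
	shards[shardFor("foo", n)].inbox <- get
	fmt.Println(<-get.reply) // "bar"
}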

References

https://db-engines.com/en/ranking
https://github.com/scylladb/scylla
http://www.scylladb.com/
http://www.seastar-project.org/
https://www.reddit.com/r/programming/comments/3lzz56/scylladb_cassandra_rewritten_in_c_claims_to_be_up/
https://news.ycombinator.com/item?id=10262719


etcd3 vs zookeeper

Features unique to etcd v3

  • get and watch by prefix, by interval
  • lease based TTL for key sets
  • runtime reconfiguration
  • point in time backup
  • extensive metrics
  • access to historical versions of keys (this is very useful)
    multi-version
  • mini transaction DSL (see the clientv3 sketch after this list)

    Tx.If(
        Compare(Value("foo"), ">", "bar"),
        Compare(Version("foo"), "=", 2),
        ...
    ).Then(
        Put("ok", "true")...
    ).Else(
        Put("ok", "false")...
    ).Commit()
  • leases

    l = CreateLeases(15*second)
    Put(foo, bar, l)
    l.KeepAlive()
    l.Revoke()
  • rich watcher functionality

    • streaming watch
    • supports an index/revision parameter, so no events are lost
    • recursive
  • off-heap
    only the index is kept in memory; most of the data is mmap'ed from the boltdb file
  • incremental snapshot
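A minimal sketch of how the transaction, lease, and watch features above look with the Go clientv3 API (the endpoint, keys, and exact import path are assumptions and vary with the etcd version):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3" // older releases: github.com/coreos/etcd/clientv3
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Mini transaction: compare on the current value, then Put on either branch.
	txnResp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value("foo"), ">", "bar")).
		Then(clientv3.OpPut("ok", "true")).
		Else(clientv3.OpPut("ok", "false")).
		Commit()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("txn succeeded:", txnResp.Succeeded)

	// Lease-based TTL: attach a 15s lease to a key and keep it alive.
	lease, err := cli.Grant(ctx, 15)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(ctx, "foo", "bar", clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}
	// Real code should drain the channel returned by KeepAlive.
	if _, err := cli.KeepAlive(ctx, lease.ID); err != nil {
		log.Fatal(err)
	}

	// Watch by prefix from a specific revision, so no events are lost.
	for wresp := range cli.Watch(ctx, "foo", clientv3.WithPrefix(), clientv3.WithRev(txnResp.Header.Revision)) {
		for _, ev := range wresp.Events {
			fmt.Printf("%s %s=%s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}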

Features unique to ZooKeeper

  • ephemeral znode
  • non-blocking full fuzzy snapshot
    (logs "Too busy to snap, skipping" when a snapshot is already in progress)
  • the key count can scale into the millions
  • on-heap

etcd2

  • etcd2 handles keys on the order of 10K, while etcd3 handles on the order of 1M
    • the limiting factor is snapshot cost, which can drive QPS to 0 and even trigger re-election

comparison

memory footprint

Storing 2M keys of 256 bytes each:

etcd2   10GB
zk      2.4GB
etcd3   0.8GB

References

https://coreos.com/blog/performance-of-etcd.html


2017 kafka report

The survey covered 388 organizations (companies) across 47 countries.
26% of respondents have annual revenue above $1 billion.
15% of respondents process 1 billion messages per day.
43% of respondents run Kafka in a public cloud; of those, 60% use AWS.


References

https://www.confluent.io/wp-content/uploads/2017-Apache-Kafka-Report.pdf


zookeeper processor

Chain of Responsibility
To keep the code structure highly uniform across the different server roles, each role gets its own processor chain.

interface RequestProcessor {
    void processRequest(Request request) throws RequestProcessorException;
}

LeaderZooKeeperServer.java
(diagram: the Leader's processor chain)

FollowerZooKeeperServer.java
(diagram: the Follower's processor chain)

ZooKeeperServer.java

// ZooKeeperServer.java (simplified)
void processPacket(ServerCnxn cnxn, ByteBuffer incomingBuffer) {
    // ... decode the packet into a Request ...
    submitRequest(request);
}

void submitRequest(Request request) {
    // every role simply hands the request to the head of its own chain
    firstProcessor.processRequest(request);
}
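The same chain-of-responsibility wiring, sketched in Go for brevity (illustrative only; the real processors are Java classes, and each server role assembles a different chain before exposing it as firstProcessor):

package main

import "fmt"

// Request mirrors the argument of RequestProcessor.processRequest.
type Request struct {
	ID int
	Op string
}

// RequestProcessor is the Go equivalent of the Java interface above.
type RequestProcessor interface {
	ProcessRequest(r Request) error
}

// prepProcessor does some per-request work, then hands off to the next processor.
type prepProcessor struct {
	next RequestProcessor
}

func (p *prepProcessor) ProcessRequest(r Request) error {
	fmt.Println("prep:", r.Op)
	return p.next.ProcessRequest(r)
}

// finalProcessor ends the chain.
type finalProcessor struct{}

func (f *finalProcessor) ProcessRequest(r Request) error {
	fmt.Println("final:", r.Op)
	return nil
}

func main() {
	// Each server role wires its own chain; callers only ever see firstProcessor.
	firstProcessor := &prepProcessor{next: &finalProcessor{}}
	// submitRequest(req) then boils down to:
	_ = firstProcessor.ProcessRequest(Request{ID: 1, Op: "create /foo"})
}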


kafka redesign

Goals

  • support many topics
    • needle in haystack
  • IO optimization
    • R/W isolation
    • index file leads to random sync write

apache bookkeeper

Features

  • no topic/partition concept: a stream is composed of multiple ledgers, and each ledger is bounded
    createLedger(int ensSize, int writeQuorumSize, int ackQuorumSize)
  • a ledger has only an integer id, no name
  • every entry (log record) has a unique int64 id
  • striped write: entries are interleaved across bookies (see the sketch below)
  • there is no explicit master/slave relationship among the bookie storage nodes
  • shared WAL
  • single writer
  • quorum-vote replication; read consistency is ensured via the LastAddConfirmedId stored in ZooKeeper
  • a bookie does not communicate with other bookies; the client performs the concurrent broadcast/quorum writes

VS Kafka

(diagram: comparison with Kafka)

At createLedger time the client decides the placement (org.apache.bookkeeper.client.EnsemblePlacementPolicy) and stores it in ZooKeeper.
For example: 5 bookies, createLedger(3, 3, 2).
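With createLedger(3, 3, 2) every entry lands on all three ensemble members; striping only shows up when writeQuorumSize is smaller than the ensemble size. A sketch of the round-robin write-set rule as I understand it (illustrative, not BookKeeper's actual code):

package main

import "fmt"

// writeSet returns the ensemble positions that store entry entryID,
// following a round-robin striping rule: start at entryID % ensSize and
// take writeQuorum consecutive bookies, wrapping around.
func writeSet(entryID int64, ensSize, writeQuorum int) []int {
	set := make([]int, 0, writeQuorum)
	start := int(entryID) % ensSize
	for j := 0; j < writeQuorum; j++ {
		set = append(set, (start+j)%ensSize)
	}
	return set
}

func main() {
	// Ensemble of 5 bookies, each entry replicated to 3 of them:
	// consecutive entries land on different, overlapping subsets.
	for e := int64(0); e < 5; e++ {
		fmt.Printf("entry %d -> bookies %v\n", e, writeSet(e, 5, 3))
	}
}

Running it prints entry 0 -> bookies [0 1 2], entry 1 -> bookies [1 2 3], and so on, which is the interleaving the "striped write" bullet refers to.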

IO Model


  • reads and writes are separated
  • Disk 1: Journal (WAL) Device
    • {timestamp}.txn
  • Disk 2: Ledger Device
    • data is stored across multiple ledger directories
    • LastLogMark marks the point up to which index + data have been persisted to the Ledger Device; WAL entries before it can be deleted
    • writes are asynchronous
    • and sequential
      • all active ledgers share a single entry logger
      • reads are served via the ledger index/cache
  • [Disk 3]: Index Device
    • by default Disk 2 and Disk 3 are colocated
  • the bookie can ack the client as soon as the write lands in the memtable

I/O is split into 4 types, each optimized separately:

  • sync sequential write: shared WAL
  • async random write: group commit from Memtable
  • tail read: from Memtable
  • random read: from (index + os pagecache)

References

https://github.com/ivankelly/bookkeeper-tutorial
https://github.com/twitter/DistributedLog


SSD

Primer

Physical unit of flash memory

  • Page
    unit for read & write
  • Block
    unit for erase

Physical characteristics

  • Erase before re-write
  • Sequential write within a block

Cost: 17-32x more expensive per GB than disk


Optimal I/O for SSD

  • the larger the I/O request size, the better
  • match the physical characteristics:
    • page- or block-aligned I/O
    • segmented sequential writes within a block (see the O_DIRECT sketch below)
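A Linux-only sketch of what aligned, segmented sequential writes look like in practice with O_DIRECT (the 4 KiB page and 512 KiB erase-block sizes are assumptions; real devices vary, and the target filesystem must support O_DIRECT):

package main

import (
	"log"
	"os"
	"syscall"
	"unsafe"
)

const (
	pageSize  = 4096       // flash page: unit of read/write (assumed)
	blockSize = 512 * 1024 // flash block: unit of erase (assumed)
)

// alignedBuf returns an n-byte slice whose address is pageSize-aligned,
// which O_DIRECT requires for the buffer, the length, and the file offset.
func alignedBuf(n int) []byte {
	raw := make([]byte, n+pageSize)
	off := pageSize - int(uintptr(unsafe.Pointer(&raw[0]))%pageSize)
	return raw[off : off+n]
}

func main() {
	// O_DIRECT bypasses the page cache, so every request must respect the
	// device's alignment; tmpfs, for example, does not support O_DIRECT.
	const path = "odirect-test.dat"
	fd, err := syscall.Open(path, syscall.O_WRONLY|syscall.O_CREAT|syscall.O_DIRECT, 0644)
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(path)
	defer syscall.Close(fd)

	buf := alignedBuf(blockSize)
	// Segmented sequential writes: one erase-block-sized, block-aligned request
	// at a time, at strictly increasing offsets.
	for off := int64(0); off < 4*blockSize; off += blockSize {
		if _, err := syscall.Pwrite(fd, buf, off); err != nil {
			log.Fatal(err)
		}
	}
}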

http://codecapsule.com/2014/02/12/coding-for-ssds-part-6-a-summary-what-every-programmer-should-know-about-solid-state-drives/
http://www.open-open.com/lib/view/open1423106687217.html


oklog

The ingester handles the write-optimized side (WAL) and leaves read optimization to the storage nodes.
RocketMQ is similar: CQRS.

  • ingester = commit log
  • storage node = consume log

The difference: the storage nodes replicate from the ingesters via pull-mode replication and can live on different machines,
whereas RocketMQ's commit log and consume log sit on the same broker.

  • Kafka couples reads and writes, so they cannot scale independently
  • CQRS decouples reads and writes, so each side can scale independently

produce

Producers connect to multiple ingesters through a forwarder; the ingesters balance load among themselves via gossip, and an overloaded ingester negotiates with the forwarder to redirect traffic elsewhere.

query

The query path is a scatter-gather across the storage nodes (see the sketch below).
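A generic scatter-gather sketch in Go (names and types are illustrative, not oklog's actual API): the query fans out to every storage node concurrently and the results are merged before returning.

package main

import (
	"context"
	"fmt"
	"strings"
	"sync"
	"time"
)

// queryNode represents one storage node; in a real system this would be an RPC or HTTP call.
type queryNode func(ctx context.Context, q string) []string

// scatterGather fans the query out to every node, then merges the results.
func scatterGather(ctx context.Context, q string, nodes []queryNode) []string {
	var (
		mu     sync.Mutex
		merged []string
		wg     sync.WaitGroup
	)
	for _, n := range nodes {
		wg.Add(1)
		go func(n queryNode) {
			defer wg.Done()
			recs := n(ctx, q) // scatter
			mu.Lock()
			merged = append(merged, recs...) // gather
			mu.Unlock()
		}(n)
	}
	wg.Wait()
	return merged
}

func main() {
	node := func(name string) queryNode {
		return func(ctx context.Context, q string) []string {
			return []string{name + ": match for " + q}
		}
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	out := scatterGather(ctx, "error", []queryNode{node("store-1"), node("store-2")})
	fmt.Println(strings.Join(out, "\n"))
}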
