Database as a Service

Trove

https://wiki.openstack.org/wiki/Trove


DB Schema Management

https://github.com/skeema/skeema

Manage the schema of dev/test/staging/prod environments from a single place.

$ skeema
Skeema is a MySQL schema management tool. It allows you to export a database
schema to the filesystem, and apply online schema changes by modifying files.

Usage:
      skeema [<options>] <command>

Commands:
      add-environment  Add a new named environment to an existing host directory
      diff             Compare a DB instance's schemas and tables to the filesystem
      help             Display usage information
      init             Save a DB instance's schemas and tables to the filesystem
      lint             Verify table files and reformat them in a standardized way
      pull             Update the filesystem representation of schemas and tables
      push             Alter tables on DBs to reflect the filesystem representation
      version          Display program version

Its design for online ALTER TABLE on MySQL is different: Skeema works at a higher level, and the actual migration still relies on an external online schema change (OSC) tool underneath:

alter-wrapper="/usr/local/bin/pt-online-schema-change --execute --alter {CLAUSES} D={SCHEMA},t={TABLE},h={HOST},P={PORT},u={USER},p={PASSWORDX}"

References

https://www.percona.com/live/17/sessions/automatic-mysql-schema-management-skeema


Cassandra vs ScyllaDB

Cassandra

The Cassandra project was born at Facebook. It was originally developed by two Facebook engineers, one of whom had been directly involved in building Amazon's Dynamo; people from the team later moved to Amazon and worked on another NoSQL database, DynamoDB.

The Dynamo paper was published in 2007; Dynamo backed Amazon's shopping cart.
Cassandra was open-sourced by Facebook in 2008; it backed inbox search.
Uber now runs the world's largest Cassandra data center.

Features

  • CQL(Cassandra Query Language)
  • 1,000 node clusters
  • multi-data center
  • out-of-the-box replication
  • ecosystem with Spark

Internals

  • LSM Tree
  • Gossip P2P
  • DHT
  • tunable consistency, same model as DynamoDB (see the gocql sketch after this list)
    • ONE
    • QUORUM
    • ALL
    • read repair
  • Thrift
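To make the tunable consistency levels above concrete, here is a minimal sketch using the gocql Go driver (the driver choice, contact point, keyspace, and table are my own assumptions, not something the post specifies): write at QUORUM, read at ONE, and let read repair reconcile stale replicas in the background.

package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Hypothetical contact point and keyspace.
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "demo"
	cluster.Consistency = gocql.Quorum // default consistency for this session

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Write at QUORUM: a majority of replicas must ack the write.
	if err := session.Query(
		`INSERT INTO users (id, name) VALUES (?, ?)`, 1, "alice",
	).Exec(); err != nil {
		log.Fatal(err)
	}

	// Read at ONE: any single replica may answer; read repair fixes stale replicas later.
	var name string
	if err := session.Query(`SELECT name FROM users WHERE id = ?`, 1).
		Consistency(gocql.One).Scan(&name); err != nil {
		log.Fatal(err)
	}
	fmt.Println(name)
}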

ScyllaDB

A Cassandra (Java) clone written in C++ by core KVM developers, with roughly 10x higher single-node performance. The main reasons:

  • DPDK, bypassing the kernel
  • O_DIRECT I/O, bypassing the page cache; the cache is managed by Scylla itself
    • the page cache can only hold data in on-disk (SSTable) format, whereas an application-level cache is more effective and more compact
    • during compaction the page cache becomes a liability, since it can evict a lot of hot data
  • treat a single node as a share-nothing cluster of CPU cores
  • shard at the CPU core level instead of the node level (see the sketch after this list)
    makes fuller use of multiple cores, reduces contention, exploits the CPU caches, NUMA friendly
  • when data has to cross cores, use explicit message passing
  • avoid JVM GC
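A toy sketch of the shard-per-core idea in Go (purely illustrative; Scylla itself is built in C++ on the Seastar framework): each shard exclusively owns a slice of the key space, and other code reaches it only through explicit message passing, never shared memory or locks.

package main

import (
	"fmt"
	"hash/fnv"
	"runtime"
)

// request is the only way to touch a shard's data: no shared memory, no locks.
type request struct {
	op    string // "put" or "get"
	key   string
	value string
	reply chan string
}

// shard owns its partition of the key space; only its own goroutine reads or writes data.
type shard struct {
	data  map[string]string
	inbox chan request
}

func (s *shard) run() {
	for req := range s.inbox {
		switch req.op {
		case "put":
			s.data[req.key] = req.value
			req.reply <- "ok"
		case "get":
			req.reply <- s.data[req.key]
		}
	}
}

// shardFor hashes a key to its owning shard, the same idea as sharding by core.
func shardFor(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % n
}

func main() {
	n := runtime.NumCPU() // one shard per core
	shards := make([]*shard, n)
	for i := range shards {
		shards[i] = &shard{data: map[string]string{}, inbox: make(chan request)}
		go shards[i].run()
	}

	// Both operations are explicit messages to the owning shard.
	put := request{op: "put", key: "foo", value: "bar", reply: make(chan string)}
	shards[shardFor("foo", n)].inbox <- put
	<-put.reply

	get := request{op: "get", key: "foo", reply: make(chan string)}
	shards[shardFor("foo", n)].inbox <- get
	fmt.Println(<-get.reply) // "bar"
}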

References

https://db-engines.com/en/ranking
https://github.com/scylladb/scylla
http://www.scylladb.com/
http://www.seastar-project.org/
https://www.reddit.com/r/programming/comments/3lzz56/scylladb_cassandra_rewritten_in_c_claims_to_be_up/
https://news.ycombinator.com/item?id=10262719


etcd3 vs zookeeper

Features unique to etcd v3

  • get and watch by prefix, by interval
  • lease based TTL for key sets
  • runtime reconfiguration
  • point in time backup
  • extensive metrics
  • access to historical versions of keys (this is very useful)
    multi-version
  • mini transaction DSL (see the clientv3 sketch after this list)

    Tx.If(
        Compare(Value("foo"), ">", "bar"),
        Compare(Version("foo"), "=", 2),
        ...
    ).Then(
        Put("ok", "true")...
    ).Else(
        Put("ok", "false")...
    ).Commit()
  • leases

    l = CreateLeases(15*second)
    Put(foo, bar, l)
    l.KeepAlive()
    l.Revoke()
  • rich watcher functionality

    • streaming watch
    • supports an index/revision parameter, so no events are lost
    • recursive
  • off-heap
    only the index is kept in memory; most of the data is mmap'ed from the boltdb file
  • incremental snapshot
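A minimal sketch of how the transaction, lease, and watch features above look with the Go clientv3 API (the endpoint, keys, and exact import path are assumptions and vary with the etcd version):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3" // older releases: github.com/coreos/etcd/clientv3
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Mini transaction: compare on the current value, then Put on either branch.
	txnResp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value("foo"), ">", "bar")).
		Then(clientv3.OpPut("ok", "true")).
		Else(clientv3.OpPut("ok", "false")).
		Commit()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("txn succeeded:", txnResp.Succeeded)

	// Lease-based TTL: attach a 15s lease to a key and keep it alive.
	lease, err := cli.Grant(ctx, 15)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(ctx, "foo", "bar", clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}
	// Real code should drain the channel returned by KeepAlive.
	if _, err := cli.KeepAlive(ctx, lease.ID); err != nil {
		log.Fatal(err)
	}

	// Watch by prefix from a specific revision, so no events are lost.
	for wresp := range cli.Watch(ctx, "foo", clientv3.WithPrefix(), clientv3.WithRev(txnResp.Header.Revision)) {
		for _, ev := range wresp.Events {
			fmt.Printf("%s %s=%s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}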

Features unique to ZooKeeper

  • ephemeral znode
  • non-blocking full fuzzy snapshot
    (logs "Too busy to snap, skipping" when a snapshot is already in progress)
  • the key count can scale into the millions
  • on-heap

etcd2

  • etcd2 handles keys on the order of 10K, while etcd3 handles on the order of 1M
    • the limiting factor is snapshot cost, which can drive QPS to 0 and even trigger re-election

comparison

memory footprint

Storing 2M keys of 256 bytes each:

etcd2   10GB
zk      2.4GB
etcd3   0.8GB

References

https://coreos.com/blog/performance-of-etcd.html


2017 kafka report

The survey covered 388 organizations (companies) across 47 countries.
26% of respondents have annual revenue above $1 billion.
15% of respondents process 1 billion messages per day.
43% of respondents run Kafka in a public cloud; of those, 60% use AWS.


References

https://www.confluent.io/wp-content/uploads/2017-Apache-Kafka-Report.pdf


zookeeper processor

Chain of Responsibility
To keep the code structure highly uniform across the different server roles, each role gets its own processor chain.

interface RequestProcessor {
    void processRequest(Request request) throws RequestProcessorException;
}

LeaderZooKeeperServer.java
(diagram: the Leader's processor chain)

FollowerZooKeeperServer.java
(diagram: the Follower's processor chain)

ZooKeeperServer.java

// ZooKeeperServer.java (simplified)
void processPacket(ServerCnxn cnxn, ByteBuffer incomingBuffer) {
    // ... decode the packet into a Request ...
    submitRequest(request);
}

void submitRequest(Request request) {
    // every role simply hands the request to the head of its own chain
    firstProcessor.processRequest(request);
}
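The same chain-of-responsibility wiring, sketched in Go for brevity (illustrative only; the real processors are Java classes, and each server role assembles a different chain before exposing it as firstProcessor):

package main

import "fmt"

// Request mirrors the argument of RequestProcessor.processRequest.
type Request struct {
	ID int
	Op string
}

// RequestProcessor is the Go equivalent of the Java interface above.
type RequestProcessor interface {
	ProcessRequest(r Request) error
}

// prepProcessor does some per-request work, then hands off to the next processor.
type prepProcessor struct {
	next RequestProcessor
}

func (p *prepProcessor) ProcessRequest(r Request) error {
	fmt.Println("prep:", r.Op)
	return p.next.ProcessRequest(r)
}

// finalProcessor ends the chain.
type finalProcessor struct{}

func (f *finalProcessor) ProcessRequest(r Request) error {
	fmt.Println("final:", r.Op)
	return nil
}

func main() {
	// Each server role wires its own chain; callers only ever see firstProcessor.
	firstProcessor := &prepProcessor{next: &finalProcessor{}}
	// submitRequest(req) then boils down to:
	_ = firstProcessor.ProcessRequest(Request{ID: 1, Op: "create /foo"})
}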


kafka redesign

Goals

  • support many topics
    • needle in haystack
  • IO optimization
    • R/W isolation
    • index file leads to random sync write

apache bookkeeper

Features

  • no topic/partition concept: a stream is composed of multiple ledgers, and each ledger is bounded
    createLedger(int ensSize, int writeQuorumSize, int ackQuorumSize)
  • a ledger has only an integer id, no name
  • every entry (log record) has a unique int64 id
  • striped write: entries are interleaved across bookies (see the sketch below)
  • there is no explicit master/slave relationship among the bookie storage nodes
  • shared WAL
  • single writer
  • quorum-vote replication; read consistency is ensured via the LastAddConfirmedId stored in ZooKeeper
  • a bookie does not communicate with other bookies; the client performs the concurrent broadcast/quorum writes

VS Kafka

(diagram: comparison with Kafka)

At createLedger time the client decides the placement (org.apache.bookkeeper.client.EnsemblePlacementPolicy) and stores it in ZooKeeper.
For example: 5 bookies, createLedger(3, 3, 2).
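With createLedger(3, 3, 2) every entry lands on all three ensemble members; striping only shows up when writeQuorumSize is smaller than the ensemble size. A sketch of the round-robin write-set rule as I understand it (illustrative, not BookKeeper's actual code):

package main

import "fmt"

// writeSet returns the ensemble positions that store entry entryID,
// following a round-robin striping rule: start at entryID % ensSize and
// take writeQuorum consecutive bookies, wrapping around.
func writeSet(entryID int64, ensSize, writeQuorum int) []int {
	set := make([]int, 0, writeQuorum)
	start := int(entryID) % ensSize
	for j := 0; j < writeQuorum; j++ {
		set = append(set, (start+j)%ensSize)
	}
	return set
}

func main() {
	// Ensemble of 5 bookies, each entry replicated to 3 of them:
	// consecutive entries land on different, overlapping subsets.
	for e := int64(0); e < 5; e++ {
		fmt.Printf("entry %d -> bookies %v\n", e, writeSet(e, 5, 3))
	}
}

Running it prints entry 0 -> bookies [0 1 2], entry 1 -> bookies [1 2 3], and so on, which is the interleaving the "striped write" bullet refers to.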

IO Model


  • reads and writes are separated
  • Disk 1: Journal (WAL) Device
    • {timestamp}.txn
  • Disk 2: Ledger Device
    • data is stored across multiple ledger directories
    • LastLogMark marks the point up to which index + data have been persisted to the Ledger Device; WAL entries before it can be deleted
    • writes are asynchronous
    • and sequential
      • all active ledgers share a single entry logger
      • reads are served via the ledger index/cache
  • [Disk 3]: Index Device
    • by default Disk 2 and Disk 3 are colocated
  • the bookie can ack the client as soon as the write lands in the memtable

I/O is split into 4 types, each optimized separately:

  • sync sequential write: shared WAL
  • async random write: group commit from Memtable
  • tail read: from Memtable
  • random read: from (index + os pagecache)

References

https://github.com/ivankelly/bookkeeper-tutorial
https://github.com/twitter/DistributedLog


SSD

Primer

Physical unit of flash memory

  • Page
    unit for read & write
  • Block
    unit for erase

Physical characteristics

  • Erase before re-write
  • Sequential write within a block

Cost: 17-32x more expensive per GB than disk


Optimal I/O for SSD

  • the larger the I/O request size, the better
  • match the physical characteristics:
    • page- or block-aligned I/O
    • segmented sequential writes within a block (see the O_DIRECT sketch below)
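A Linux-only sketch of what aligned, segmented sequential writes look like in practice with O_DIRECT (the 4 KiB page and 512 KiB erase-block sizes are assumptions; real devices vary, and the target filesystem must support O_DIRECT):

package main

import (
	"log"
	"os"
	"syscall"
	"unsafe"
)

const (
	pageSize  = 4096       // flash page: unit of read/write (assumed)
	blockSize = 512 * 1024 // flash block: unit of erase (assumed)
)

// alignedBuf returns an n-byte slice whose address is pageSize-aligned,
// which O_DIRECT requires for the buffer, the length, and the file offset.
func alignedBuf(n int) []byte {
	raw := make([]byte, n+pageSize)
	off := pageSize - int(uintptr(unsafe.Pointer(&raw[0]))%pageSize)
	return raw[off : off+n]
}

func main() {
	// O_DIRECT bypasses the page cache, so every request must respect the
	// device's alignment; tmpfs, for example, does not support O_DIRECT.
	const path = "odirect-test.dat"
	fd, err := syscall.Open(path, syscall.O_WRONLY|syscall.O_CREAT|syscall.O_DIRECT, 0644)
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(path)
	defer syscall.Close(fd)

	buf := alignedBuf(blockSize)
	// Segmented sequential writes: one erase-block-sized, block-aligned request
	// at a time, at strictly increasing offsets.
	for off := int64(0); off < 4*blockSize; off += blockSize {
		if _, err := syscall.Pwrite(fd, buf, off); err != nil {
			log.Fatal(err)
		}
	}
}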

http://codecapsule.com/2014/02/12/coding-for-ssds-part-6-a-summary-what-every-programmer-should-know-about-solid-state-drives/
http://www.open-open.com/lib/view/open1423106687217.html


oklog

The ingester handles the write-optimized side (WAL) and leaves read optimization to the storage nodes.
RocketMQ is similar: CQRS.

  • ingester = commit log
  • storage node = consume log

The difference: the storage nodes replicate from the ingesters via pull-mode replication and can live on different machines,
whereas RocketMQ's commit log and consume log sit on the same broker.

  • Kafka couples reads and writes, so they cannot scale independently
  • CQRS decouples reads and writes, so each side can scale independently

produce

Producers connect to multiple ingesters through a forwarder; the ingesters balance load among themselves via gossip, and an overloaded ingester negotiates with the forwarder to redirect traffic elsewhere.

query

The query path is a scatter-gather across the storage nodes (see the sketch below).
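A generic scatter-gather sketch in Go (names and types are illustrative, not oklog's actual API): the query fans out to every storage node concurrently and the results are merged before returning.

package main

import (
	"context"
	"fmt"
	"strings"
	"sync"
	"time"
)

// queryNode represents one storage node; in a real system this would be an RPC or HTTP call.
type queryNode func(ctx context.Context, q string) []string

// scatterGather fans the query out to every node, then merges the results.
func scatterGather(ctx context.Context, q string, nodes []queryNode) []string {
	var (
		mu     sync.Mutex
		merged []string
		wg     sync.WaitGroup
	)
	for _, n := range nodes {
		wg.Add(1)
		go func(n queryNode) {
			defer wg.Done()
			recs := n(ctx, q) // scatter
			mu.Lock()
			merged = append(merged, recs...) // gather
			mu.Unlock()
		}(n)
	}
	wg.Wait()
	return merged
}

func main() {
	node := func(name string) queryNode {
		return func(ctx context.Context, q string) []string {
			return []string{name + ": match for " + q}
		}
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	out := scatterGather(ctx, "error", []queryNode{node("store-1"), node("store-2")})
	fmt.Println(strings.Join(out, "\n"))
}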
