2017 kafka report

A survey of 388 organizations (companies) from 47 countries
26% of respondents have annual revenue above US$1 billion
15% of respondents process 1 billion+ messages per day
43% of respondents run kafka on a public cloud, 60% of them on AWS


References

https://www.confluent.io/wp-content/uploads/2017-Apache-Kafka-Report.pdf


zookeeper processor

Chain of Responsibility
To keep the code structure highly uniform across the different server types, each server role gets its own processor chain

interface RequestProcessor {
    void processRequest(Request request) throws RequestProcessorException;
}

LeaderZooKeeperServer.java: (figure: the Leader's processor chain)

FollowerZooKeeperServer.java: (figure: the Follower's processor chain)

ZooKeeperServer.java

void processPacket(...) {
    submitRequest(req);
}

void submitRequest(Request req) {
    firstProcessor.processRequest(req);
}


kafka redesign

Goals

  • support many topics
    • needle in haystack
  • IO optimization
    • R/W isolation
    • index file leads to random sync write

apache bookkeeper

Features

  • There is no topic/partition concept; a stream consists of multiple ledgers, and each ledger is bounded
    createLedger(int ensSize, int writeQuorumSize, int ackQuorumSize)
  • A ledger has only an int id, no name
  • Every entry (log record) has a unique int64 id
  • striped write: entries are interleaved across bookies
  • There is no fixed master/slave relationship among the storage nodes (bookies)
  • shared WAL
  • single writer
  • Quorum-vote replication; read consistency is ensured via the LastAddConfirmedId stored in zk
  • bookie does not communicate with other bookies; the client performs the concurrent broadcast/quorum writes

VS Kafka


At createLedger time, the client decides the placement (org.apache.bookkeeper.client.EnsemblePlacementPolicy) and stores it in zookeeper
For example, with 5 bookies: createLedger(3, 3, 2)
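The striped write can be illustrated with a small sketch; the round-robin write-set formula below is an assumption modeled on BookKeeper's round-robin distribution schedule, not its exact code:

```java
import java.util.ArrayList;
import java.util.List;

public class Striping {
    // Which ensemble positions store entry `entryId`, for a ledger created
    // with createLedger(ensSize, writeQuorumSize, ackQuorumSize)?
    // Round-robin: entry e starts at position e % ensSize.
    static List<Integer> writeSet(long entryId, int ensSize, int writeQuorumSize) {
        List<Integer> bookies = new ArrayList<>();
        for (int i = 0; i < writeQuorumSize; i++) {
            bookies.add((int) ((entryId + i) % ensSize));
        }
        return bookies;
    }

    public static void main(String[] args) {
        // createLedger(3, 3, 2): ensemble of 3, every entry goes to all 3,
        // ack after 2 -- no striping, since writeQuorumSize == ensSize.
        System.out.println(writeSet(0, 3, 3)); // [0, 1, 2]
        // With an ensemble of 5 and write quorum 3, entries stripe:
        System.out.println(writeSet(0, 5, 3)); // [0, 1, 2]
        System.out.println(writeSet(1, 5, 3)); // [1, 2, 3]
    }
}
```

Note that striping only happens when writeQuorumSize < ensSize; with createLedger(3, 3, 2) each entry is replicated to the whole ensemble.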

IO Model


  • reads and writes are isolated
  • Disk 1: Journal (WAL) Device
    • {timestamp}.txn
  • Disk 2: Ledger Device
    • data is spread across multiple ledger directories
    • LastLogMark marks the point before which all index+data has been persisted to the Ledger Device; WAL segments before it can be deleted
    • writes are asynchronous
    • and sequential
      • all active ledgers share one entry logger
      • reads go through the ledger index/cache
  • [Disk 3]: Index Device
    • by default Disk 2 and Disk 3 share the same device
  • the client can be acked as soon as the entry is written to the Memtable

The IO is split into 4 types, each optimized separately

  • sync sequential write: shared WAL
  • async random write: group commit from Memtable
  • tail read: from Memtable
  • random read: from (index + os pagecache)
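A toy sketch of the write path described above; the class and its fields are invented for illustration (a real bookie fsyncs the journal and indexes the entry log), but it shows how the four IO types fall out of the design:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class BookieWritePath {
    final StringBuilder journal = new StringBuilder();  // Disk 1: shared WAL
    final StringBuilder entryLog = new StringBuilder(); // Disk 2: shared entry log
    final Map<Long, String> memtable = new ConcurrentSkipListMap<>();
    long lastLogMark = -1; // everything <= this id is safely in the entry log

    // Write path: sync sequential append to the WAL, insert into the
    // memtable, then ack the client.
    boolean addEntry(long entryId, String data) {
        journal.append(entryId).append(':').append(data).append('\n');
        memtable.put(entryId, data);
        return true; // ack
    }

    // Background flush: group-commit the memtable into the entry log, so the
    // async random writes become one sequential write; then advance
    // LastLogMark so older WAL segments can be deleted.
    void flush() {
        for (Map.Entry<Long, String> e : memtable.entrySet()) {
            entryLog.append(e.getKey()).append(':').append(e.getValue()).append('\n');
            lastLogMark = Math.max(lastLogMark, e.getKey());
        }
        memtable.clear();
    }

    // Tail reads are served straight from the memtable; random reads would
    // fall back to the entry log via the index (elided here).
    String tailRead(long entryId) {
        return memtable.get(entryId);
    }
}
```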

References

https://github.com/ivankelly/bookkeeper-tutorial
https://github.com/twitter/DistributedLog


SSD

Primer

Physical unit of flash memory

  • Page
    unit for read & write
  • Block
    unit for erase

Physical characteristics

  • Erase before re-write
  • Sequential write within a block

Cost: 17-32x more expensive per GB than disk


Optimal I/O for SSD

  • The larger the I/O request size, the better
  • Conform to the physical characteristics
    • page- or block-aligned
    • segmented sequential write within a block
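Page alignment comes down to simple bit arithmetic when the page size is a power of two; the 4 KiB figure below is an assumption, real flash page sizes vary by device:

```java
public class Align {
    static final long PAGE = 4096; // assumed page size; device-specific in reality

    // Round an offset down to the start of its page.
    static long alignDown(long offset) { return offset & ~(PAGE - 1); }

    // Round a size up to a whole number of pages.
    static long alignUp(long size) { return (size + PAGE - 1) & ~(PAGE - 1); }

    public static void main(String[] args) {
        System.out.println(alignDown(5000)); // 4096
        System.out.println(alignUp(5000));   // 8192
        System.out.println(alignUp(4096));   // 4096 (already aligned)
    }
}
```

Issuing writes at these boundaries avoids read-modify-write cycles inside the SSD.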

http://codecapsule.com/2014/02/12/coding-for-ssds-part-6-a-summary-what-every-programmer-should-know-about-solid-state-drives/
http://www.open-open.com/lib/view/open1423106687217.html


oklog

The injecter handles the write-optimized path (WAL), and the storage nodes handle the read-optimized path
RocketMQ is similar: CQRS

  • injecter = commit log
  • storage node = consume log

The difference: the storage nodes are fed by pull-mode replication and can live on different machines from the injecters
In RocketMQ, the commit log and consume log sit on the same broker

  • kafka couples R/W, so they cannot scale independently
  • CQRS decouples R/W, so they can scale independently

produce

Producers connect through a forwarder to multiple injecters; the injecters gossip with each other for load balancing, and a heavily loaded injecter negotiates with the forwarder to redirect the distribution

query

scatter-gather
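The scatter-gather query can be sketched as follows; StorageNode and its query method are hypothetical types for illustration, not oklog's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScatterGather {
    interface StorageNode {
        List<String> query(String q); // each node returns its matching records
    }

    static List<String> run(List<StorageNode> nodes, String q)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(nodes.size());
        // Scatter: send the query to every storage node in parallel.
        List<Future<List<String>>> futures = new ArrayList<>();
        for (StorageNode n : nodes) {
            futures.add(pool.submit(() -> n.query(q)));
        }
        // Gather: collect partial results, then merge into one ordered view.
        List<String> merged = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            merged.addAll(f.get());
        }
        pool.shutdown();
        merged.sort(null); // e.g. records keyed by a timestamp prefix
        return merged;
    }
}
```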


db trigger

Drawbacks of triggers

  • how do you monitor them
  • how do you version-control the code
  • testing
  • deployment
  • performance overhead
  • multi-tenancy
  • resource isolation
  • they cannot be released frequently, so how do you keep up with fast-changing requirements

materialized view

A materialized view can be understood as a cache of query results, a derived result
"Heterogeneous table" might be an even more fitting name


Unlike a regular view, it physically exists, and the database keeps it consistent with the base tables
It can be rebuilt from the source store at any time, and applications never update it: readonly

MySQL does not provide this feature, but dbus makes it easy to construct a materialized view
PostgreSQL provides materialized views

References

https://docs.microsoft.com/en-us/azure/architecture/patterns/materialized-view


cache invalidation

// read data
val = cache.get(key)
if val == nil {
    val = db.get(key)
    cache.put(key, val)
}
return val

// write data
db.put(key, val)
cache.put(key, val)

This causes a dual-write conflict: the db write and the cache write can interleave across concurrent writers, leaving stale data in the cache

If eventual consistency is all you need, driving cache invalidation from dbus is the most effective approach
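A sketch of that invalidation-driven approach, with the dbus wiring elided and all names invented for illustration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StreamInvalidation {
    final Map<String, String> cache = new ConcurrentHashMap<>();

    // Called for every change event emitted by the db commit log (via dbus).
    // Invalidate rather than update: a delete cannot race a stale value in.
    void onChange(String changedKey) {
        cache.remove(changedKey);
    }

    // The read path stays cache-aside: on a miss, load from db and fill.
    String get(String key, Map<String, String> db) {
        return cache.computeIfAbsent(key, db::get);
    }
}
```

The application now writes only the db; the change-stream consumer owns the cache, which removes the dual write entirely.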

https://martinfowler.com/bliki/TwoHardThings.html


Linkedin Espresso

What

Distributed Document Store

  • RESTful API
  • MySQL as the storage engine
  • Helix for cluster management
  • Databus asynchronously replicates the commit log across data centers
  • Schemas are stored in zookeeper; schema evolution is achieved via Avro's compatibility rules

References

https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store

