2 phase commit failures

Node Failure Models

  • fail-stop
    crash and never recover
  • fail-recover
    crash and later recover
  • byzantine failure
    behave arbitrarily, possibly sending conflicting or malicious messages

Cases

2 phase commit with n nodes requires 3n message exchanges: n proposals, n votes, and n commit/abort decisions
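
The 3n count can be checked with a minimal in-memory sketch of the failure-free path. The names Participant and two_phase_commit are hypothetical, chosen only for illustration, and acknowledgements are assumed not to be counted:

```python
from dataclasses import dataclass


@dataclass
class Participant:
    # Hypothetical participant; will_commit decides how it votes.
    name: str
    will_commit: bool = True
    state: str = "init"

    def on_proposal(self) -> bool:
        self.state = "voted_commit" if self.will_commit else "voted_abort"
        return self.will_commit

    def on_decision(self, commit: bool) -> None:
        self.state = "committed" if commit else "aborted"


def two_phase_commit(participants):
    """Run a failure-free 2PC round and count the messages exchanged."""
    messages = 0
    votes = []
    # Phase 1: one proposal and one vote per participant.
    for p in participants:
        messages += 1  # coordinator -> participant: proposal
        votes.append(p.on_proposal())
        messages += 1  # participant -> coordinator: vote
    decision = all(votes)
    # Phase 2: one commit/abort decision per participant.
    for p in participants:
        messages += 1  # coordinator -> participant: commit/abort
        p.on_decision(decision)
    return decision, messages
```

With three participants this sketch exchanges 9 messages whether the outcome is commit or abort.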

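  • The coordinator crashes after sending the proposal

    • Some nodes received the proposal, others did not
    • A node that received the proposal is blocked forever, since it may already have voted commit
      It cannot simply time out and abort, because the coordinator may recover at any time and start the phase 2 commit
      The txn stays blocked on the coordinator and cannot make any progress
    • Solution
      Introduce a watchdog for the coordinator that takes over once it detects that the coordinator has crashed
      Phase 1. Ask every participant whether it has voted commit, voted abort, or not voted yet
      Phase 2. Notify every participant to Commit/Abort
      This still has a limitation: if a participant has crashed, Phase 1 cannot be completed
  • Worse case
    The coordinator is itself also a participant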
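
The watchdog's Phase 1 decision can be sketched as a small function. This is only one possible rule, not a definitive recovery protocol: Vote and watchdog_decide are hypothetical names, and an unreachable (crashed) participant is assumed to be reported as None. The sketch aborts as soon as any reachable participant did not vote commit, because the coordinator could not have decided commit in that case; if every reachable participant voted commit but one is unreachable, the outcome cannot be determined.

```python
from enum import Enum
from typing import Mapping, Optional


class Vote(Enum):
    COMMIT = "vote_commit"
    ABORT = "vote_abort"
    NONE = "no_vote"


def watchdog_decide(replies: Mapping[str, Optional[Vote]]) -> str:
    """Decide the outcome from Phase 1 replies.

    replies maps a participant name to its reported vote,
    or to None when the participant is unreachable (assumption).
    """
    reachable = [v for v in replies.values() if v is not None]
    # Any abort or missing vote means commit was never decided: abort is safe.
    if any(v is not Vote.COMMIT for v in reachable):
        return "abort"
    # Everyone we reached voted commit, but a crashed node's vote is unknown.
    if len(reachable) < len(replies):
        return "blocked"
    return "commit"
```

For example, replies of commit from p1 and None from p2 yield "blocked", which is the limitation noted in the solution above.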

3PC

3PC adds a prepare to commit phase between the propose and commit phases

  • If the coordinator crashes during the proposal or prepare to commit phase
    the trx is aborted
  • If the coordinator crashes during the commit phase
    nodes will time out waiting for the commit message and commit the trx
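
The two timeout rules can be sketched from a participant's point of view. The state names and on_coordinator_timeout are hypothetical, and the sketch assumes a participant only waits for the commit message after it has received prepare to commit:

```python
from enum import Enum


class State(Enum):
    WAITING_FOR_PROPOSAL = "waiting_for_proposal"
    WAITING_FOR_PREPARE = "waiting_for_prepare_to_commit"  # already voted
    WAITING_FOR_COMMIT = "waiting_for_commit"  # received prepare to commit


def on_coordinator_timeout(state: State) -> str:
    """Return what a participant does when the coordinator stops responding."""
    if state is State.WAITING_FOR_COMMIT:
        # Prepare to commit was received, so the decision was commit.
        return "commit"
    # Crash during the proposal or prepare to commit phase: abort.
    return "abort"
```

Unlike 2PC, no state in this sketch leaves the participant waiting indefinitely for the coordinator.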