This post is about the replication scheme we use in Apache BookKeeper. It is mostly about why we did it this way rather than how to use it. There are other sources of information about how to use BookKeeper.
When we started thinking about the design of BookKeeper and its replication scheme, we had some experience with the one we had used for ZooKeeper (its older sibling). ZooKeeper implements a replicated state machine using the Zab protocol, which enables replicas to agree on the changes to the ZooKeeper service. This agreement is not only about the changes themselves, but also about their order: replicas must agree both on the changes to the ZooKeeper state (znodes and sessions, to be more concrete) and on the order in which they are applied to the distinct replicas.
The protocol in the common case (ignoring leader election and changes of the leader) is fairly simple and looks like this:
A leader proposes a new state change, and the followers persist it to their transaction logs and acknowledge it back. Upon receiving acknowledgements from enough replicas (a quorum counting both followers and the leader), the leader commits by sending out commit messages. Once a replica gets a commit, it applies the state change (e.g., a new znode is created) and is ready to serve it (e.g., a getData for that znode).
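To make the quorum logic concrete, here is a minimal sketch of the leader’s side of this common case. This is illustrative Java, not the actual ZooKeeper implementation; the class and method names are invented for this post.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: tracks acknowledgements for a single
// proposal and commits once a majority has persisted it.
class LeaderSketch {
    private final int ensembleSize;            // total servers, leader included
    private final Set<Integer> acks = new HashSet<>();
    private boolean committed = false;

    LeaderSketch(int ensembleSize) {
        this.ensembleSize = ensembleSize;
        acks.add(0);                           // the leader persists the proposal too
    }

    // Called when follower `serverId` has persisted the proposal
    // to its transaction log and acknowledged it.
    void onAck(int serverId) {
        acks.add(serverId);
        // A quorum here is a majority of all servers, leader included.
        if (!committed && acks.size() > ensembleSize / 2) {
            committed = true;
            broadcastCommit();
        }
    }

    private void broadcastCommit() {
        // Tell every replica the change is safely replicated, so each
        // one can apply it and start serving it to readers.
    }
}
```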
One interesting observation from the figure above is that the leader needs to propagate the commit message because each replica needs to learn that the state change has been replicated across a quorum of replicas before it starts serving it. It doesn’t matter which replicas form the quorum, just that enough of them have persisted the change. The replicas need to learn about correctly replicated state changes because clients of the service can read the state of ZooKeeper from any server, and all healthy servers need to be ready to respond to read requests.
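The read side of this is visible in the client API. A minimal sketch with placeholder hostnames and paths: the client connects to whichever server its session lands on, and that server answers reads locally once it has applied the committed change.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class AnyServerRead {
    public static void main(String[] args) throws Exception {
        // The connect string lists the whole ensemble; the session may
        // land on any of the servers.
        ZooKeeper zk = new ZooKeeper(
                "server1:2181,server2:2181,server3:2181", 10000, event -> {});

        zk.create("/demo", "hello".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // This read is served by whichever server the client is connected
        // to, without a round trip to the leader, which is why every
        // replica must learn about commits before serving.
        byte[] data = zk.getData("/demo", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```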
A couple of details to note, which are not that important for this discussion, but I’ll add here for completeness. Requests to change the state of ZooKeeper can be submitted directly to the leader or to any follower. If a request hits a follower, the follower forwards it to the leader. The second point is about the commit step. Instead of having followers acknowledge only to the leader and wait for a commit message, we could have servers broadcast their acknowledgements to everyone else, in which case each server would learn on its own that a state change has been correctly replicated. One disadvantage of this option is that it increases the message complexity: with 5 servers, the 2-step approach takes 20 messages in total (4 proposals plus 4 × 4 broadcast acknowledgements), compared to 12 messages with the 3-step approach we use (4 proposals, 4 acknowledgements, 4 commits). The other disadvantage is that everyone needs to talk to everyone else in the 2-step approach, while in the 3-step approach we only need leader <-> follower communication, which simplifies the implementation.
With this replication scheme in mind, let’s now see what we have in BookKeeper. BookKeeper stores and replicates ledgers. Think of ledgers as transaction logs: they can only be appended to, they are in the critical path of applications, and they need to be stored safely. For each ledger, there is a single writer client, and the writer directly replicates the entries (byte arrays) across storage nodes we call bookies. There is no leader like in ZooKeeper, but the writer behaves somewhat like a leader, so we essentially have one leader per ledger.
Because the writer is the only one that needs to learn that an entry has been correctly replicated, we end up with a pattern similar to the one we had for ZooKeeper, except that we can completely eliminate the commit step:
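In the client API, the write path is correspondingly a single step: addEntry returns once the entry has been acknowledged by enough bookies. A minimal sketch, with the ZooKeeper address, quorum sizes, and password as placeholders:

```java
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerHandle;

public class WriterSketch {
    public static void main(String[] args) throws Exception {
        // The client finds bookies and stores ledger metadata via ZooKeeper.
        BookKeeper bk = new BookKeeper("zk1:2181");

        // Ensemble of 3 bookies; each entry is stored on 2 of them.
        LedgerHandle lh = bk.createLedger(3, 2, DigestType.MAC,
                "secret".getBytes());

        // addEntry completes once enough bookies have acknowledged the
        // entry; there is no separate commit step.
        long entryId = lh.addEntry("a state change".getBytes());
        System.out.println("added entry " + entryId);

        lh.close();   // seals the ledger; more on this below
        bk.close();
    }
}
```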
Even cooler, for the same reason, we don’t have to always add entries to the same set of bookies. We can spread entries across different sets of bookies like this:
which gives us parallelism when adding entries.
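In the API, this shows up as two separate sizes when creating a ledger: the ensemble size (how many bookies the ledger uses overall) and the quorum size (how many bookies store each individual entry). When the ensemble is larger than the quorum, consecutive entries are striped across different subsets of bookies. A sketch, again with placeholder values:

```java
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerHandle;

public class StripedWriter {
    public static void main(String[] args) throws Exception {
        BookKeeper bk = new BookKeeper("zk1:2181");   // placeholder address

        // Ensemble of 5 bookies, quorum of 2: each entry lives on 2 of
        // the 5, and consecutive entries land on different pairs, which
        // is what buys us parallelism when adding entries.
        LedgerHandle lh = bk.createLedger(5, 2, DigestType.MAC,
                "secret".getBytes());
        for (int i = 0; i < 10; i++) {
            lh.addEntry(("entry " + i).getBytes());
        }
        lh.close();
        bk.close();
    }
}
```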
Ok, it’s cool that we can eliminate a step, but now there’s a catch. We write entries to the bookies, but we might want to retrieve that data eventually, right? Since the bookies don’t know whether an entry has been correctly replicated, we need to read each entry from multiple bookies.
Reading from multiple bookies is necessary, but not sufficient to give us a consistent view of a ledger. Say we read the entries of a ledger while the writer is still writing. If the writer is half-way through the write of an entry e, one client might end up reading e while another client doesn’t. Consequently, we need a way for reader clients to learn which entries have been correctly replicated in a ledger.
To let readers learn what has been correctly replicated, we need a mechanism that enables them to quickly determine which those entries are whenever they show up to read. The simplest way to implement this feature was to use ZooKeeper: the writer writes to ZooKeeper the identifier of the last entry that has been correctly replicated. To avoid writing to ZooKeeper upon every entry we add to a ledger, we use a trick: we only write this identifier when we close the ledger. Doing it only upon closing implies that two readers can only really agree on the content of a ledger once the ledger is closed. That makes sense, because while the ledger is still being written to, readers can observe it at different times and with different content.
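On the reader side, this is what opening a closed ledger buys you: openLedger fetches the ledger metadata, including the identifier of the last correctly replicated entry, and every reader agrees on the content up to that identifier. A minimal sketch, with placeholder address and password:

```java
import java.util.Enumeration;

import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.BookKeeper.DigestType;
import org.apache.bookkeeper.client.LedgerEntry;
import org.apache.bookkeeper.client.LedgerHandle;

public class ReaderSketch {
    public static void main(String[] args) throws Exception {
        BookKeeper bk = new BookKeeper("zk1:2181");   // placeholder address
        long ledgerId = Long.parseLong(args[0]);      // id of a closed ledger

        // openLedger reads the metadata stored in ZooKeeper, so the
        // reader learns the last correctly replicated entry.
        LedgerHandle lh = bk.openLedger(ledgerId, DigestType.MAC,
                "secret".getBytes());

        // Every reader of the closed ledger agrees on this content.
        long last = lh.getLastAddConfirmed();
        if (last >= 0) {   // the ledger might be empty
            Enumeration<LedgerEntry> entries = lh.readEntries(0, last);
            while (entries.hasMoreElements()) {
                System.out.println(new String(entries.nextElement().getEntry()));
            }
        }
        lh.close();
        bk.close();
    }
}
```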
Of course, BookKeeper allows clients to read from a ledger before it is closed, but the way to do it is more of an API discussion, and I won’t cover it here.
One important conclusion of this discussion is that the replication protocol we use in BookKeeper, ignoring the presence of ZooKeeper, isn’t a complete replacement for Zab. Zab provides stronger guarantees, and in this particular scenario it complements the BookKeeper protocol. The BookKeeper use case is, however, an example of a replication scheme that differs from the one in ZooKeeper and other consensus-based services, and that exploits the nature of this reliable log-store application to be more efficient. The efficiency comes from a simpler protocol for the common case and from parallel access to the data of ledgers.