Dude, where’s my metadata?

This post is about silent data loss in replicated systems (state-machine replication a la ZooKeeper) due to the disk state being wiped out. The disk state is crucial in such systems to guarantee that replicated data isn’t lost in the case a server crashes and recovers. The replication protocol in ZooKeeper assumes that servers can crash and recover, and the ensemble can make progress as long as f + 1 members are up and running, where f is a bound on the number of faulty servers determined by the size of the ensemble. If the ensemble has 5 servers, then the ensemble can make progress as long as 3 are up (f = 2). It additionally assumes that the disk state before a server crash is there during recovery. This post is pointing out that this isn’t always the case, and it is an issue to be aware of. Note that this isn’t unique to ZooKeeper, but I’ll focus on ZooKeeper because I know it well.

The disk state can be wiped out for various reasons like faulty drives, misconfiguration, starting a server replica in a different machine virtual or not. Losing disk state might be silent at times for due to software, hardware, or operator errors. This issue has come up occasionally in the community, but to my knowledge hasn’t been documented or discussed extensively.

Here is the problem that losing disk state really induces. Say we have an ensemble of ZooKeeper servers comprising 3 servers: A B C. A leads and B follows; C is dormant. Now say that A and B commit a transaction, say 0x1 – “create znode /foo”. Recall from ZooKeeper 101 that before committing a transaction, a quorum of servers needs to persist it to disk by writing to the transaction log.

It all looks good until say A crashes and its disk state is gone. Now A wakes up and it has no memory of committing 0x1, so it thinks it is starting from scratch. A next forms a quorum with C, and C hasn’t heard anything about 0x1. A and C elect A to lead and declare the empty state to be the new state. In the meanwhile, B is clueless waiting to talk to other servers to form a quorum. When it does, it learns that the new state doesn’t contain 0x1. Transaction 0x1 is now lost forever…

This figure illustrates the execution I just described:


Getting around it: Increase replication

Fundamentally, this is a problem with the assumption that servers only crash in the case they are faulty. Losing this state looks more like an arbitrary fault (or a Byzantine fault) and could be handled as such. If instead of using simple majorities as quorums (more than half of the replicas) we use super majorities (more than 2/3 of the replicas), then we can guarantee that we have enough replicas. To see why it works, convince yourself that the smallest intersection of any two simple majorities has one server, while for super majorities we have always f + 1 in any intersection.

Having a single server in the intersection between quorums causes the problem above because the single server can crash, lose its disk state due to a faulty disk, and recover memoryless. Having f+1 servers in the intersection guarantees that as long as no more than f servers get a faulty disk simultaneously, any intersection between two quorums will always contain at least one server that hasn’t lost its disk state and the ensemble can recover the state it has previously replicated through that server in the interesection.

Using super majorities has one important drawback, which is requiring 3f + 1 servers to tolerate f faulty servers. To tolerate 2 faulty servers, we now need 7 servers rather than 5. Consequently, using 7 servers with super majorities, we can only tolerate 2 servers crashing rather than 3 servers as we have with simple majorities. That’s not great, so can we deal with this problem of silent data loss without having to resort to super-majorities?

Second cut: let the operator decide

An even simpler approach is to not restart servers automatically and perform a sanity check over the disk data before restarting the server. In the case the server does lose its disk state, it could be removed and reintroduced to the ensemble via reconfiguration. The reconfiguration takes care of rebuilding the state onto the server recovering so that it can start making progress with the rest of the ensemble.

Note that for the reconfiguration to work the ensemble must have enough replicas up to be able to make progress without the faulty server. In the example above, A can’t be reintroduced if B doesn’t start talking to C again. Getting B back might require an operator to intervene. But, what if B has also lost disk state? In this case, we have violated our f bound, which is f = 1 for the above example, and the ensemble is stuck, it can’t reconfigure A and B back in.

Third cut: using a token

The assumption that no more than f servers can be faulty and recovering simultaneously is obviously fine, but we have asked the question of whether we could automate even further by enabling progress when we have too many faulty servers, even if some data is lost in some rare cases.

One way is to use a token, a database identifier. Such identifiers have been used extensively in commercial databases like Oracle. Oracle has a create database command that generates a dbid. We would like the token generation to be automated instead of relying a command issued, however.

As with Oracle databases, the token represents an instance of the ZooKeeper database, and a server can only join an ensemble in the case it has the same token as everyone else. The servers consequently need to agree on a common token. Servers need to verify the token upon a handshake with the leader, possibly even during leader election, to make sure they refer to the same instance. They should also enable clients to determine if they are talking to the right incarnation.

What makes the token agreement problem interesting is that we need to reach agreement to reach agreement: Zab gives us agreement but we need to agree upon a token to tolerate disk failures. Instead of trying to solve the problem recursively, we simplify the solution by fixing a token distributor and requiring unanimity. A chosen server, not elected dynamically, is responsible for distributing new tokens, and it does it in a 2PC manner, requiring all other servers to acknowledge and commit.

As long as a majority of servers keeps the same token there is no need to generate a new one. In the case any server loses its disk, it recovers by adopting the token provided by a majority.  If no majority holds a token, then the token distributor does its job and generates a new one. Given that it requires unanimity, the token generator should wait until a majority explicitly declares not to have a token. Otherwise, because of timing, it could end up generating new tokens and instantiating the database unnecessarily.

An invariant of the system is that the token a message of the replication protocol carries always matches the token of the receiver, except for the case in which it doesn’t have a token and it is recovering. Servers not bearing a token shouldn’t participate in the protocol.

Back to the initial example, how does this token scheme solve the problem? When A comes back up and connects to C, it won’t have a token so A and C won’t be able to form a quorum and get going with processing requests. Server A needs to wait for B to say what its token is so that it can make progress. Now that A knows that there is a quorum with the same token, it can adopt the token by pulling a copy of the state of B. Why B? Because B has the latest state (latest epoch/zxid).

Let’s be upfront about the weak points of this scheme. First, it isn’t great to have a fixed distributor, but trying to elect one would complicate things and would essentially be a “solving consensus to solve consensus” kind of direction that wouldn’t work. Second, unanimity leaves no room for masking crashes. If any process is unavailable, then the token distribution protocol can’t terminate as described. Both assumptions are OK in the case new tokens aren’t distributed often, though. In fact, it really shouldn’t happen often because a new token represents a new instance of the database, which possibly means lost data. We might be able to copy some data from the previous instance, but there is no guarantee that it will contain everything that has been committed.

The history of the database id (dbid) in ZooKeeper actually dates back to the early days of the project [1]. We haven’t finished incorporating the dbid although the field is present in the persistent state of ZooKeeper. We recently started talking about this feature in the context of virtualization because servers might be restarted in a different host without carrying the disk state, in which case we wanted to avoid the problem mentioned here. Reconfiguration is also a topic I’ll leave for a future post. It affects the unanimity assumption because the set of servers could change dynamically.

Any conclusion?

Reconfiguration already exists in the 3.5 branch of ZooKeeper, so the option of reconfiguring a server in is viable. I also don’t see a strong reason for not automating this process of checking the disk state and reconfiguring if necessary, it sounds doable. The token scheme provides an additional notch of automation, but it will take some effort to implement. The use of super majorities was supposed to be simple with quorum verifiers, but when I tried to implement a verifier for super majorities, I ran into all sort of problems because of the dependency on simple majorities across the code. It sounds like the initial effort to abstract away the type of quorum system we use wasn’t really continued. I haven’t put much effort into getting this one to work, but it shouldn’t be too hard to fix it. I expect it to require some refactoring.

Acknowledgments: The token distribution scheme mentioned here has been discussed and developed with Johannes Klein and Satish Thatte. Thanks to Camille Fournier, Martin Kleppman, Patrick Hunt, and Bill Bridge for comments on a draft version of this post. It has changed quite a bit due to the comments.


[1] Apache ZooKeeper Jira – Unique DB identifiers for servers and clients

4 thoughts on “Dude, where’s my metadata?

  1. I don’t think it’s necessarily the case that transaction 0x1 is lost because only B still had it. You say that a transaction can only be committed when a quorum have logged it, but that’s a bit ZK-specific. In other systems, an operation can only be *reported as successful* after a quorum have logged it, but during subsequent recovery it’s only necessary for one node to have it. Thus, when B does rejoin, the state with one transaction (B’s) supersedes the state (A’s and C’s) with zero even though it’s a minority. This does create a possibility that an operation not reported as successful actually did have a permanent effect, but it also avoids lost operations in the scenario you describe. For many systems this is an acceptable tradeoff, especially since false positives are always possible because of the “lost ack” problem anyway.

    If A and C complete new operations before B rejoins, that creates a whole different problem. If the operations are commutative then the situation is still resolvable, but otherwise it might not be. Throw in client retries for operations perceived as failed, and you have a real mess on your hands. Still, it’s possible to do better than to throw 0x1 away just because of a disk failure on A.


  2. Jeff, It is sufficient in ZK too that a single node has it, but we need at least one node in each quorum to have it. Neither A nor C has it when they form a quorum, and B can take arbitrarily long to talk either A or C again. A anc C need to agree on a prefix of the state updates before they can start to make progress again, and once they do, they will start appending state updates to the transaction log. When B comes back and show up with 0x1, A and C can’t simply reinsert 0x1 back into the log or re-execute it because the state updates are causally dependent. There could be side-effects on the data that clients have already observed. But sure, we could roll back and reinsert 0x1 ignoring what clients have done. Given that only B has it, and we know it because otherwise it would have appeared in the A C quorum due to the intersection property, should we really roll back to redo the work every time we see such a transaction that we are not sure has been committed? It sounds complex and messy.

    Note that the example doesn’t show A and C making progress to simplify the discussion, so here I’m extending it by saying that A and C have processed state updates.

    Finally, is this zk-specific? It depends on what you compare to. I’d claim that all Paxos-like systems suffer from such an issue because they assume that storage is stable. Part of the message of this post is that the fault model of such systems doesn’t accomodate disk failure or simply the disk state being wiped-out.


  3. Flavio, I think you’re conflating two different things: presence of conflicting writes (log entries) vs. absence of concurring ones. The first is obviously problematic. The second, by itself, need not be. If A and C know that no writes have been performed since last contact with B – as must be clear from versions or similar that are still present at least on C – then there’s no reason to reject B’s write no matter how much time has passed.

    “Note that the example doesn’t show A and C making progress to simplify the discussion, so here I’m extending it by saying that A and C have processed state updates.”

    That’s well and good, but that is not the problem you initially presented and it’s not what I was responding to. The no-conflicts-since case is common enough in practice to be worthy of consideration and proper handling even if the conflict case remains intractable.


  4. A basic premise of ZK and other consensus-like systems is that the replicas agree on a unique sequence of state updates/commands. Once a given state update/command is agreed upon in a position of the sequence, that’s irrevocable. You could definitely imagine systems that relax this assumption, though, but that’s not what we do currently.

    As for the example, if you keep in mind that in the current model a state update can’t be moved to a different position later in the sequence, then the example in the post illustrates the problem. We would need a different protocol to take conflicts into account and I believe you’re also pointing out that even if we do it, conflicts might prevent us from bringing such lost transactions back into the sequence without breaking semantics. I like your observation, though. Thanks for making the point.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s