In distributed computing jargon, properties are classified as either safety or liveness properties [1, 2]. Consistency is a typical safety property: the state of the system is never inconsistent, for some definition of consistent. Of course, “never inconsistent” assumes an ideal world in which we developers do everything right in the code, which history has shown is rarely the case. There are other good examples of safety properties, like the agreement property of consensus protocols, which says that no two processes decide differently. Interestingly, in replicated systems, we obtain consistent replicas by guaranteeing that they agree on what is being executed and applied.
But it isn’t all about safety, is it? We also need to make progress; being safe by procrastinating is rarely an ideal situation. In systems like ZooKeeper, it is very important to be able to elect a leader: if there is no leader, then the ensemble can’t update its state. In systems like BookKeeper, if ZooKeeper is not available, we can’t create or open a ledger. At this point, let me highlight the difference between safety and liveness. If a ZooKeeper ensemble can’t elect a leader, then it can’t process new updates, but the state is still consistent, because either the previous leaders have done the right thing or the state is empty. Consequently, we can have a system that is in a safe state but isn’t live. In fact, the issue of safety and liveness for systems implementing consensus at their core is pretty delicate and fundamental. According to the FLP impossibility result [3], in a system that can’t differentiate a crashed server from a slow one, we cannot guarantee liveness without risking a violation of safety. Being perceived as slow can be a side effect of competing load running on the same machine, garbage-collection stalls, network load, etc.
I want to make the argument about liveness a bit more concrete, though. In particular, I want to point out that many parameters in distributed systems could lead to liveness issues inadvertently, and I’m going to use the initLimit parameter of ZooKeeper to illustrate the problem.
What does initLimit do?
The initLimit parameter is there exactly because of liveness. When a new leader is elected, the elected leader (often called the prospective leader) needs to bring the ensemble up to speed before it can start serving new requests. Say that the prospective leader drops dead in the middle of the synchronization with the followers, before it even has a chance to start leading. It makes sense that the followers drop the leader and go elect a new one, and initLimit is supposed to bound the amount of time the servers expect this synchronization phase to take. If it takes longer, then the followers drop the leader and go back to leader election. The leader can also drop followers if it notices that they are falling behind.
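For reference, initLimit is expressed in ticks and set in the server configuration file (zoo.cfg), alongside tickTime; the values below are purely illustrative:

```
# Illustrative zoo.cfg excerpt (values are examples only).
# tickTime is the basic time unit in milliseconds; initLimit and syncLimit
# are expressed as a number of ticks.
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
```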
But wait, what if the leader is legitimately taking long? What if snapshots are large and it really needs more time than initLimit is giving it to synchronize? For example, say that initLimit is 10 and the tick time is 2s. The total time the leader has to synchronize is 10 x 2 = 20s. Say the state is 2GB worth of data and that we can pull a snapshot off the disk at a rate of 50 megabytes per second (this is just for illustration purposes; the sketch after the list below walks through the arithmetic). The leader needs roughly 40s just to pull the data off the disk. In this case, we just need to increase the value of initLimit, right? Increasing the value of initLimit solves the problem only temporarily; the value will likely have to be changed again as the amount of data grows. Of course, there are brute force solutions, like:
- Making the maximum heap size small so that we hit an out of memory error before the initLimit value gives us a false positive;
- Making the value of initLimit very large, which is likely to work, but it’ll also make initLimit ineffective.
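To make the arithmetic above concrete, here is a quick back-of-the-envelope sketch in Java using the numbers from the example (the sizes and rates are purely illustrative, not measurements):

```java
public class SyncBudget {
    public static void main(String[] args) {
        // Numbers from the example above; purely illustrative.
        int initLimit = 10;                        // in ticks
        long tickTimeMs = 2_000;                   // 2s per tick
        long stateBytes = 2L * 1024 * 1024 * 1024; // ~2GB of state
        long diskBytesPerSec = 50L * 1024 * 1024;  // assumed 50 MB/s read rate

        long budgetMs = initLimit * tickTimeMs;                 // 20,000 ms
        long diskReadMs = stateBytes * 1000 / diskBytesPerSec;  // ~41,000 ms

        System.out.printf("sync budget: %d ms, disk read alone: %d ms%n",
                budgetMs, diskReadMs);
        // The 20s budget is exceeded just by reading the snapshot off disk,
        // before a single byte is sent over the network.
    }
}
```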
We could also consider having the leader report progress periodically with pings, but even if a follower is receiving pings periodically, for how long should it keep receiving these reports? What if the leader is late in sending the reports? How long should a follower wait even for the pings? We still need some estimate of time, and we cannot really get rid of a time parameter like initLimit.
A solution
A solution is to calibrate initLimit dynamically. The general idea is to estimate the number of bytes the leader needs to send to followers in the case it needs to send a snapshot. Recall that a leader typically sends only a diff of transactions, but in some cases it might need to send a whole snapshot of the state if the follower is lagging too far behind. If we know the rate at which a leader is supposed to process bytes and send them to followers, we can estimate a value for initLimit by dividing the state size by that rate. There are three complications here, however. First, how do we even determine what such a byte rate is? Network speed is a good estimate, but there is also the leader’s own processing and background network traffic, so it is unlikely that the leader is able to send at line speed. Second, the leader needs to be able to synchronize with a quorum before it starts, and the time to coordinate with a quorum must be taken into account somehow. Third, the followers need to know the value of initLimit so that they can apply it locally. Currently, initLimit is set via configuration and servers do not exchange their values; note that the value at each server can be set differently.
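To make the idea a bit more concrete, here is a minimal sketch of what such a calibration could look like. The class and method names and the fallback logic are hypothetical, not ZooKeeper’s actual code; the sketch simply divides the size of the on-disk state by an assumed byte rate and converts the result into ticks:

```java
import java.io.File;

/** Hypothetical sketch of calibrating initLimit from the state size. */
public class InitLimitEstimator {

    /**
     * Estimate initLimit (in ticks) by dividing the size of the latest
     * snapshot by an assumed byte rate. Falls back to the configured value
     * if the estimate is smaller, so calibration only relaxes the limit.
     */
    public static int estimateInitLimit(File latestSnapshot,
                                        long assumedBytesPerSec,
                                        long tickTimeMs,
                                        int configuredInitLimit) {
        long stateBytes = latestSnapshot.length();
        long estimatedSyncMs = stateBytes * 1000 / assumedBytesPerSec;
        int estimatedTicks = (int) ((estimatedSyncMs + tickTimeMs - 1) / tickTimeMs); // ceiling
        return Math.max(configuredInitLimit, estimatedTicks);
    }
}
```

A sketch like this only covers the division itself: the estimate still needs to leave room for coordinating with a quorum, and followers still need to learn the estimated value, since each server only knows its own configured initLimit.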
Stepping back for a moment, the goal is to set initLimit to a value that is fair for a prospective leader: fair in the sense that it gives the leader enough time to synchronize with a quorum and to not be perceived as slow just because the state is large. The consequence of not calibrating or setting the value of initLimit accurately is either false positives (if it is too aggressive) or false negatives (if it is too conservative). The simplest way to do this calibration is to check the size of the latest valid snapshot on disk and divide it by some conservative constant for the byte rate. For example, we can assume that a leader should be able to process bytes at a rate of 100 Mbit/s, which is on the low side for current technology, but gives us a lower bound on the amount of work a prospective leader should be able to perform without being perceived as slow. The drawbacks of such a solution? There are a couple:
- It is hard to tell if the specified byte rate is adequate. Making it too conservative (too low) will make a ZooKeeper ensemble wait longer to recover from a dead or slow leader. If the byte rate is set too high, then a prospective leader might not get enough time to synchronize with a quorum. The original initLimit parameter was designed to be conservative, however, and a conservative estimate is ok for many cases, since it enables progress in scenarios in which a static initLimit value doesn’t.
- When a leader starts, its latest snapshot includes all transactions in its transaction log. Consequently, looking at just the snapshot is pretty accurate. After the leader becomes established, the leader might have proposed/accepted some transactions that are not reflected in the latest snapshot. If a follower tries to synchronize with the leader at this point and the leader estimates initLimit using only the snapshot size, then the estimate is not accurate because it does not include transactions committed after the snapshot was taken. For practical purposes, the difference should not be large unless the recent transactions are mostly large writes.
To improve the estimate, we can consequently do two things: estimate the byte rate instead of setting it manually and improve the state size estimate.
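Here is a sketch of those two refinements, again with hypothetical names: the state size now includes the transaction log segments written after the latest snapshot, and the byte rate is a moving average of what the leader actually achieved in past synchronizations rather than a hard-coded constant:

```java
import java.io.File;

/** Hypothetical sketch of the two refinements to the initLimit estimate. */
public class RefinedInitLimitEstimator {

    /** State size: latest snapshot plus the transaction log written since it. */
    public static long estimateStateBytes(File latestSnapshot,
                                          File[] txnLogsSinceSnapshot) {
        long total = latestSnapshot.length();
        for (File log : txnLogsSinceSnapshot) {
            total += log.length();
        }
        return total;
    }

    /**
     * Byte rate: an exponentially weighted moving average, updated with the
     * rate observed during the most recent synchronization with a follower.
     */
    public static long updateMeasuredRate(long previousBytesPerSec,
                                          long bytesSent,
                                          long elapsedMs) {
        long observed = bytesSent * 1000 / Math.max(1, elapsedMs);
        return (3 * previousBytesPerSec + observed) / 4; // keep 75% of the old estimate
    }
}
```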
Back to liveness (or the absence of it)
Although I’ve focused quite a bit on the initLimit parameter, the core goal of this post is to bring some attention to liveness in distributed systems. There is a lot of focus on safety (for obvious reasons), but I’ve seen little discussion about liveness. There is of course the A of CAP, which is about liveness [4], but CAP availability refers to the ability to process requests in the presence of failures and network partitions. Here I refer to scenarios in which the system is not live even though there is no failure or network partition.
Here, the absence of liveness comes from a parameter being set to a value that is too small compared to what it really should be. The key observation is that some parameters related to timing might lead to liveness issues. A potentially better solution than just setting a fixed value is to calibrate such parameters dynamically, but doing so is not entirely trivial: it requires estimates and some guesswork. For initLimit specifically, there is a JIRA issue open for this particular problem (ZOOKEEPER-1977), and we at Microsoft have done some preliminary work to get it adjusted dynamically. We should be posting a patch some time soon.
Acknowledgements
Thanks to Martin Kleppmann and Raúl Gutiérrez Segalés for comments on a draft of this post. I’d like to also acknowledge the contributions of my Microsoft buddies Ian Dimayuga and Greg Foxman on the initLimit work and thank them for discussions and reviews. Keep rocking, guys!
References
[1] Lamport, L., “Proving the Correctness of Multiprocess Programs”, IEEE Transactions on Software Engineering, Volume SE-3, Issue 2, pp. 125-143, March 1977.
[2] Alpern, B. and Schneider, F., “Recognizing safety and liveness”, Distributed Computing, Volume 2, Issue 3, pp. 117-126, 1987.
[3] Fischer, M., Lynch, N., and Paterson, M., “Impossibility of Distributed Consensus with One Faulty Process”, Journal of the ACM, Volume 32, Issue 2, pp. 374-382, April 1985.
[4] Gilbert, S. and Lynch, N., “Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services”, ACM SIGACT News, Volume 33, Issue 2, June 2002.