No consensus in exactly-once

Exactly-once semantics is an intriguing, controversial, and, for me, an exciting topic. I have been dealing with it in the recent past across different systems, and one particular issue that got me interested is the connection I have seen a few people make to FLP, the distributed consensus impossibility result [1]. I argue here informally that this connection is not correct. This is not supposed to be a final word on exactly once, but it might shed some light on terminology and properties.

Deriving and understanding impossibility results are as important as deriving distributed algorithms. Impossibility results set the limits of what can and cannot be done. In the case of consensus, the FLP result states that the consensus problem cannot be solved in a purely asynchronous system; an asynchronous system is one with no time bounds on processing and message propagation. Although the proof requires a bit of thinking, the essence is that in an asynchronous system, we cannot really tell whether a process is crashed or just slow to respond. In some situations, under such uncertainty about other processes, we could end up violating agreement among processes if a process takes a step to guarantee termination.

To make any argument about the connection between exactly once and consensus, we need to be able to define both. The definition of consensus using the validity, agreement, and termination properties is well known, but what about exactly once? What is supposed to happen exactly once? Is it delivery or something else? Before I get into definitions, I’d like to first explain the context in which I have seen others refer to exactly once.

The following sections cover:

  • The context in which exactly once is often brought up, and some of the different ways it can be interpreted, which can overlap but are not necessarily equivalent.
  • The (dis)connection between exactly-once delivery and consensus. I argue that exactly-once delivery and consensus are not equivalent and conclude that, as such, the FLP impossibility result does not apply to exactly-once delivery alone.

A few ways of thinking about exactly-once semantics

The discussion in this section is not supposed to be exhaustive, but it illustrates the context in which I have seen exactly-once brought up recently. I’m not covering remote procedure calls (RPCs) explicitly, although it is a valid example where exactly-once guarantees apply as observed in the work of Spector [2] and Birrell [3].

Idempotence

In a number of scenarios, systems claim to satisfy exactly-once semantics when they mean to say that they avoid duplicates. We can think of an arbitrary data structure over which we want to avoid applying the same transformation multiple times. For example, adding the same element multiple times to a multiset, incrementing a counter multiple times, appending the same message to a log multiple times, etc.

To avoid duplicate transformations, we can keep a history of all requests that have been submitted and only apply a transformation if the data structure does not already reflect it. In some cases, we can use sequence numbers to spot duplicates and reject them rather than keep a history of transformations around. For example, if we are implementing a distributed log, then it is desirable not to have a second log around for each log instance.
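As a sketch of this idea, here is a minimal, hypothetical log that rejects duplicates using per-producer sequence numbers rather than keeping a full history of requests (the names here are illustrative, not taken from any particular system):

```python
class DedupLog:
    """A log that deduplicates appends via per-producer sequence numbers."""

    def __init__(self):
        self.entries = []
        self.last_seq = {}  # highest sequence number applied per producer

    def append(self, producer_id, seq, payload):
        """Append payload unless (producer_id, seq) was already applied."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate retransmission: reject, idempotently
        self.entries.append(payload)
        self.last_seq[producer_id] = seq
        return True

log = DedupLog()
log.append("p1", 0, "a")
log.append("p1", 0, "a")  # retransmitted duplicate: rejected
log.append("p1", 1, "b")
```

Note that the sequence numbers stand in for the full history: a single integer per producer is enough to recognize a retransmission, so no second log is needed.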

With such data structures, the property we want is idempotence, as we want to avoid transformations being executed multiple times. Idempotence makes our data structure at-most-once, but what about the at-least-once part? There is really nothing we can do to make a strong guarantee about it, unless we are willing to block indefinitely to avoid a violation of safety, which reminds me of the FLP proof, in which we violate agreement because a process takes a decision step in a bivalent configuration to satisfy termination. Retransmitting typically gets a transformation executed at least once, but in some hopefully rare scenarios, we need to either give up and move on, or error permanently.
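The at-least-once side can be sketched as a bounded retry loop over an unreliable channel. Everything below is simulated for illustration: `send` is a made-up function that drops messages randomly, and the retry bound models the point where we give up rather than block forever.

```python
import random

def send(msg):
    """Simulated unreliable channel: the ack is lost half the time."""
    return random.random() < 0.5  # True means the receiver acknowledged

def send_at_least_once(msg, max_attempts=200):
    """Retransmit until acknowledged, up to a bound.

    The receiver may process the message more than once overall; pairing
    this with an idempotent receiver gives the effective exactly-once
    behavior discussed above.
    """
    for _ in range(max_attempts):
        if send(msg):
            return True  # delivered at least once
    return False         # give up and surface a permanent error
```

The bound is the key design choice: without it we block indefinitely on a dead receiver, and with it we accept a (hopefully rare) permanent failure path.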

Note that the argument up to this point has nothing to do with how many times the data of the structure is read. For example, if I have a distributed log that is deduplicated, there is nothing that prevents me from reading or delivering the items in the log multiple times. The scheme only makes sure that the log has no duplicates. To avoid multiple deliveries or duplicates downstream, we need something else.

Delivery

Delivery appears typically in the context of message-passing systems. A process p sends a message, while a process q delivers it. Delivering a message m means that q receives the message and processes it fully with some procedure f(m), which is part of some application.

In the presence of crashes, and assuming that the execution of f is stateful, the subsystem that calls f has no way to know for sure whether an invocation of f has completed successfully or not. The procedure f needs to either be idempotent or be able to roll back to guarantee that the result of processing m is not reflected multiple times in the state of f.

One of the issues with guaranteeing that f is executed in an idempotent manner is to make sure that reprocessing a message upon recovery is not reflected twice in the application state. To avoid such a problem, the application can recover from a snapshot of the state and request that the delivery subsystem resume from the message immediately after the last transformation in the snapshot. This technique is used in the implementation of replicated state machines, where the replicated log is used to replay transformations over the latest snapshot. Outside recovery, duplicates can be identified via sequence numbers or some other form of unique identifier.
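The recovery scheme just described can be sketched as follows, assuming a snapshot that records the state together with the index of the last transformation it reflects. The names and the trivial `apply` function (a running sum standing in for f) are made up for illustration:

```python
def apply(state, m):
    """Stand-in for f(m): here, simply accumulate the message value."""
    return state + m

def recover(snapshot, log):
    """Restore state from a snapshot and resume from the next message.

    The snapshot is a (state, last_applied_index) pair, so messages at or
    before last_applied_index are never reapplied and thus never reflected
    twice in the state.
    """
    state, applied_upto = snapshot
    for index in range(applied_upto + 1, len(log)):
        state = apply(state, log[index])
    return state

log = [1, 2, 3, 4]
snapshot = (3, 1)          # state after applying log[0..1] = 1 + 2
state = recover(snapshot, log)  # resumes at log[2]
```

Recovering from an empty snapshot, e.g. `(0, -1)`, replays the whole log and yields the same final state, which is the consistency argument behind replaying over snapshots.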

The bottom line is that the messaging subsystem alone, be it a pub-sub, a message bus, or something else, cannot guarantee that delivery of a message happens exactly once or that it is reflected in the state of procedure f in an idempotent manner without application assistance. Application here refers to the implementation of the procedure f, which is outside the scope of the messaging subsystem.

Transactions

For applications that have complex transformations, idempotence alone does not provide a strong guarantee. For example, if processing a message m implies two separate changes, s1 and s2, then it is possible that a crash hits the process right between s1 and s2. Reprocessing m implies that s1 is going to be duplicated, undesirably. At the same time, we need to process m again because the s2 change is missing. Consequently, we need the ability to roll back or abort partial changes. We could also make the application of these changes idempotent so that s1 is not reflected twice, but there are at least two reasons to consider transactions in the presence of such complex transformations:

  1. The computation is non-deterministic and there is a temporal dependency between s1 and s2, so we need to recompute s1;
  2. The application does not want to expose partial results, so we either make all changes visible or none (atomicity).
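A minimal sketch of this transactional behavior, with made-up state and update functions: both changes are staged on a private copy and installed in one step, so an abort between s1 and s2 leaves no partial result visible.

```python
def process_atomically(state, m):
    """Apply the two changes implied by m atomically.

    Both updates are made on a private copy; only the final return
    installs them, so a failure between the two changes (simulated here
    for negative m) leaves the caller's state untouched.
    """
    staged = dict(state)             # work on a private copy
    staged["s1"] = state["s1"] + m   # first change, s1
    if m < 0:
        raise ValueError("abort")    # simulated crash/abort between s1 and s2
    staged["s2"] = state["s2"] * m   # second change, s2
    return staged                    # commit: install both changes at once

state = {"s1": 0, "s2": 1}
state = process_atomically(state, 3)   # both s1 and s2 applied
try:
    state = process_atomically(state, -1)
except ValueError:
    pass                               # aborted: no partial s1 is visible
```

In a real system the staging would be a transaction in a store with rollback, but the shape of the guarantee is the same: either both changes become visible or neither does, which is what makes unbounded retries safe.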

Non-determinism may arise for a number of reasons, e.g., the input of the computation changes over time, the computation depends on some external source that changes over time, etc. There is some illustration and more insight around this problem in this slide deck from Strata London 2016.

This use of transactions applies both to the data structures above and to the procedure f in the discussion around delivery, assuming the application is able to accommodate such transactional processing.

A word on the distributed consensus problem

Let’s switch gears and talk about exactly-once and the distributed consensus problem. To make sure that this text is accessible even to readers who do not know consensus intimately, this section briefly discusses what consensus is about and gives some references.

The consensus problem consists of a set of processes, each one with an initial value, such that each proposes its own value and they want to decide on a single one of those values. For example, if three processes p1, p2, p3 propose 0, 0, 0, respectively, then the only possible decision value is 0. If instead they propose 0, 0, 1, then both 0 and 1 are possible decision values.

Consensus is at the core of replicated state machines, a technique for implementing fault-tolerant services [4]. To implement such a service, we have replicas of the service executing the same requests in the same order. To guarantee that the replicas execute the same requests in the same order, we implement a sequence of consensus instances: for each position in the sequence, an execution of consensus decides which of the proposed requests goes into that position. This sequence of consensus instances is in rough terms what we know as atomic broadcast.
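The structure just described can be sketched as follows. Here `decide` is a stand-in for a real consensus instance: it simply picks one of the proposed values deterministically, under the assumption that every replica sees the same decision for each slot.

```python
def decide(proposals):
    """Stand-in for one consensus instance over a set of proposed commands.

    A real protocol (e.g., Paxos) would run rounds among the replicas; here
    we just pick one proposal deterministically, which models the agreement
    and validity properties (same decision everywhere, and it is one of the
    proposed values).
    """
    return sorted(proposals)[0]

def run_replica(slots):
    """Apply the decided command of each log slot, in slot order."""
    state = []
    for proposals in slots:          # one consensus instance per log slot
        command = decide(proposals)  # agreement: same decision at every replica
        state.append(command)        # total order: applied in slot order
    return state

slots = [{"put x"}, {"put y", "put z"}, {"del x"}]
# every replica computes the same sequence of commands, hence the same state
assert run_replica(slots) == run_replica(slots)
```

The point of the sketch is the shape, not the protocol: a sequence of per-slot decisions plus in-order application is what turns consensus into atomic broadcast.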

One known (meta-)protocol for implementing replicated state machines is Paxos [5]. Paxos per se does not give us consensus directly as defined above, but it is relatively straightforward to go from the protocol to implementing consensus with it, once we understand the protocol. There is a learning curve in understanding the protocol, though. Other protocols based on or inspired by Paxos have been proposed for implementing replicated state machines, e.g., Raft [6] and Zab [7].

In some more detail, consensus is defined with three properties:

  • Agreement: If processes p, q ∈ Π are both correct and p decides v, then q also decides v
  • Termination: If process p ∈ Π is correct, then p eventually decides some value v
  • Validity: If a process p decides v, then v is the initial value of some process in Π

where Π is the set of processes. The atomic broadcast protocol is defined as a type of reliable broadcast [8, 9]:

  • RB-Validity. If a correct process broadcasts a message m, then it eventually delivers m.
  • RB-Agreement. If a correct process delivers a message m, then all correct processes eventually deliver m.
  • RB-Uniformity. For any message m, every process delivers m at most once, and only if m was previously broadcast by sender(m).

that additionally satisfies total order:

  • AB-Total order. If two correct processes p and q deliver two messages m and m’, then p delivers m before m’ if and only if q delivers m before m’.

The literature on consensus and broadcast protocols is vast. The references at the end of this post give some direction and more detail on what is being discussed here.

Exactly once and Consensus: Equivalent?

Let’s now compare consensus and exactly-once delivery. Consensus is known to be equivalent to atomic broadcast [8]. Atomic broadcast guarantees agreement on the set of messages delivered, but it also guarantees that all processes deliver messages in the same order. Wait, I said order… Does exactly-once say anything about order? The exactly-once property does not, in principle, imply order. Instead, the property implies that delivery or transformation happens once and no more than once; the order of delivery or execution does not matter. Perhaps order matters for the application, but I am focusing on the exactly-once property, not the overall set of properties the application requires.

Assuming my observation about order is correct, atomic broadcast and exactly-once delivery cannot be equivalent. If exactly-once delivery and atomic broadcast are not equivalent, then exactly-once delivery and consensus cannot be equivalent either.

Exactly-once delivery could be equivalent to reliable broadcast, which has no order guarantee, is a weaker problem than atomic broadcast, and, not surprisingly, is solvable in asynchronous systems. However, because exactly-once delivery implies a property about a send-receive interaction, not necessarily about a broadcast, a more appropriate abstraction is that of a reliable channel, which guarantees that messages are eventually delivered, and neither lost nor duplicated [10].

A note on FLP: the FLP result shows that in an asynchronous system, we cannot really tell whether a process is crashed or just slow to respond. In some situations, under such uncertainty about other processes, we could end up violating the agreement property if a process takes a step to guarantee termination. In practice, there are situations that can cause a system to block indefinitely, but we never violate agreement. There is progress when the system is live, and liveness is defined according to the protocol used.

One interesting observation with respect to FLP is that the proof assumes that messages are delivered eventually, and exactly once, in the absence of a crash. It assumes that messages can be reordered, but delivery happens eventually, again assuming that the receiver process does not crash. Interestingly, such a channel, which would be sufficient to satisfy exactly-once delivery, is not sufficient to enable consensus in asynchronous systems, as the proof shows. Note that the fault model in the FLP proof does not consider recovery, which is critical for practical purposes. This does not lose generality, however, as the scenarios under the crash model can also happen under the crash-recovery model.

Exactly once and consensus: The origin?

One relevant question to ask is where the perception that exactly-once delivery and consensus are equivalent problems comes from. I conjecture that it is due to the ability of replicated state machines to maintain a consistent state despite crashes and to deduplicate using the state of the state machine.

Replicated state machines typically replicate a log. Such a log is used to persist and order the state transitions of the state machine. When restarting a replica, it is sufficient to replay the transitions in the log (we often use snapshots to avoid replaying from scratch). With such a mechanism, even if we restart a replica after a crash, we end up with a consistent state, because the state transitions will be the same every time and the log is replicated in a totally-ordered manner. The log and the state itself can also be used to determine which transitions or operations have been processed, to avoid duplicates during the broadcast phase. For example, a transition whose identifier is already in the log is not processed again.
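The deduplication described here can be sketched as a check against the identifiers already present in the log; the names below are illustrative, not from any particular implementation:

```python
class StateMachine:
    """A state machine whose log doubles as a deduplication record."""

    def __init__(self):
        self.log = []     # replicated, totally-ordered transitions
        self.ids = set()  # identifiers of transitions already in the log

    def submit(self, txn_id, op):
        """Append a transition unless its identifier is already in the log."""
        if txn_id in self.ids:
            return False  # already processed: drop the duplicate
        self.log.append((txn_id, op))
        self.ids.add(txn_id)
        return True

sm = StateMachine()
sm.submit("t1", "inc")
sm.submit("t1", "inc")  # client retry after a timeout: deduplicated
```

This is the mechanism that makes the overall system look exactly-once to clients, even though the deduplication itself needs no consensus; consensus is only what keeps the replicated log identical across replicas.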

What is the relation to consensus? As mentioned above, consensus is at the core of replicated state machines. This mechanism therefore shows that consensus is sufficient, but not that it is necessary; in fact, I argued above that it is not.

Bottom line

Exactly-once intuitively means that something happens once and only once. In the context of some current systems, however, the term implies that something that happens multiple times is effective only once. Multiple applications of a transformation or message delivery only affect the state of a given application or system once.

Two core concepts to achieve this property are idempotence and transactions. With idempotence, multiple executions of the same transformation do not cause the state to diverge from the correct one, as the transformation is by definition reflected only once. With transactions, aborts can roll back partial updates and, as such, enable an arbitrary number of retries. Idempotence and transactions are not interchangeable, as we briefly pointed out here, but they enable mechanisms to apply some transformation multiple times without causing state (e.g., of a system, of a data structure, of an application) to diverge from the correct one.

One point often raised in the discussion of exactly-once delivery is its connection to consensus and the FLP impossibility result. I have argued here that exactly-once delivery and consensus are not equivalent, and in fact suggested that exactly-once delivery is a weaker problem than consensus, primarily because it does not require order of delivery. The treatment of the subject was far from rigorous, but I have cited results in the literature to support the argument. If the argument holds, its relevance is that implementing exactly-once delivery is actually easier than implementing consensus in the absence of other strong requirements such as total order.

Acknowledgements

Thanks to my colleagues Gregory Chockler, Stephan Ewen, Bhargav Gulavani, and Sean T. Allen for comments that helped me shape the ideas in this post.

References

[1] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (April 1985), 374-382.

[2] Alfred Z. Spector. 1982. Performing remote operations efficiently on a local computer network. Commun. ACM 25, 4 (April 1982), 246-260.

[3] Andrew D. Birrell and Bruce Jay Nelson. 1984. Implementing remote procedure calls. ACM Trans. Comput. Syst. 2, 1 (February 1984), 39-59.

[4] Fred B. Schneider. 1990. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv. 22, 4 (December 1990), 299-319.

[5] Leslie Lamport. 2001. Paxos made simple. ACM SIGACT News 32, 4 (December 2001), 18-25.

[6] Diego Ongaro and John Ousterhout. 2014. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX conference on USENIX Annual Technical Conference (USENIX ATC’14), Garth Gibson and Nickolai Zeldovich (Eds.). USENIX Association, Berkeley, CA, USA, 305-320.

[7] Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini. 2011. Zab: High-performance broadcast for primary-backup systems. In Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems&Networks (DSN ’11). IEEE Computer Society, Washington, DC, USA, 245-256.

[8] Tushar Deepak Chandra and Sam Toueg. 1996. Unreliable failure detectors for reliable distributed systems. J. ACM 43, 2 (March 1996), 225-267.

[9] Romain Boichat and Rachid Guerraoui. 2005. Reliable and total order broadcast in the crash-recovery model. J. Parallel Distrib. Comput. 65, 4 (April 2005), 397-413.

[10] Anindya Basu, Bernadette Charron-Bost, and Sam Toueg. 1996. Solving Problems in the Presence of Process Crashes and Lossy Links. Technical Report. Cornell University, Ithaca, NY, USA.


4 thoughts on “No consensus in exactly-once”

  1. I’m confused why idempotency is brought up as exactly-once. By definition, idempotency says f(x) = f(f(x)). Idempotency can make multiple deliveries appear to be exactly-once, but it does not make it exactly-once.


    1. I like to think of idempotence in computer science more along the lines of how it is defined here with behavioral equivalence:

      https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/popl38-ramalingam.pdf

      When talking about updates to data structures, it is useful to refer to a property about updates being eventually executed and not being reflected in the data structure more than once (e.g., not counting twice), i.e., updates being applied once and no more than once.

      This is not to say that it is entirely appropriate to use the terminology this way, though. It is part of the point of the post that there is a bit of abuse of terminology in the current parlance of distributed systems. When someone says “I have exactly-once” and someone else responds with “but it is impossible”, they are often talking about different things, which is bad; there is clearly some terminology discrepancy.

      I did not attempt to converge on terminology here, but instead I tried to comment on a few different ways that I have seen others referring to exactly-once, including data structure updates and message delivery. I also tried to raise the point about FLP because I have seen comments around it multiple times in the recent past.


      1. My point is more that idempotency is how you handle situations where you do not have exactly-once semantics. It does not provide them.

