Saturday, May 25, 2013

2PC or 3PC

"2PC or not 2PC, that is the question". Perhaps if Shakespeare were alive today Hamlet would be asking one of the most popular questions in distributed systems for at least the past three decades. I'm not going to revisit the reasons why we use a consensus protocol in ACID transactions, because you can find enough articles on the subject if you read this blog. However, it seems that the distinction between different consensus protocols, specifically two-phase commit (2PC) and three-phase commit (3PC) is not necessarily as widely understood, or rather why vendor transaction managers use 2PC. Yes, there are other consensus protocols, such as Paxos, but let's just focus on 2PC or 3PC today. Before we start looking at why 2PC is more popular, if you haven't read about the FischerLynchPaterson (FLP) results which show that you can't reach agreement in an asynchronous environment if even one failure is allowed, without augmenting the system with, say, failure detection mechanisms.

Strict 2PC is a blocking protocol: for instance, if the coordinator crashes during the second (commit) phase, then the participants must remain in their indeterminate state until it recovers, which could be forever if recovery never happens. Now of course using a strict 2PC protocol in the real world would immediately present some problems: as long as participants remain blocked, the resources that they represent and hold, e.g., locks, are maintained, thus possibly preventing further work on behalf of others. If failures never happen then it's not an issue, but unfortunately the 2nd law of thermodynamics will always get you in the end: entropy increases and failures do happen, no matter how improbable.

This is why heuristics were introduced as a way of allowing 2PC to operate in a more pragmatic manner: any participant that has got past the first (prepare) phase and does not receive a response from the coordinator in a timely manner can unilaterally decide to commit or abort its portion of the transaction. If it gets the decision right (same as the one the coordinator made) then everything is fine. If it gets the decision wrong, then we have a non-atomic outcome, or a heuristic outcome. Resolving this typically requires outside help, e.g., a system administrator. However, with the assumption that failures are relatively rare coupled with the fact that a failure would have to happen after the prepare phase to potentially cause a heuristic, this is not such a bad compromise. Yes heuristics happen, but they will typically be rare. And of course let's not forget about those other optimisations we typically use with 2PC, such as presumed abort, which help to reduce performance overhead as well as allowing safe autonomous choices to be made which do not result in heuristics.

But what about 3PC? You can find good descriptions of the protocol in the literature and elsewhere, so I won't go into details here. Suffice it to say that 3PC removes the blocking nature by disseminating the decision from the coordinator about whether to commit or abort the transaction amongst all of the participants. This means that if the coordinator does crash, the participants can still move forward to complete the transaction. In theory this is a good thing. However, an additional phase, which includes information about the transaction outcome and participants, introduces an overhead in every transaction and this overhead is only useful if the coordinator fails.

As such, 3PC is good for environments where failures are common (more probable) and where heuristics aren't allowed. This additional overhead is not something which the majority of environments, users, use cases etc. are prepared to accept and prefer to use 2PC, along with the other optimisations and trade-offs I mentioned earlier. We optimise for the failure-free environment, which is still the most probable situation for the majority of transaction use cases. That's why all major transaction implementations today use 2PC and not 3PC. Now that doesn't mean it will continue to be the case: if environments or scenarios change in such a way that blocking or heuristics are no longer applicable (maybe mobile) then we may see a change in direction.

No comments: