Messaging Guarantees

The Two Generals Problem

Let’s suppose you are a general in command of a medieval army, and you find yourself in the following situation.

You must siege and capture an enemy settlement. The only way to do it successfully is to coordinate a simultaneous attack along with the forces of another general on the opposite side of the settlement, so you can attack from both flanks. You need to devise a way of coordinating such attack.

You both agree that you will send a message to your fellow general at the time of the attack. And to be sure that the messenger has not been intercepted on its way and the message has made it safe to your fellow general, he will send another message back acknowledging receipt of the first one.

You are about to sign off on the plan, but then your fellow general asks: “How would I know that the confirmation has arrived to you?”. He’s worried, and with good reason. This is because if the messenger carrying the confirmation is intercepted, he would find himself attacking the settlement alone with the prospect of a terrible defeat. That is not a good position to be in!

He proposes that you send an acknowledgement of the acknowledgement, to prevent him from attacking by himself. You are about to agree with his reasonable request, but then you realise something: “Wait a minute! What would happen if my acknowledgement is intercepted? Then you won’t do anything and I’ll be the one attacking by myself!”

The generals have a problem. It should be fairly obvious by now that stacking acknowledgements is not going to help anyone. This problem is impossible to solve. There is no way they can devise a mechanism by which they can coordinate reliably because the medium of their message is fallible.

The Impossibility of Exactly-Once Delivery

This story is fundamental to understanding distributed systems and their challenges. The moral of the story is that is impossible to guarantee the delivery of a message via unreliable transport, no matter the technique we use (like acknowledgements). The internet protocol is an unreliable transport, since connections can be broken, faulty or temporarily lost. Packet loss is an unavoidable consequence.

This has tremendous implications for our applications built on top of TCP or HTTP. For instance, sending a message to a queue can fail. How we handle this failure is crucial to our system architecture.

One option is to retry sending the message. The problem with retrying though is that the effects of the operation being retried could be applied twice or more. For example, if I publish a message to a queue but the publish fails because I didn’t receive any ACK from the message queue, what was the problem? Was my message lost trying to reach the server or was the acknowledgement of the server lost on its way to me?

Remember the two generals' problem? If the acknowledgement was lost on its way to me, then the queue has already processed my message. If I retry the operation, the queue could receive the same message again and the effects of that action could run twice — and that better not be charging a customer! There is no way for me, the producer of messages, to know for sure if the server has effectively processed my requirement or not.

Message Delivery Guarantees

Most people don’t try to solve this problem and are happy with offering their customers the easiest and weakest of message delivery guarantees: at-most-once delivery. This means, that you will send the request only once, but if it fails, you won’t retry. It may or may not be sent once, but you can be sure it won’t be more than once.

When message loss is unacceptable, exponential backoffs with retry logic are implemented. But as soon as you retry, you are embracing an at-least-once delivery guarantee. This means that you will keep trying to send this message until you receive an acknowledgement, but that might send the message more than once if an acknowledgement is lost on its way back to you, the producer.

Bottom line, you can either choose between at-most-once or at-least-once message delivery guarantees, but exactly-once is a technical impossibility.

If you have chosen to support an at-most-once guarantee, you can relax and sit in your nice home office setup in peace. If loss of messages is unacceptable for you and you need to support an at-least-once model, but don’t mind receiving duplicates of messages, then you can get away with exponential retry logic with no problem and still sit pretty. But if duplicates are not an acceptable trade-off either (ejem! payments) then you are in for some fun problem to solve.

Looking at TCP: Exactly-Once Processing

So, is there anything we can do to deal with these unavoidable duplicates? We should look a lower level protocols to rescue some ancient ideas to deal with this.

If you know about internet protocols a bit you will know that TCP is an acknowledgement-based protocol. Every packet sent by the client needs to be acknowledged by the server. This brings us back again to The Two Generals problem. What happens if an acknowledgement is not received by the client? In the case of TCP, then the client sends the same packet again.

But wait!? What if that acknowledgement was lost on its way back to the client and the server did receive the message? Well, simple: the server will receive a duplicate message. It will receive the same packet again.

So, how does TCP deals with this? We could say that — simplifying the actual inner workings of TCP — this problem is solved by using state: the server stores the “ids” of the packets it has seen for that connection and simply ignores and skips any duplicates.

This is actually an oversimplification of the way TCP works and deals with deduplication and ordering. For an exact explanation, you might want to google two key concepts: the Transmission Control Block and Sliding Window on TCP.

Looking at TCP we can actually find a solution to the exactly-once delivery impossibility. We need a way to identify the same message in the context of a retry, keep a list of the seen ids and discard the ones that have been seen already. This technique is commonly known in web services as idempotency. Some other people call it deduplication. I like to call it exactly-once processing.

It requires the client to “assign” a unique identifier — sometimes called the Idempotency key — to every message. But it also requires the server to “track” which ids have been processed so they can be ignored if sent twice. Using this technique is possible to process a message once and only once and avoid the effects of an operation being accidentally executed twice.

Of course, there must be a limited window of time the server has to store the tracked ids (since it would be wasteful and inefficient to store every possible seen id) and if you have multiple servers processing the message, it also has to be a centralised form of storage.

Conclusion

Distributed systems are composed of a group of processes that need coordination and whose sole way of communication is via the network. But as the network is fallible, is impossible to guarantee that a message will be delivered successfully. Understanding this problem, its tradeoffs and how to mitigate it is essential when designing and developing distributed systems.