Wednesday, February 24, 2010

Eventual Consistency vs Atomic Consistency

I was having a very interesting and neurally stimulating conversation with Brian Holmes who runs SmashedApples, a cloud computing framework for Flex, about Eventual Consistency and trusting that data is correct.

In my original post on swarm computing, I was discussing the use of shards over a distributed system. One thing that I hadn't considered was the possibility that you could write data to more than one source. I still held to the principle that when it comes to data, a system requires a single source of truth. After pondering this thought for a while, I have firmly come to the conclusion that you do not necessarily require a single source of truth.

What is Eventual Consistency?
Eventual consistency is the premise that eventually all copies of data will propagate and update themselves. I always had a problem with this. Most situations assume atomic consistency, or that once a write is made, any further reads will obtain the correct information.

This assumption of atomic consistency is a hang over from decades of programmers working with Relational Databases on single systems. In distributed collaborative systems, there is always a degree of lag. It comes down to how that lag is handled by the system.

The concept of Read Validation over Write Validation
We all have encountered and used write validation. By Write Validation, I mean that you cannot save the state of a system without passing all integrity checks.

Let's say that we have a payment form for a utility company, and we have a field that allows you to put in how much you want to pay on your bill. A write validation is that you can't put $-100 into the field and save the form. A read validation is that for submission you cannot have $-100 in the field, but you can save any and all values to the server.

Example 1 - A Holiday in Elbonia:
Let's say a fellow decides to go on holiday. Let's call him John Smith. John loves adventure, danger and generally experiencing life at the edge of civilization. In fact, much to the dismay of his twin sister Jane, John would have much preferred to be involved in anthropological activities on a National Geographic level rather than a computer programmer.


This is why John has decided to depart for the shores of the island of far away Elbonia.

To prevent his dear sister from suffering a heart attack from worry, John has promised Jane that he will provide her with regular updates as to where he is, and what he is doing. Hence, before checking in, John decides to update his facebook profile to say that he is at the airport.


Under the atomic consistency model, John would push the update to the server, the server would update, and anyone who would read his facebook page would immediately see he is at the airport.

Under the Eventual Consistency model under normal operation, both John and Jane would see the same thing within a mere number of seconds, as the clouds holding the data would be updated when the call is made.

But what would happen if the City Node was unable to talk to the central Facebook cloud?

John would see that his change has been accepted on his device, but his sister Jane will still see his previous status until the City cloud node that she sees has updated.

But what happens if multiple updates occur?

Let's assume that John is busy playing the latest game he downloaded to his portable device, and almost completely runs his battery dead. It is a long flight to the Balkans; especially if you fly with Elbonian airlines. All the way along, John is updating his status but his device cannot talk to the Internet as his device must be in Flight Mode.

When John arrives in Armpit, the capital of Elbonia, he updates his status to "Arrived At Customs", at which time all of his previous updates are sent.

Since the Internet is not so reliable in a fourth world country such as Elbonia, as luck would have it, the link to the outside world is currently down. After John leaves customs five hours later, he receives a frantic call from Jane, who says he hasn't updated his facebook status. John calmly explains that the internet is down in Elbonia, but he's heading for the hotel now. Just then, John's phone runs out of power.

After Jane's conversation ends with John, Jane logs in to Johns account, and updates his status to Going to the hotel.

When John returns to the hotel, he logs in to update his status. He still sees "Arrived At Customs", so he updates his status to "At Hotel".

An hour later the Internet in Elbonia is back up again, and the servers update. So what should everyone see?

An assumption must be made that the latest data is the most relevant. Each status change would have a timestamp. Jane's timestamp would be earlier than John's timestamp, so the servers can record the change to the record in the correct order. The latest status change would then be that John is "At Hotel".

Example 2 - Withdrawing cash in Elbonia:

After settling in to his hotel room, John decides that he wants to go out and buy some food, but he forgot to change money back at the airport. He needs some Elbonian Dollars($EB) or as colloquially known, some Grubniks.




Let's assume that there are two banks, the Bank of Elbonia (BoE), and the First National Bank of Elbonia (FNBE). Say that John has a bank account with WestBank back in civilization, with $101 in it.



Let's say John tries to withdraw $100 from the BoE, and $100 from FNBE immediately afterwards. Given the case of atomic consistency, John can successfully withdraw $100 from BoE, but not from FNBE, as he has insufficient funds. This is correct behaviour.


If we are using Eventual Consistency, under normal circumstances, the lag from one bank to the next will be far less than the time it takes for John to complete your transactions, let alone that John is able to get to another ATM and withdraw funds. So hence, John can successfully withdraw $100 from BoE, but not from FNBE.

Now lets say that Elbonia is having internet issues again, and the connection between BoE, FNBE and WestBank are down. This gives us a few scenarios.

Under the atomic consistency model, as BoE and FNBE cannot talk to Westbank, the transaction will fail. This is obviously not a good situation for the client, as they do not have legitimate access to their funds. This is because we are relying upon Write Validation.

Under the Eventual Consistency model, you will be able to withdraw $100 from BoE, which will record the $100 delta change on your account. Since the bank has not yet communicated the delta change, our intrepid explorer can successfully withdraw funds from both the Bank of Elbonia, and the First National Bank of Elbonia and wind up with a $101 balance. Right?

Well, yes and no. It's just a question of when. Let me first clarify that if John were to return to the BoE or the FNBoE, and try to withdraw money, he would be greeted with a $1 balance.

Eventually, the internet (even in Elbonia), will come back up. This is when Westbank receives the 2 deltas of a $100 withdrawl, and leaves our punter with a negative balance of -$99 in his account.

The advantage of using Read Validation in this circumstance is that John will be able to legitimately withdraw $100 from his bank account. The disadvantage is that John will be able to illegitimately withdraw a further $100 from his bank account, which will need to be resolved by business procedures and workflows later.

But I bet a few of you out there just like John might be thinking, hey, that's great! Why don't I just pull out a million bucks in cash, and then move to Paraguay? It comes down to the business rules of an ATM to prevent you from being able to do that. Remember that there is a maximum amount that can be withdrawn from an ATM at a given time, say $1000. Furthermore, there are only a set number of banks operating in a country.

Following this assumption, not many people will be withdrawing copious amounts of money given these circumstances, as generally Terms of Service will mean that the banks can communicate with each other in under a day in the worst of disasters.

Couple this to the high level of scrutiny that is placed to holders of bank accounts with the Anti Money Launder and Counter Terrorism Funding (AML/CTF) legislation that has been passed in many countries; You have a high level of assurance that the person exists and is legitimate. In any case, a shady character would not be able to glean enough from a system to make it worth their while to commit fraud.

This level of risk can be acceptable for some banks. But what if the bank that John transacts with does not wish to allow withdrawls if communication channels drop out? This can technically still be achieved.

Some objects of very high importance may still require a single source of truth to authorize transactions. Given this situation, you may only have partial functionality should a link go down. As in, John could check his balance from an Elbonian ATM, but could not actually withdraw any money.

There is another situation as well. Some bank accounts may allow for a certain amount of money to be passed out, but other bank accounts may not. This would merely be a piece of data associated with John's account to say that he can withdraw up to $800 per day, or $1000 or however much risk that both John and his bank are willing to accept.

In the end, in this example the Read Validation method provides more flexibility and customer service for the consumer, whilst still allowing for resolution of the problem in this area.

Example 3 - The disjointed timesheet
As always, on his first day on holidays, John gets a phone call from his boss, Bob. After the usual discussion of explosions, earthquakes, fires, flood and tidalwaves caused by a small show stopper defect in John's code that got shaken out in testing, John has to provide a fix.

Of course this is not that big a problem for John who implements the fix and updates SVN in under 8 hours, but now he has to update his timesheet. The timesheet application is cloud enabled, so he updates his timesheet, and is blissfully unaware that Elbonia has lost its connection to the outside world again.

John happens to be assigned to a task called "Release Defects" to which he places 8 hours on. This updates to the server.

In the meantime, the whole discussion about John in the office means that Dave, the resource manager, remembers that he had promised John that he would fill out his holiday timesheets for him. So, he opens John's timesheet and places 8 hours of work against his vacation task.

The timesheet has a business rule on it that says that no more than 12 hours of time can be assigned on a particular day. So how would the system resolve this situation?

Updates can occur on mistaken axioms. Dave had assumed that John had not placed any time on his timesheet. John had assumed that Dave had not placed any time on his timesheet. Both of them relied upon a mistaken belief. Dave thought that John has spent no time working, and John didn't know that Dave had updated his time.

This is why Read Validation must allow data that is invalid to the business rules to exist. Although there is currently 16 hours worth of work on John's timesheet, users should be able to enter more time. However, when John goes to submit his timesheet for approval, it shouldn't be allowed, as he has not satisfied the business rule of 12 hours.

So what would happen if Dave subsequently submitted John's timesheet?

There is a business rule that states that once a timesheet has been submitted for approval, it cannot be changed unless it has been unsubmitted.

So let's say that John added his time after Dave submitted his timesheet. In this case, it is clear cut that John should be informed that his timesheet had been submitted before John added his time. It would then be up to John to fix the situation.

If John added his changes before Dave submitted John's timesheet on his behalf, then the system should allow John's changes, roll back the submission, and inform Dave that there is a discrepancy of time.

But we need a new piece of technology here. We need to know when the update was made and what timestamp of data that the user had seen at the time. Each update needs to contain both pieces of information, so that the users can glean the reasoning behind how the decision was made to be able to resolve the conflict. Those familiar with Concurrent Versioning Systems (CVS) and Subversion (SVN) et al, will see that this kind of temporal dilation of information has been dealt with for quite a long time already, and it is an established principle.

In Conclusion
As my friend Brian Holmes always says, cloud computing is like as though every function is an island that can be calculated and the results be shipped out to its destination.

This is just like us humans. We are all islands that have limited vision of the world that we live in. It is not about being perfect in our universe, but rather about being fault tolerant.

Although Atomic Consistency is the ultimate in data integrity, in reality, we never truly have Atomic Consistency in life. We follow business rules and have elastic error validation to tolerate the False Axiom problem. This allows us to have Eventual Consistency, and when a data collision occurs, we can collaborate to resolve the issue correctly.

The Eventual Consistency model comes down to the level of trust that you place into the action, and the level of damage that can occur should an irreversible collision happen, such as withdrawing funds from an ATM whilst you have a negative balance.

This is a business decision, which can be made on a use case basis, so you can still allow for a single source of truth for a select number of transactions, where the business does not wish it to occur. This is exactly like human transactions.

If you go to a store, and their credit card facilities are down, and you don't have any cash, chances are, you are not walking out of there with the goods that you want. This is exactly the same situation with distributed transactions.

As for John, he has successfully conquered Elbonia and gotten paid for the day of his holiday that he had to work.

Meanwhile, Jane is back to worrying whether John is eating a healthy lunch and flossing regularly. No-one said that it's easy being John.

I on the other hand, would like to hear what your thoughts are on this topic.

No comments:

Post a Comment