分类: 架构设计与优化
2015-10-22 16:27:13
One of the common problems faced by anyone building large scale distributed systems: how to ensure that only one process (replace this with: worker, server or agent) across a fleet of servers acts on a resource? Before jumping in to how to solve this, let us take a look at the problem more in detail and see where it is applicable:
What are the examples of where we need a single entity to act on a resource?
Coordination across distributed servers
All these scenarios need a simple way to coordinate execution of process and workflows to ensure that only one entity is performing an action. This is one of the classic problems in distributed system as you have to not only elect a leader across a set of distributed processes (leader election) but also detect its failure (failure detection) and then change the appropriate group membership and elect a new leader.
A classic way to solve this problem is to use some form of system that can bring consensus. This can be a single server which can act as the source of truth (not a good idea for obvious reasons of availability or durability), a system that implements a consensus algorithm like Paxos or build a form of leader election on top of a strongly consistent datastore (will get to this later).
What is Paxos?
Paxos is the gold standard in consensus algorithms. It is a distributed consensus protocol (or a family of protocols if you include all its derivatives) designed to reach an agreement across a family of unreliable distributed processes. The reason this is hard is because the participating processes can be unreliable due to failures (e.g., server failure, datacenter failure or network partitions) or malicious intent (byzantine failures).
You can read more about Paxos here:
How can I use Paxos to solve the problem of coordination?
In Amazon, we use consensus algorithms heavily in various parts of the company. A few ways we have solved the problem of solving this:
Lock service has been a widely popular means to solve this problem. Amazon has built its own lock service for internal use, Google building , and open source systems like Zookeeper.
Using a strongly consistent datastore to build lock service
One of the common tricks we have seen being used in the past is to emulate a lock service behavior (or APIs) using a strongly consistent datastore. Basically, a strongly consistent datastore that is replicated, highly available and durable (e.g., DynamoDB) can be used to persist the information on who is holding it, for how long etc.
Now the issue that this solution does not solve is failure detection. Traditionally lock service also provides a simple library that the clients can use to heartbeat with the service. This ensures that when a lock holder is partitioned away, the lock service revokes the lock and the library notifies the lock holder to abort its critical section.
In using a datastore to build lock service APIs, you have to solve this solution. In this front, I came across an interesting on how Ryan has built a lock service API on DynamoDB. He has outlined the problem, solution and limitations quite well and very interesting read.
Do you need lock service at all?
An obvious question to ask yourself before venturing into any of the approaches above is: Do you need a lock service after all? Sometimes the answer is you don’t need to as executing the same workflow twice may not be an issue. Whereas there are many other cases where you need such an abstraction. So, please make this decision carefully.