Layers in THE - Duke University

Layers in THE - Duke University

VAXclusters: A Closely Coupled Distributed System Landon Cox February 10, 2017 Tight and loose coupling Characteristics of a tightly-coupled system Close proximity of processors High-bandwidth communication via shared memory Single copy of operating system Characteristics of a loosely-coupled system Physically separated processors Low-bandwidth message-based communication Independent operating systems Tight and loose coupling Tightly-coupled systems can provide great performance What are the disadvantages of a tightly-coupled system? Scaling gets extremely expensive (e.g., supercomputer) Relatively hard to extend (add components) Failed component can often bring down entire system

Closely-coupled VAXClusters tried to resolve this tension Want extensibility (should be easy to add components over time) Want availability (i.e., fault tolerance) Should be relatively affordable Performance should be acceptable Tight and loose coupling Performance bottleneck for loose coupling Communication between processors Processes in tightly-coupled systems use shared memory Processes in a loosely-coupled system use messages Physical memory Process 1 Process 2 Phys

mem Phys mem Process 1 Process 2 Tight and loose coupling What makes message passing so much slower? The interconnect (network versus memory bus) Need to copy data into and out of various address spaces Physical memory Process 1 Process 2 Phys mem Phys mem

Process 1 Process 2 Close coupling So we have to make communication fast System Communication Architecture (SCA) Computer interconnect (CI) for message passing CI Port provided hardware support Types of messages that SCA supported Messages (small w/ ordered, reliable delivery) Datagrams (small w/ unordered, unreliable delivery) Block transfers (large w/ ordered, reliable delivery) CI Port interface Data structures through which software and CI Port communicate CI Port registers 7 queues (4 command queues, response queue, free queues)

Phys Mem Commands Port driver Responses Free queues CI Port CI Port interface Block-transfer commands include Hosts know this info because

Exchanged via messages Messages/datagrams act as control plane, block transfers as data plane How does this reduce copying? Command points into src/dst address spaces (page tables) Identifies contiguous regions for data to be copied Dont have to copy data into individual messages Datagram/messages include payload in command CI Ports can reach into memory themselves How else does this improve performance? Far fewer interrupts for OS to handle Instead of interrupting every 576 bytes for new packet CI Port can copy data into dst address space w/o interruption

Only interrupt when transfer is complete Storage Single, networked storage interface Disks were distributed across cluster Single namespace for all files Advantages of a unified file-system namespace Makes it easier to add new nodes Makes sharing files easier Can login anywhere and get to your data Storage If we want to allow concurrent access to files, we need Synchronization primitives Otherwise, processes cant coordinate their activities This was a problem for single-host file systems But the OS kernel synchronized access

Why is synchronization harder in a cluster? Cluster is a distributed system Nodes can fail, messages can be lost, etc. Storage When synchronizing access to in-memory data Locks are also in memory No magic: locks, objects all in the same address space Why not use file content itself as the basis for syncing? (i.e., write lock=owned to file lock.txt) Would require very strong consistency guarantees Equivalent to memory accesses Horrendously slow, and potentially have to pay penalty at all times

Implementing locking First have to agree on cluster membership How does a cluster agree on its membership? If we dont all agree on who is around Its going to be really hard to agree on anything else Each node has a connection manager Each connection manager has a copy of the membership Use a quorum voting scheme Consensus is a really hard problem

Nodes can fail and come back on-line arbitrarily Messages can be lost or slow Impossible to distinguish between failures and slow performance Locking interface Users define locks and their names How are locks named? Locks represent a hierarchical namespace Maps nicely onto the file-system namespace What kinds of modes can locks be in? Exclusive access, protected read Concurrent read, concurrent write, null, etc. Locks thus far Lock anytime shared data is read/written Ensures correctness Only one thread can read/write at a time Would like more concurrency How, without exposing violated invariants? Allow multiple threads to read (as long as none are writing)

Reader-writer interface (called when thread begins reading) readerFinish (called when thread is finished reading) writerStart (called when thread begins writing) writerFinish (called when thread is finished writing) If no threads between readerStart writerStart/writerFinish Many threads between readerStart/readerFinish Only 1 thread between

writerStart/writerFinish Reader-writer interface vs locks R-W interface looks a lot like locking *Start ~ lock *Finish ~ unlock Standard terminology Four functions called reader-writer locks Between readStart/readFinish has read lock Between writeStart/writeFinish has write lock Pros/cons of R-W vs standard locks? Trade-off concurrency for complexity Must know how data is being accessed in critical section Back to VAXclusters

Hierarchical locks Who maintains the locking state (i.e., the queues)? Each lock has a master node First node to request the lock is the master How do I find a locks master node? Allows coarse-grained mutual exclusion (tree roots) Allows fine-grained concurrency (tree leaves) Through the resource directory

The resource directory is replicated at several nodes If you dont find a lock master in the directory Youre it! Have to update the directory to reflect your status Handling failure Connection manager to lock managers We are in transition, please de-allocate locks. What does a lock manager do? Releases all non-local locks Re-acquires all locks owned before transition (creates new directory nodes and re-distributes masters) What does this guarantee about the state of data? Not much

Can leave in an inconsistent state No guarantee that previous lock holder will get it again Influence of VAXClusters Many of the concepts are relevant today Distributed locking Consensus and failure detection High availability using cheap hardware For example Google, Facebook and every other cloud service e.g., infrastructure to support MapReduce jobs How can things fall apart Machines Machines

Machines Machines Machines can can can can can Easier get slow crash and reboot crash and die become partitioned behave arbitrarily Harder Step 1: dont lose data if machines crash and reboot Step 2: dont lose data if machines crash and die What has to happen if machines are not guaranteed to restart after a crash?

Transactions have to commit at > 1 machine Paxos ACM TOCS: Transactions on Computer Systems Submitted: 1990. Accepted: 1998 Introduced: Butler W. Lampson http://research.microsoft.com/en-us/um/people/ blampson / Butler Lampson is a Technical Fellow at Microsoft Corporation and an Adjunct Professor at MIT..He was one of the designers of the SDS 940 time-sharing

system, the Alto personal distributed computing system, the Xerox 9700 laser printer, two-phase commit protocols, the Autonet LAN, the SPKI system for network security, the Microsoft Tablet PC software, the Microsoft Palladium high-assurance stack, and several programming languages. He received the ACM Software Systems Award in 1984 for his work on the Alto, the IEEE Computer Pioneer award in 1996 and von Neumann Medal in 2001, the Turing Award in 1992, and the NAEs Draper Prize in 2004. Barbara Liskov MIT professor 2008 Turing Award View-stamped replication PODC 88 Very similar to Raft State machines At any moment, machine exists in a state What is a state? Should think of as a set of named variables and their values State machines Clients can ask a machine about its current state.

What is your state? Clien t 1 4 2 3 6 5 My state is 2 State machines actions change the machines state What is an action? Command that updates named variables values

State machines actions change the machines state Is an actions effect deterministic? For our purposes, yes. Given a state and an action we can determine next state w/ 100% certainty. State machines actions change the machines state Is the effect of a sequence of actions deterministic? Yes, given a state and a sequence of actions, can be 100% certain of end state Replicated state machines Each state machine should compute same state, even if some fail. Clien t

What is the state? Clien t What is the state? What is the state? Clien t Replicated state machines What has to be true of the actions that clients submit? Applied in same order Clien t Apply action

a. Clien t Apply action c. Apply action b. Clien t State machines How should a machine make sure it applies action in same order across reboots? Store them in a log! Actio Actio n n Actio Actio n n

Actio Actio n n Replicated state machines Once we have a leader it begins to service client requests. Leader=L Leader=L Apply action a. Leader=L Clien t

Leader=L Replicated state machines Common approach Take simple, general service Implement using consensus Allow more complex services to use simple one Examples Chubby (Googles distributed locking service) Zookeeper (Yahoo! clone of Chubby) Replicated state machines Common approach Take simple, general service Implement using consensus Allow more complex services to use simple one Examples

Chubby (Googles distributed locking service) Zookeeper (Yahoo! clone of Chubby) Zookeeper Hierarchical file system A tree of named nodes / Directories (contain data, pointers) /app1 Files (contain data) /app2 Clients can Create/delete nodes Read/write files Receive notifications Used for coordination /app1/ config /app1/

online /S1 /app1/ online / /app1/ online /Sn Zookeeper Two kinds of nodes / Persistent Ephemeral Persistent /app1 Exist until deleted

e.g., service-wide config data /app1/ Ephemeral Exist until client departs e.g., group membership /app1/ config /app1/ online /S1 /app2 /app1/ online / /app1/ online /Sn

Zookeeper Automated sequence numbers / Node names can be sequenced Will prove very useful Clients create nodes /app/foo-X ZK chooses an order /app/foo-1 /app1/ config /app/foo-2 /app/foo- /app1 /app1/

online /S1 /app2 /app1/ online / /app1/ online /Sn Zookeeper API Create creates node in tree Delete deletes node in tree Exists Can embed information in

the hierarchical namespace tests if node exists at location Get children lists children of node Get data reads data from node Set data writes data to node Sync waits for data to commit Can store extra details in individual nodes data Zookeeper implementation Zookeeper Follow

er Follow er Clien t Leader Follow er Follow er Clien t A client connects to one server (any server will do). Zookeeper guarantees Sequential consistency

Updates from client applied in order sent. Atomicity No partial results. Updates succeed or fail. Single system image Client has same view, regardless of server. Reliability Applied updates persist until overwritten by another update. Timeliness Clients view guaranteed to be up-to-date within a time bound. Zookeeper does not provide strong consistency ient reads not guaranteed to contain all other client updates

Zookeeper guarantees Sequential consistency Updates from client applied in order sent. Atomicity No partial results. Updates succeed or fail. Single system image Client has same view, regardless of server. Reliability Applied updates persist until overwritten by another update. Timeliness Clients view guaranteed to be up-to-date within a time bound.

Zookeepers consistency guarantees Read my writes (see your own updates) Consistent prefix (see a snapshot of the state) Monotonic reads (never go backward in time) Example use Want to crawl and index the web Would like multiple machines to participate Want URLs explored at most once Solution Maintain an ordered queue of URLs Crawling machines assign themselves a URL to explore Crawling machines may add new URLs to queue Requires a producer-consumer queue // Class simulating workers adding URLs to the queue public class CreateQueue { private class QueueAddWorker extends ConnectionWatcher implements Runnable { private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z"); private Random r = new Random(); private String name; public QueueAddWorker(String name){ this.name = name; } @Override public void run() {

try { while(true){ this.connect("localhost"); String added = zk.create("/queue/q-", null, Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL); this.close(); Thread.sleep(new Long(r.nextInt(1000)+50)); } } catch (Exception e){ e.printStackTrace(); } } } Nodes names are sequenced, e.g., /queue/q-0000000003 public static void main(String[] args){ CreateQueue cQ = new CreateQueue(); Thread addWorker1 = new Thread(cQ.new QueueAddWorker("worker1")); Thread addWorker2 = new Thread(cQ.new QueueAddWorker("worker2")); Thread addWorker3 = new Thread(cQ.new QueueAddWorker("worker3")); addWorker1.start(); addWorker2.start(); addWorker3.start(); } }

Now we can use watches to be notified when new nodes are added. http://henning.kropponline.de/2013/01/14/zookeeper-distributed-queue-locking/ public class PullQueue extends ConnectionWatcher { private class PullWatcher implements Watcher { private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z"); private Random r = new Random(); private String name; private List children; public PullWatcher(String name) throws Exception { this.name = name; children = zk.getChildren("/queue", this); } Registers object as a watcher for the node /queue @Override public void process(WatchedEvent event) { try { if (event.getType().equals(EventType.NodeChildrenChanged)) { children = zk.getChildren("/queue", this); // get the children and renew watch Collections.sort(children); // we are getting an unsorted list for (String child : children) {

if (zk.exists("/queue_lock/" + child, false) == null) { try { zk.create("/queue_lock/" + child, dfrm.format(new Date()).getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL); // PROCESS QUEUE ENTRY zk.delete("/queue/" + child, -1); } // even so we check the existence of a lock, // it could have been created in the mean time // making create fail. We catch and ignore it. catch (Exception ignore) { } } } } } catch (Exception e){ e.printStackTrace(); } } } public static void main(String[] args) throws Exception { PullQueue pQ = new PullQueue(); pQ.connect("localhost"); try { pQ.new PullWatcher(args[0]); Thread.sleep(Long.MAX_VALUE); } finally { pQ.zk.close(); } } }

http://henning.kropponline.de/2013/01/14/zookeeper-distributed-queue-locking/ public class PullQueue extends ConnectionWatcher { private class PullWatcher implements Watcher { private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z"); private Random r = new Random(); private String name; private List children; public PullWatcher(String name) throws Exception { this.name = name; children = zk.getChildren("/queue", this); } When check /queue changes, we if any children were added @Override public void process(WatchedEvent event) { try { if (event.getType().equals(EventType.NodeChildrenChanged)) { children = zk.getChildren("/queue", this); // get the children and renew watch Collections.sort(children);

// we are getting an unsorted list for (String child : children) { if (zk.exists("/queue_lock/" + child, false) == null) { try { zk.create("/queue_lock/" + child, dfrm.format(new Date()).getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL); // PROCESS QUEUE ENTRY zk.delete("/queue/" + child, -1); } // even so we check the existence of a lock, // it could have been created in the mean time // making create fail. We catch and ignore it. catch (Exception ignore) { } } } } } catch (Exception e){ e.printStackTrace(); } } } public static void main(String[] args) throws Exception { PullQueue pQ = new PullQueue(); pQ.connect("localhost"); try { pQ.new PullWatcher(args[0]); Thread.sleep(Long.MAX_VALUE); } finally { pQ.zk.close(); }

} } http://henning.kropponline.de/2013/01/14/zookeeper-distributed-queue-locking/ public class PullQueue extends ConnectionWatcher { private class PullWatcher implements Watcher { private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z"); private Random r = new Random(); private String name; private List children; public PullWatcher(String name) throws Exception { this.name = name; children = zk.getChildren("/queue", this); } Start to iterate through the new queue entries @Override public void process(WatchedEvent event) { try { if (event.getType().equals(EventType.NodeChildrenChanged)) { children = zk.getChildren("/queue", this); // get the children and renew watch Collections.sort(children); // we are getting an unsorted list for (String child : children) {

if (zk.exists("/queue_lock/" + child, false) == null) { try { zk.create("/queue_lock/" + child, dfrm.format(new Date()).getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL); // PROCESS QUEUE ENTRY zk.delete("/queue/" + child, -1); } // even so we check the existence of a lock, // it could have been created in the mean time // making create fail. We catch and ignore it. catch (Exception ignore) { } } } } } catch (Exception e){ e.printStackTrace(); } } } public static void main(String[] args) throws Exception { PullQueue pQ = new PullQueue(); pQ.connect("localhost"); try { pQ.new PullWatcher(args[0]); Thread.sleep(Long.MAX_VALUE); } finally { pQ.zk.close(); } } }

What is the problem with workers concurrently processing new entries? Work would be duplicated, as all workers process all new entries http://henning.kropponline.de/2013/01/14/zookeeper-distributed-queue-locking/ public class PullQueue extends ConnectionWatcher { private class PullWatcher implements Watcher { private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z"); private Random r = new Random(); private String name; private List children; public PullWatcher(String name) throws Exception { this.name = name; children = zk.getChildren("/queue", this); } Start to iterate through the new queue entries @Override public void process(WatchedEvent event) { try { if (event.getType().equals(EventType.NodeChildrenChanged)) { children = zk.getChildren("/queue", this); // get the children and renew watch Collections.sort(children); // we are getting an unsorted list

for (String child : children) { if (zk.exists("/queue_lock/" + child, false) == null) { try { zk.create("/queue_lock/" + child, dfrm.format(new Date()).getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL); // PROCESS QUEUE ENTRY zk.delete("/queue/" + child, -1); } // even so we check the existence of a lock, // it could have been created in the mean time // making create fail. We catch and ignore it. catch (Exception ignore) { } } } } } catch (Exception e){ e.printStackTrace(); } } } public static void main(String[] args) throws Exception { PullQueue pQ = new PullQueue(); pQ.connect("localhost"); try { pQ.new PullWatcher(args[0]); Thread.sleep(Long.MAX_VALUE); } finally { pQ.zk.close(); } }

} How do we prevent duplicate work? Have workers lock entries that they want to process http://henning.kropponline.de/2013/01/14/zookeeper-distributed-queue-locking/ public class PullQueue extends ConnectionWatcher { private class PullWatcher implements Watcher { private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z"); private Random r = new Random(); private String name; private List children; public PullWatcher(String name) throws Exception { this.name = name; children = zk.getChildren("/queue", this); } @Override public void process(WatchedEvent event) { try { if (event.getType().equals(EventType.NodeChildrenChanged)) { children = zk.getChildren("/queue", this); // get the children and renew watch Collections.sort(children); // we are getting an unsorted list for (String child : children) { if (zk.exists("/queue_lock/" + child, false) == null) {

try { zk.create("/queue_lock/" + child, dfrm.format(new Date()).getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL); // PROCESS QUEUE ENTRY zk.delete("/queue/" + child, -1); } // even so we check the existence of a lock, // it could have been created in the mean time // making create fail. We catch and ignore it. catch (Exception ignore) { } } } } } catch (Exception e){ e.printStackTrace(); } } } These lines try to create a lock file for the entry. public static void main(String[] args) throws Exception { PullQueue pQ = new PullQueue(); pQ.connect("localhost"); try { pQ.new PullWatcher(args[0]); Thread.sleep(Long.MAX_VALUE);

} finally { pQ.zk.close(); } } What happens if workers concurrently try to create the same file? } Only one will succeed, the failed worker will catch an exception http://henning.kropponline.de/2013/01/14/zookeeper-distributed-queue-locking/ public class PullQueue extends ConnectionWatcher { private class PullWatcher implements Watcher { private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z"); private Random r = new Random(); private String name; private List children; public PullWatcher(String name) throws Exception { this.name = name; children = zk.getChildren("/queue", this); } @Override public void process(WatchedEvent event) { try { if (event.getType().equals(EventType.NodeChildrenChanged)) { children = zk.getChildren("/queue", this); // get the children and renew watch Collections.sort(children); // we are getting an unsorted list for (String child : children) {

if (zk.exists("/queue_lock/" + child, false) == null) { try { zk.create("/queue_lock/" + child, dfrm.format(new Date()).getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL); // PROCESS QUEUE ENTRY zk.delete("/queue/" + child, -1); } // even so we check the existence of a lock, // it could have been created in the mean time // making create fail. We catch and ignore it. catch (Exception ignore) { } } } } } catch (Exception e){ e.printStackTrace(); } } } These lines try to create a lock file for the entry. public static void main(String[] args) throws Exception { PullQueue pQ = new PullQueue(); pQ.connect("localhost"); try { pQ.new PullWatcher(args[0]);

Thread.sleep(Long.MAX_VALUE); } finally { pQ.zk.close(); } } } Why is it possible for only one create to succeed? ZK returns to client on commit; new nodes must reach majority to commit http://henning.kropponline.de/2013/01/14/zookeeper-distributed-queue-locking/ public class PullQueue extends ConnectionWatcher { private class PullWatcher implements Watcher { private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z"); private Random r = new Random(); private String name; private List children; public PullWatcher(String name) throws Exception { this.name = name; children = zk.getChildren("/queue", this); } @Override public void process(WatchedEvent event) { try { if (event.getType().equals(EventType.NodeChildrenChanged)) { children = zk.getChildren("/queue", this); // get the children and renew watch Collections.sort(children);

// we are getting an unsorted list for (String child : children) { if (zk.exists("/queue_lock/" + child, false) == null) { try { zk.create("/queue_lock/" + child, dfrm.format(new Date()).getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL); // PROCESS QUEUE ENTRY zk.delete("/queue/" + child, -1); } // even so we check the existence of a lock, // it could have been created in the mean time // making create fail. We catch and ignore it. catch (Exception ignore) { } } } } } catch (Exception e){ e.printStackTrace(); } } } These lines try to create a lock file for the entry. public static void main(String[] args) throws Exception { PullQueue pQ = new PullQueue(); pQ.connect("localhost");

try { pQ.new PullWatcher(args[0]); Thread.sleep(Long.MAX_VALUE); } finally { pQ.zk.close(); } } } Are workers reading children of /queue_lock, guaranteed to see all locks? No, a worker is only guaranteed to see the locks it created http://henning.kropponline.de/2013/01/14/zookeeper-distributed-queue-locking/ public class PullQueue extends ConnectionWatcher { private class PullWatcher implements Watcher { private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z"); private Random r = new Random(); private String name; private List children; public PullWatcher(String name) throws Exception { this.name = name; children = zk.getChildren("/queue", this); } @Override public void process(WatchedEvent event) { try { if (event.getType().equals(EventType.NodeChildrenChanged)) { children = zk.getChildren("/queue", this); // get the children and renew watch

Collections.sort(children); // we are getting an unsorted list for (String child : children) { if (zk.exists("/queue_lock/" + child, false) == null) { try { zk.create("/queue_lock/" + child, dfrm.format(new Date()).getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL); // PROCESS QUEUE ENTRY zk.delete("/queue/" + child, -1); } // even so we check the existence of a lock, // it could have been created in the mean time // making create fail. We catch and ignore it. catch (Exception ignore) { } } } } } catch (Exception e){ e.printStackTrace(); } } } These lines try to create a lock file for the entry. public static void main(String[] args) throws Exception { PullQueue pQ = new PullQueue();

pQ.connect("localhost"); try { pQ.new PullWatcher(args[0]); Thread.sleep(Long.MAX_VALUE); } finally { pQ.zk.close(); } } } What happens if a worker fails immediately after creating the lock file? The file is ephemeral and disappears when creator stops responding http://henning.kropponline.de/2013/01/14/zookeeper-distributed-queue-locking/ public class PullQueue extends ConnectionWatcher { private class PullWatcher implements Watcher { private DateFormat dfrm = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z z"); private Random r = new Random(); private String name; private List children; public PullWatcher(String name) throws Exception { this.name = name; children = zk.getChildren("/queue", this); } @Override public void process(WatchedEvent event) { try {

if (event.getType().equals(EventType.NodeChildrenChanged)) { children = zk.getChildren("/queue", this); // get the children and renew watch Collections.sort(children); // we are getting an unsorted list for (String child : children) { if (zk.exists("/queue_lock/" + child, false) == null) { try { zk.create("/queue_lock/" + child, dfrm.format(new Date()).getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL); // PROCESS QUEUE ENTRY zk.delete("/queue/" + child, -1); } // even so we check the existence of a lock, // it could have been created in the mean time // making create fail. We catch and ignore it. catch (Exception ignore) { } } } } } catch (Exception e){ e.printStackTrace(); } } } After processing a queue entry, we delete it and try another

public static void main(String[] args) throws Exception { PullQueue pQ = new PullQueue(); pQ.connect("localhost"); try { pQ.new PullWatcher(args[0]); Thread.sleep(Long.MAX_VALUE); } finally { pQ.zk.close(); } } } http://henning.kropponline.de/2013/01/14/zookeeper-distributed-queue-locking/ Leader election on top of ZK Zookeeper elects a leader internally Uses Paxos, but could use Raft too Internal ZK leader election isnt exposed to services Services can only manipulate ZK nodes But many services need to elect leaders e.g., a storage service that uses two-phase commit Easy to implement leader election on top of ZK Leader election on top of ZK

To volunteer to be a leader 1. 2. 3. Create node z /Election/n_ with sequence and ephemeral flags Let C be /Election/s children, and i be zs sequence number Watch Election/n_j where j is smallest sequence < i Can two servers create the same /Election/n_ node? No, ZK ensures that each file has a unique sequence number Leader election on top of ZK To volunteer to be a leader 1. 2. 3. Create node z /Election/n_ with sequence and ephemeral flags Let C be /Election/s children, and i be zs sequence number

Watch Election/n_j where j is smallest sequence < i Which server is the leader? The one that created /Election/n_j Leader election on top of ZK To volunteer to be a leader 1. 2. 3. Create node z /Election/n_ with sequence and ephemeral flags Let C be /Election/s children, and i be zs sequence number Watch Election/n_j where j is smallest sequence < i How will we know when the leader fails? When node /Election/n_j is deleted Leader election on top of ZK To volunteer to be a leader 1.

2. 3. Create node z /Election/n_ with sequence and ephemeral flags Let C be /Election/s children, and i be zs sequence number Watch Election/n_j where j is smallest sequence < i Server is notified that a child of /Election/ was deleted 1. 2. 3. Let C be the new set of children for /Election/ If z is the smallest node in C, then the volunteer is the leader Otherwise, keep watching for changes in smallest n_j Can two servers ever think that they are the leader? Something to work out on your own

Recently Viewed Presentations

  • Ingraham High School College and Career Night May

    Ingraham High School College and Career Night May

    Not offered enough aid from my first choice College hasreputation for good social activities. The college's graduates get good jobs Other I want to go to school near my home I was offered financial assistance I wanted a school about...
  • CMPE108 Algorithms and Flowcharts 1 1. Software Development

    CMPE108 Algorithms and Flowcharts 1 1. Software Development

    Algorithm, Pseudocode and Flowchart. Algorithm A set of instructions, arranged in a specific logical order, which, when executed, produce the solution for a particular problem.. Two Techniques for representing algorithms: Pseudo. code. Semiformal, English- like language with limited vocabulary.
  • 17 - Home - SPIA UGASPIA UGA

    17 - Home - SPIA UGASPIA UGA

    rule —a type of electoral system in which victory goes to the individual who gets the most votes in an election but not necessarily a majority of the votes cast. The main alternative to plurality rule is proportional representation, but...
  • Systems Sales Enablement Training, hand-out Michael Nelson Systems

    Systems Sales Enablement Training, hand-out Michael Nelson Systems

    Once decoupled the Central Processing Unit (what is now correctly referred to as a PC) went from a horizontal steel box to sleek professional desktops, tricked out gaming rigs, tiny minimalist cubes and eventually someone put the PC back in...
  • Physics - Weebly

    Physics - Weebly

    Classical Conditioning Cont. The unconditioned stimulus is the couple and the unconditioned response from viewers is feelings of happiness. When the unconditioned stimulus is paired with the conditioned stimulus (the tequila) the conditioned response to the tequila is happiness.
  • MSET Portfolio presentation - kv023.k12.sd.us

    MSET Portfolio presentation - kv023.k12.sd.us

    Philosophy of Education and Technology. Change! Life-long learners. Application. 21st century skills. Relate to students! Leadership. If I have learned one thing through teaching the last 7 years it's change is a necessity
  • Social Mobility - Strongsville City Schools

    Social Mobility - Strongsville City Schools

    Social Mobility What is Social Mobility? Definition: Movement from one class —or more usually status group—to another Horizontal Mobility Movement from one position to another within the same social level Ex: Changing jobs without altering occupational status Moving between social...
  • What is FACA?

    What is FACA?

    Your Federal Advisory Committee Kirk Foster FMCSA Office of The Chief Counsel 2014 Guidance for You Federal Advisory Committee Act (1972) - Public Law 92-463, Federal Statute 5 U.S.C. App. 1 DOT Order 1120.3B (1993) Your Advisory Committee Charter My...