Exploiting Cost-Performance Trade-offs for Modern Cloud Systems
Final Examination
Muntasir Raihan Rahman

Modern Cloud Systems: Cost-Performance Trade-offs

System Class | Example | Cost | Performance
Storage | Key-value (NoSQL/KV) stores | Consistency | Availability
Processing | Graph processing | Replication | Job completion time
Virtualization | Disaster recovery | Reconfiguration | Availability

The best achievable cost-performance trade-off depends on workload and network conditions.

Thesis Statement
It is feasible to imbue a wide variety of cloud systems with adaptive and opportunistic mechanisms to exploit the cost-performance tradeoff space to meet desired tradeoffs.
Systems: key-value storage, graph computation engine, cloud disaster recovery.

Contributions

Project | Cloud System | Cost Metric | Performance Metric | Mechanism | Status
PCAP | Storage | Consistency (C) | Availability (A) | Tune read-delay knob | Mature during prelim
GeoPCAP | Storage | Consistency (C) | Availability (A) | Tune geo-read-delay knob | 30% done before prelim, now complete
GCVM | Virtualization | Reconfiguration (R) | Availability (A) | Hypervisor-managed grouping of VMs | 70% done before prelim, now complete
OPTiC | Computation | Replication (R) | Completion time (T) | Opportunistically overlap graph preprocessing and computation phases | Done entirely after prelim (mature)

[Figure: cost-performance tradeoff spaces for the three systems]
- PCAP/GeoPCAP: Consistency (cost) vs. Latency/Availability (performance). Continuously navigate the tradeoff space to (1) perform close to the optimal tradeoff and (2) meet the SLA.
- OPTiC: Graph storage (cost) vs. Compute time (performance). Our contribution improves on the state of the art.
- GCVM: Reconfiguration (cost) vs. Availability (performance). Our software CG contribution vs. the state-of-the-art hardware CG, with 50 msec VM pause overhead; move from one point in the tradeoff space to another.

OPTiC: Opportunistic graph Processing on Multi-Tenant Clusters
- First principled exploration of graph processing on multi-tenant clusters
- Bridges the gap between the graph processing layer and the cluster scheduler layer
- Key technique: overlap the graph preprocessing phase of waiting jobs with the computation phase of running jobs
- Implemented in the Apache Giraph + YARN stack
- About 50% improvement in average job turnaround time (median and 95th percentile) for realistic workloads (Facebook and Yahoo! production traces) and network conditions (log-normal delays, inside a data-center)
- Cost: replication of input graphs in the DFS increases from 3 to 3 + an opportunistically created replica (at most 1)

Graphs are Ubiquitous
- Biological: food web, protein interaction network, metabolic network
- Man-made: online social network (OSN), web graph, financial network, the Internet

Graphs are Massive
Scale of the Facebook graph: |V| = 1.1B, |E| = 150B (May 2013).

Graph Processing Engines
Dato PowerGraph, Apache Giraph, Databricks GraphX.

Anatomy of a Graph Processing Job
1. Graph preprocessing: (1) load from disk, (2) partition. Preprocessing time can be significant, and is included in total job turnaround time.
2. Graph computation (Gather-Apply-Scatter), with workers synchronizing at a barrier.
3. Termination: (1) write results to disk, (2) teardown.

Graph Processing on Multi-Tenant Clusters
- Graph processing engines (Apache Giraph, GraphX) do not take advantage of multi-tenancy.
- Cluster schedulers (YARN, Apache Mesos) are unaware of the graph nature of jobs; they assume only map-reduce or similar abstractions.
- OPTiC bridges this gap.
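As a concrete illustration of the synchronous, barrier-per-superstep computation model above, here is a minimal BSP-style SSSP sketch. This is not the Giraph implementation, just the execution model: a vertex stays active only while it receives a distance-improving message.

```python
from collections import defaultdict

def sssp_supersteps(edges, source):
    """BSP-style single-source shortest paths (illustrative sketch).

    Each superstep, active vertices send candidate distances to their
    neighbors; a vertex stays active only while it receives a message
    that improves its distance. In a real engine (e.g., Giraph) all
    workers would synchronize at a barrier between supersteps.
    """
    graph = defaultdict(list)
    for u, v, w in edges:
        graph[u].append((v, w))

    dist = {source: 0}
    messages = {source: 0}  # active vertices; AVC == len(messages)
    while messages:         # terminate when no vertex is active
        candidates = defaultdict(lambda: float("inf"))
        for u, d in messages.items():          # "scatter"
            for v, w in graph[u]:
                candidates[v] = min(candidates[v], d + w)
        messages = {}
        for v, d in candidates.items():        # "gather/apply"
            if d < dist.get(v, float("inf")):
                dist[v] = d
                messages[v] = d                # v stays active
    return dist
```

The per-superstep `len(messages)` here is the same kind of active-vertex count that OPTiC's progress estimator tracks later in this talk.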

OPTiC: Opportunistic graph Processing on multi-Tenant Clusters
Key idea: opportunistically overlap (1) the graph preprocessing phase of waiting jobs with (2) the graph computation phase of currently running jobs.
System assumptions:
- Synchronous graph processing (workers synchronize periodically)
- Over-subscribed cluster (there is always a waiting job)
- No preemption
- All input graphs stored in a distributed file system (e.g., HDFS)

Opportunistic Overlapping
[Figure: cluster scheduler with Job 1 at 70%, Job 2 at 90%, and Job 3 at 50% complete]
Start the preprocessing phase of the next waiting job on the cluster resources running the maximum progress job (MPJ).
Benefits: the MPJ is most likely to free up cluster resources first, so by the time the next waiting job is scheduled, its preprocessing phase is already underway.

Two Techniques for Opportunistic Overlapping
Desired feature: minimal interference with currently running jobs.
- Progress-Aware Memory Prefetching: prefetch the graph of the waiting job directly into the memory of the MPJ server(s). Since that memory is being used to store and compute on the MPJ's graph, this interferes with the MPJ and can increase its run-time.
- Progress-Aware Disk Prefetching (PADP): prefetch the graph of the waiting job into the disks of the MPJ server(s). No interference with the MPJ, and the local disk fetch avoids network contention: when the MPJ is done, the waiting job loads its graph from local disk instead of a remote disk.
Cluster data (Amazon 20-server virtual cluster): Amazon EC2 disk bandwidth mean 141.5 MB/s; Amazon EC2 network bandwidth mean 73.2 MB/s.

It is cheaper to fetch from local disk than over the network. (MPJ = Maximum Progress Job)

Architecture: OPTiC with PADP
[Figure: Graph Processing Engine, Progress Estimation Engine, OPTiC Scheduler, Central Job Queue, Cluster Scheduler, Distributed File System, Replica Placement Engine]

OPTiC-PADP Scheduling Algorithm
OPTiC scheduler, for the next waiting job in the queue:
- Fetch progress information of running jobs (each running job periodically sends progress information to OPTiC)
- Determine the server(s) S running the MPJ (Maximum Progress Job)
- Tell the DFS to create an additional replica of the next waiting job's graph in the disks of S
Cost note: (1) creating additional replicas on disk increases the (non-zero) storage cost; (2) but there is a lot of available space on disks, which are mostly under-utilized; (3) so the actual dollar cost to the system is close to zero.
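The scheduler steps above can be sketched as one scheduling decision. The `dfs.replicate` method and the job/progress structures below are hypothetical stand-ins for the real DFS and YARN interfaces, not OPTiC's actual code:

```python
def padp_schedule(waiting_queue, running_jobs, dfs):
    """One OPTiC-PADP-style scheduling decision (sketch).

    `running_jobs` maps job id -> (progress, servers); `dfs` is any
    object with a hypothetical replicate(graph, servers) method.
    The scheduler picks the maximum-progress job (MPJ) and asks the
    DFS to place one extra replica of the next waiting job's input
    graph on the MPJ's servers, so that job's preprocessing overlaps
    with the MPJ's computation phase.
    """
    if not waiting_queue or not running_jobs:
        return None
    next_job = waiting_queue[0]
    # Progress information is assumed to be reported periodically.
    mpj_id, (_, mpj_servers) = max(running_jobs.items(),
                                   key=lambda kv: kv[1][0])
    dfs.replicate(next_job["graph"], mpj_servers)  # at most 1 extra replica
    return mpj_id
```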

Cluster scheduler: schedules the next waiting job when the MPJ finishes. The next waiting job, scheduled on S, fetches its graph from local disk instead of a remote disk in the DFS.

Estimating Progress of Graph Computation
- Profiling: profile the run-time of various graph algorithms on different cluster configurations for different graph sizes. Huge overhead (-).
- Use the cluster scheduler's progress estimator: for example, Giraph programs are mapped to map-reduce programs, so the cluster's map-reduce progress estimator can estimate graph computation progress. Cluster-dependent (-).
- Use graph-processing-layer metrics: track the evolution of the active vertex count (AVC). A vertex is active as long as it has incoming messages from the previous iteration; at termination, AVC = 0. Profile-free and cluster-agnostic (+).

Evolution of AVCP = AVC/N
[Figure: AVCP over time for SSSP, PageRank, k-core decomposition, and connected components]
Each algorithm shows (1) an initial non-decreasing phase, with AVCP at or going towards 1, followed by (2) a decreasing phase, with AVCP going towards 0.
Progress measure: how far the job is from the final AVCP = 0%.

Progress Comparator Algorithm (MPJ = Max Progress Job)
[Figure: Job 1 in its non-decreasing phase at AVCP 70%; Job 2 past its peak, in its decreasing phase at AVCP 50%]
CASE 1: Jobs in different phases. Job 2 is in the second (decreasing) phase: MPJ.

Progress Comparator Algorithm (2)

(MPJ = Max Progress Job)
[Figure: Job 1 in its non-decreasing phase at AVCP 70%; Job 2 in its non-decreasing phase at AVCP 20%, with the AVCP range divided into intervals L, M, H]
CASE 2: Both jobs in the non-decreasing phase. Job 1 is closer to 100% in the first phase: MPJ.
CASE 3: Both jobs in the decreasing phase (handled similarly).
The intervals introduce some randomness for jobs whose AVCP values are close to each other.

When Does the Algorithm Work?
Two jobs of the same type (e.g., SSSP) but different sizes: Job 1 (10K vertices), Job 2 (100K vertices). If both jobs start at the same time and have the same amount of resources, then at time t, AVCP(Job 1) is closer to termination (AVCP = 0%) than AVCP(Job 2).
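The comparator's three cases can be sketched as follows. This is a simplified model, not the exact OPTiC code: phases are inferred from each job's AVCP history, a job in its decreasing phase always beats one in its non-decreasing phase (CASE 1), and coarse L/M/H buckets with a random tie-break supply the randomness for nearby AVCP values (CASES 2 and 3):

```python
import random

def phase(history):
    """'dec' once AVCP has fallen from its peak, else 'nondec'."""
    return "dec" if history and history[-1] < max(history) else "nondec"

def bucket(avcp):
    """Coarse L/M/H interval; jobs in the same bucket tie."""
    return 0 if avcp < 1/3 else (1 if avcp < 2/3 else 2)

def max_progress_job(jobs):
    """jobs: dict name -> AVCP history (fractions in [0, 1]).
    Returns the name of the job predicted to finish first (the MPJ)."""
    def key(name):
        hist = jobs[name]
        avcp = hist[-1]
        in_dec = phase(hist) == "dec"
        # CASE 1: a decreasing-phase job is always ahead.
        # CASE 2 (both nondec): higher bucket (closer to 100%) wins.
        # CASE 3 (both dec): lower bucket (closer to 0%) wins.
        progress = (2 - bucket(avcp)) if in_dec else bucket(avcp)
        return (in_dec, progress, random.random())  # random tie-break
    return max(jobs, key=key)
```

The `random.random()` tie-break is what lets two jobs landing in the same L/M/H interval be picked interchangeably, matching the randomness the slide describes.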

Two jobs of different types but similar run-times: Job 1 (SSSP), Job 2 (k-core). The algorithm makes the correct choice if both jobs are in their second phase during the comparison.

When Does the Algorithm Fail?
Two jobs of different types with very different run-times: Job 1 (SSSP) with estimated run-time T, and Job 2 (k-core) with estimated run-time >= 10T, both starting at the same time. Job 1 is in its non-decreasing phase and finishes within 1 minute, while Job 2 is in a lengthy decreasing phase and finishes after 10 or more minutes. The algorithm chooses Job 2 (in the later phase) as the MPJ, but Job 1 always finishes first!

Evaluation Setup
- Testbed: 9 quad-core servers with 64GB memory and 250Mbps bandwidth. Network delay emulation uses netem, with a log-normal distribution modeling packet delays inside a data-center.
- YARN: 1 master, 8 slaves; each slave has one 8GB container, so YARN has 64GB of memory in total.

- HDFS: all input graphs stored in HDFS with replication factor 3.
- Test algorithms: SSSP, k-core.
- Graphs: uniform randomly generated synthetic graphs.

Metrics and Algorithms
- Performance metric: job turnaround time (TAT) = job finish time - job arrival time.
- Cost metric: replication factor (RF).
- Compared scheduling algorithms: Baseline (B), the default YARN FIFO policy (RF = 3); and PADP (P), the OPTiC PADP policy (RF = 3 + an opportunistically created replica, at most 1).
- Performance improvement = (B - P) / B * 100.

Evaluation Roadmap
1. Real job size distribution trace
2. Real job size distribution trace + real job arrival trace
3. Scalability
4. Job heterogeneity

Facebook Production Trace Workload
Job size distribution from a Facebook trace (vertex count proportional to map count); most jobs in the cluster are small. Poisson arrival process with mean 7s; network delay LN(3ms).
Median TAT improves by 73%; 95th percentile TAT improves by 54%.
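The improvement numbers above come from comparing TAT percentiles between the two schedulers. A minimal sketch of the metric, using a simple nearest-rank percentile convention that may differ from the one used in the actual evaluation:

```python
def percentile(samples, p):
    """Nearest-rank percentile (a simple convention for this sketch)."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, int(round(p / 100 * (len(s) - 1)))))
    return s[k]

def tat_improvement(baseline_tats, padp_tats, p):
    """Performance improvement = (B - P) / B * 100 at percentile p,
    where B and P are the Baseline and PADP turnaround times."""
    b = percentile(baseline_tats, p)
    q = percentile(padp_tats, p)
    return (b - q) / b * 100
```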

Yahoo! Production Trace Workload
Map-reduce job trace from a Yahoo! production cluster of several hundred servers; the trace has 300 jobs with job sizes and job arrival times, and a bursty arrival process.
Median TAT improves by 47%; 95th percentile TAT improves by 48%.

Scale and Graph Commonality Experiment
Baseline TAT scales super-linearly, while PADP is linear with a small slope. Graph commonality (the degree of graph sharing among jobs) increases from left to right, as does average graph size.

Experiment with Different Types of Jobs
20-job trace (80% SSSP, 20% k-core), with run-time(k-core) >= 10 * run-time(SSSP); Poisson arrivals with mean 7s; log-normal network delay with mean 3s. The algorithm mis-predicts k-core jobs as the MPJ, yet PADP performance remains competitive with Baseline.

Experiment with Different Types of Jobs (2)
Same setup as before, except k-core run-time is made similar to SSSP run-time by using more distributed workers. The algorithm predicts the MPJ correctly because both jobs are likely to be in the same phase. Median TAT improves by 82%; 95th percentile TAT improves by 70%.

Related Work
- Cluster schedulers (map-reduce abstraction, multi-tenant): YARN with the Fair Scheduler; Mesos with Dominant Resource Fairness. These provide multi-tenancy with fairness for sharing cluster resources; the OPTiC scheduler is additionally aware of graph computation progress.
- Graph processing (single-tenant): Pregel, the first message-passing system based on BSP; GraphLab, which proposes shared-memory computation; PowerGraph, which optimizes for power-law graphs; LFGraph, which improves performance with cheap partitioning and publish-subscribe message flow. OPTiC improves performance for multi-tenant graph processing.
- Progress estimation: many systems estimate the progress of map-reduce jobs (e.g., KAMD); SQL progress estimators include DNE (Driver Node Estimator) and TGN (Total Get Next). OPTiC's progress estimator is based on graph-processing-level metrics.

Summary of OPTiC
- First principled investigation of multi-tenant graph processing; bridges the gap between the graph processing layer and the cluster scheduler layer.
- Key idea: overlap graph preprocessing of waiting jobs with the graph computation of the maximum-progress running job, using a graph-level metric to estimate progress.
- About 50% improvement in job turnaround time for realistic workloads under realistic network conditions.
- Cost: increased replication of input graphs in the DFS (from 3 to 3 + an opportunistically created replica, at most 1).

PCAP [presented during prelim]

Contribution: A new adaptive storage system design that uses fine-grained control knobs to navigate the consistency-latency tradeoff space, in order to (1) meet novel consistency/latency SLAs and (2) perform close to the optimal envelope defined by the PCAP theorem.
Cost metric: Consistency.
Performance metric: Latency/Availability.
Core mechanism: Adaptively tune a read delay to meet the consistency (or latency) SLA while optimizing the other metric.
Implementation: Incorporated into Apache Cassandra and Basho Riak.
Status: Mature during prelim; additional experiments since then with new workloads and network conditions.

PCAP Systems
[Figure: a client issues GET(k) to a coordinator server in a KV-store (Cassandra, Riak) within a single DC; the read-delay knob sits between the coordinator and the replica servers, under adaptive control]
Increasing the read delay increases latency for the read operation, but gives replica servers time to commit the latest writes, increasing the chance of fresh data. The PCAP system continuously adapts its control knobs to (1) stay close to the optimal tradeoff and (2) satisfy the PCAP SLA while optimizing the other metric.
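One iteration of such an adaptive read-delay knob might look like the following sketch, where `pic` is the measured probability of a stale read. The step size and tolerance are illustrative parameters, not the controller from the actual PCAP system:

```python
def tune_read_delay(delay_ms, measured_pic, sla_pic,
                    step_ms=1.0, tolerance=0.01):
    """One control iteration for a PCAP-style read-delay knob (sketch).

    If the measured staleness probability (pic) violates the
    consistency SLA, raise the read delay so replicas get more time to
    commit the latest writes; if we are comfortably under the SLA,
    lower the delay to win back read latency.
    """
    if measured_pic > sla_pic:
        return delay_ms + step_ms            # tighten consistency
    if measured_pic < sla_pic - tolerance:
        return max(0.0, delay_ms - step_ms)  # optimize latency
    return delay_ms                          # within the SLA band
```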

PCAP Evaluation
[Figure: latency (PCAP pua vs. optimal) and consistency (PCAP pic vs. the consistency SLA pic = 0.135) over time, across two network-condition jumps; PCAP tracks the SLA through both jumps]

GeoPCAP
Contribution: A geo-distributed extension of PCAP: (1) novel probabilistic composition rules for composing consistency-latency guarantees across multiple geo-distributed storage systems, and (2) an adaptive system (based on feedback control theory) that tunes a geo-read-delay knob to meet consistency-latency SLAs for a geo-distributed storage system.
Cost metric: Consistency.
Performance metric: Latency/Availability.
Core mechanism: Adaptively tune the geo-read delay to meet the consistency (or latency) SLA while optimizing the other metric.
Implementation: Realistic simulation.
Status: Started deriving composition rules before prelim (30%); now complete.

Probabilistic Composition Rules
Composition rules characterize the global behavior across all data-centers, enabling us to check whether the composed system meets its SLA.
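To illustrate the flavor of such rules (these are not the actual GeoPCAP composition rules), per-data-center probabilities of a bad event, such as a late or stale response, can be composed under an independence assumption:

```python
from math import prod

def compose_any(ps):
    """P(at least one component event occurs), assuming independence.
    E.g., a geo-operation that must touch every data-center misses its
    deadline if ANY data-center is slow."""
    return 1 - prod(1 - p for p in ps)

def compose_all(ps):
    """P(every component event occurs), assuming independence.
    E.g., a geo-read answered by the first of several data-centers is
    late only if ALL of them are slow."""
    return prod(ps)
```

Checking a composed system against an SLA then reduces to evaluating the composed probability and comparing it with the SLA threshold.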

Evaluation: GeoPCAP
Setup: realistic simulation with 4 data-centers; each DC's latency-consistency behavior is modeled using a LinkedIn trace, and WAN delays follow a normal distribution.
[Figure: WAN delay jump from 20 ms to 22 ms; consistency degrades after the delay jump, while the latency SLA is met both before and after the jump]

GCVM
Contribution: A consistency group abstraction allows groups of VMs to be check-pointed and replicated as a unit for recovery under failures. The state-of-the-art hardware realization has a high reconfiguration cost. We propose a software-defined consistency group abstraction that enables low-cost reconfiguration of groups with modest impact on availability (50 msec VM pause time overhead).
Cost metric: Reconfiguration of consistency group members.
Performance metric: Availability.
Core mechanism: Coordinated 2PC realization of pause, parallel snapshot, and unpause of groups of VMs across distributed hypervisors.
Implementation: Incorporated into the ESX hypervisor; used to show recovery of PostgreSQL and a distributed Cassandra database.
Status: Finished design and prototype before prelim (70%); scalability experiments after prelim show the system's effectiveness.
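The coordinated pause / parallel-snapshot / unpause mechanism can be sketched as a two-phase protocol. The hypervisor interface below (`vms`, `pause`, `snapshot`, `unpause`) is a hypothetical stand-in, not the ESX API:

```python
def group_snapshot(hypervisors, group):
    """Coordinated pause -> parallel snapshot -> unpause (sketch).

    Pausing every VM in the group before any snapshot is taken is what
    makes the group snapshot crash-consistent: no VM can issue a write
    that depends on another VM's post-snapshot state. Phase 1 pauses
    everything (aborting, and unpausing, if any pause fails); phase 2
    snapshots and unpauses, which a real system would do in parallel
    to keep the pause window short.
    """
    paused = []
    for hv, vm in ((hv, vm) for hv in hypervisors
                   for vm in hv.vms(group)):
        try:
            hv.pause(vm)
            paused.append((hv, vm))
        except Exception:
            for phv, pvm in paused:      # abort: roll back phase 1
                phv.unpause(pvm)
            return False
    for hv, vm in paused:                # phase 2
        hv.snapshot(vm)
        hv.unpause(vm)
    return True
```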

Problem and Solution
Consistency group abstraction: a group of distributed VMs (with disks) check-pointed and replicated as a unit.
Correctness (crash-consistency): the group snapshot must contain a prefix of all writes submitted to VMs in the group.
The state-of-the-art HW solution requires manual mapping and has a high reconfiguration cost. The proposed SW solution is a 2PC realization of pause, parallel snapshot, and unpause across distributed ESX servers, correct due to dependent writes and the coordinated pause.

Evaluation
Real application recovery, plus scalability: total VM pause time is (a) flat and (b) low, below 40 ms.

[Recap figure (as on slide 5): cost-performance tradeoff spaces — OPTiC (graph storage cost vs. compute time), GCVM (reconfiguration cost vs. availability; SW CG vs. HW CG, with 50 msec VM pause overhead), PCAP/GeoPCAP (consistency cost vs. latency/availability)]

Thesis Statement
It is feasible to imbue a wide variety of cloud systems with

adaptive and opportunistic mechanisms to exploit the cost-performance tradeoff space to meet desired tradeoffs.
- Efficiently navigate the tradeoff space (PCAP, GeoPCAP)
- Move from one tradeoff point to another (GCVM, OPTiC)
Systems: key-value storage, graph computation engine, cloud disaster recovery.

Extensions
- Probabilistic tradeoff analysis for transaction systems
- Opportunistic multi-tenant computation for arbitrary data-flow jobs based on progress-aware scheduling
- OPTiC for distributed machine learning
- Adaptive consistency for groups of VMs

Publications

1. Muntasir Raihan Rahman, Indranil Gupta. OPTiC: Opportunistic Graph Processing for Multi-Tenant Clusters. Working paper, 2016.
2. Muntasir Raihan Rahman, Sudarsan Piduri, Ilya Languev, Rean Griffith, Indranil Gupta. Software-defined Consistency Group Abstractions for Groups of Virtual Machines. Proc. ACM PODC Workshop on Distributed Cloud Computing (DCC), co-located with ACM PODC, 2016 (acceptance rate 40%).
3. Muntasir Raihan Rahman, Lewis Tseng, Son Nguyen, Indranil Gupta, Nitin Vaidya. Characterizing and Adapting the Consistency-Latency Tradeoff in Distributed Key-value Storage Systems. arXiv:1509.02464; under third round of reviews at ACM TAAS, 2015.
4. Si Liu, Peter Csaba Ölveczky, Muntasir Raihan Rahman, Jatin Ganhotra, Indranil Gupta, José Meseguer. Formal Modeling and Analysis of RAMP Transaction Systems. Proc. ACM SAC, 2016 (acceptance rate 23%).
5. Edgar Pek, Pranav Garg, Muntasir Raihan Rahman, Madhusudan Parthasarathy, Indranil Gupta. Certified Program Models for Eventual Consistency. Under submission, 2015.
6. Si Liu, Son Nguyen, Jatin Ganhotra, Muntasir Raihan Rahman, Indranil Gupta, José Meseguer. Quantitative Analysis of Consistency in NoSQL Key-value Stores. Proc. QEST, 2015 (acceptance rate 40%).
7. Md Tanvir Al Amin, Shen Li, Muntasir Raihan Rahman, Panindra Tumkur Seetharamu, Shiguang Wang, Tarek Abdelzaher, Indranil Gupta, Mudhakar Srivatsa, Raghu Ganti, Reaz Ahmed, Hieu Le. Social Trove: A Self-Summarizing Storage Service for Social Sensing. Proc. ICAC, 2015 (acceptance rate 20%). Best Paper Award.
8. Wojciech Golab, Muntasir Raihan Rahman, Alvin AuYoung, Kimberly Keeton, Indranil Gupta. Client-centric Benchmarking of Eventual Consistency for Cloud Storage Systems. Proc. IEEE ICDCS, 2014 (acceptance rate 13%).
9. Wojciech Golab, Muntasir Raihan Rahman, Alvin AuYoung, Kimberly Keeton, Xiaozhou Steve Li. Eventually Consistent: Not What You Were Expecting? CACM, 2014.
10. Si Liu, Muntasir Raihan Rahman, Stephen Skeirik, Indranil Gupta, José Meseguer. Formal Modeling and Analysis of Cassandra in Maude. Proc. ICFEM, 2014 (acceptance rate 30%).
11. Brian Cho, Muntasir Raihan Rahman, Tej Chajed, Indranil Gupta, Cristina Abad, Nathan Roberts, Philbert Lin. Natjam: Design and Evaluation of Eviction Policies for Supporting Priorities and Deadlines in Mapreduce Clusters. Proc. ACM SOCC, 2013 (acceptance rate 20%).
12. Muntasir Raihan Rahman, Wojciech Golab, Alvin AuYoung, Kimberly Keeton, Jay J. Wylie. Toward a Principled Framework for Benchmarking Consistency. Proc. HotDep, co-located with OSDI, 2012.
