# Lecture 2 - Theoretical Underpinnings of MapReduce

Acknowledgements: some slides are from Google University (licensed under the Creative Commons Attribution 2.5 License); others are from Jure Leskovec.

MapReduce is a concept taken from functional programming and applied to a large number of problems.

Java:

```java
int fooA(String[] list) { return bar1(list) + bar2(list); }
int fooB(String[] list) { return bar2(list) + bar1(list); }
```

Do they give the same result? Not necessarily: bar1 and bar2 may have side effects, so the order in which they are called can matter.

Functional programming (SML-style):

```sml
fun fooA (l : int list) = bar1(l) + bar2(l)
fun fooB (l : int list) = bar2(l) + bar1(l)
```

They do give the same result!

Functional programming: operations do not modify data structures:

- They always create new ones
- The original data still exists, in unmodified form

Functional updates do not modify structures:

```sml
fun foo (x, lst) =
    let val lst' = reverse lst
    in  reverse (x :: lst')
    end

(* foo : 'a * 'a list -> 'a list *)
```

The foo() function above reverses the list, conses the new element x onto the front, and reverses the result again; the net effect is to append x to the end of lst. But it never modifies lst itself!

Functions can be used as arguments:

```sml
fun DoDouble f x = f (f x)
```

It does not matter what f does to its argument; DoDouble() will do it twice.

What is the type of this function?

- x : 'a
- f : 'a -> 'a
- DoDouble : ('a -> 'a) -> 'a -> 'a

map (functional programming): creates a new list by applying f to each element of the input list, and returns the output in order.

map implementation:

```sml
fun map f []      = []
  | map f (x::xs) = (f x) :: (map f xs)
```

This implementation moves left-to-right across the list, mapping elements one at a time.
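A direct Python transcription of this recursive definition (illustrative only; Python's built-in `map` is the idiomatic tool):

```python
def my_map(f, xs):
    # Empty list: nothing to apply f to.
    if not xs:
        return []
    # Apply f to the head, then recurse on the tail (left to right).
    return [f(xs[0])] + my_map(f, xs[1:])

assert my_map(lambda x: x * 2, [1, 2, 3]) == [2, 4, 6]
```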

But does it need to?

Implicit parallelism in map: in a functional setting, the elements of a list being computed by map cannot see the effects of the computations on the other elements.
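Because the applications of f to different elements are independent, they can run concurrently; a minimal sketch using Python's `concurrent.futures` (the pool here is just a stand-in for the distributed workers):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

items = list(range(8))
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map applies square to the elements concurrently,
    # but still returns the results in input order.
    result = list(pool.map(square, items))

assert result == [x * x for x in items]
```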

Since f has no side effects, its applications to different elements are independent: we can reorder or parallelize the execution.

Reduce: moves across a list, applying f to each element plus an accumulator; f returns the next accumulator value, which is then combined with the next element of the list.

(Figure: the accumulator starts from an initial value and is threaded through successive applications of f; the final application returns the result.)

The order of the list elements can be significant: fold left moves left-to-right across the list. Again, if the operation is associative and commutative, the order is not important.
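The fold just described can be written as a left fold; a small Python sketch (`functools.reduce` is the built-in equivalent):

```python
def foldl(f, acc, xs):
    # Move left to right, feeding each element and the running
    # accumulator to f; f returns the next accumulator value.
    for x in xs:
        acc = f(acc, x)
    return acc

# Addition is associative and commutative, so evaluation order
# (and hence parallelization) does not change the result.
assert foldl(lambda a, x: a + x, 0, [1, 2, 3, 4]) == 10
# Subtraction is not, so order matters: ((((0-1)-2)-3)-4) = -10.
assert foldl(lambda a, x: a - x, 0, [1, 2, 3, 4]) == -10
```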

MapReduce motivation: large-scale data processing.

Google: 20+ billion web pages x 20 KB each = 400+ TB.

- One computer reads 30-35 MB/sec from disk: ~4 months just to read the web
- ~1,000 hard drives just to store the web
- Even more resources to do something with the data

Web data sets are massive: tens to hundreds of terabytes.
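A back-of-envelope check of the figures above (taking 35 MB/sec as the single-disk read rate):

```python
web_bytes = 400e12          # ~400 TB of web data
read_rate = 35e6            # ~35 MB/sec from one disk
days = web_bytes / read_rate / 86400
# roughly 130 days, i.e. about four months of sequential reading
assert 120 < days < 140
```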

They cannot be mined on a single server. The emerging standard architecture: commodity clusters.

- Cluster of commodity Linux nodes
- Gigabit Ethernet interconnect

How do we organize computations on this architecture, while masking issues such as hardware failure?

|      | Traditional big-iron box (circa 2003) | Prototypical Google rack (circa 2003) |
|------|---------------------------------------|---------------------------------------|
| CPUs | 8 x 2 GHz Xeon                        | 176 x 2 GHz Xeon                      |
| RAM  | 64 GB                                 | 176 GB                                |
| Disk | 8 TB                                  | ~7 TB                                 |
| Cost | $758,000 USD                          | $278,000 USD                          |

In August 2006 Google had ~450,000 machines of this prototypical architecture.

The challenge: large-scale data-intensive computing on commodity hardware; processing huge datasets on many computers, e.g., for data mining.

Challenges:

- How do you distribute the computation? Distributed/parallel programming is hard
- Single-machine performance should not matter: incremental scalability
- Machines fail

MapReduce addresses all of the above.

It is an elegant way to work with big data. Idea: collocate computation and data (and store files multiple times for reliability).

The MapReduce runtime engine provides:

- Automatic parallelization and distribution
- Fault tolerance
- Status and monitoring tools
- A clean abstraction for programmers

Notation: Reduce(k, *), where * denotes the list of values accumulated for key k.

```
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of ones
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(output_key, AsString(result));
```

Reversed Web-Link Graph: for a list of web pages, produce the set of pages that have links pointing to each of these pages. Email me your solution (pseudocode) by the end of Thursday

27/02.

Key ideas behind MapReduce:

Key idea 1: separate the what from the how. MapReduce abstracts away the distributed part of the system; the details are handled by the framework. However, in-depth knowledge of the framework is key for performance:

- Custom data readers/writers
- Custom data partitioning
- Memory utilization

Key idea 2: move processing to the data.

This is a drastic departure from the high-performance computing (HPC) model. HPC draws a distinction between processing nodes and storage nodes, and is designed for CPU-intensive tasks. Data-intensive workloads, by contrast, are generally not processor-demanding: the network and I/O are the bottleneck. MapReduce therefore assumes processing and storage nodes to be co-located (data locality); distributed filesystems are necessary.

Key idea 3: scale out, not up! For data-intensive workloads, a large number of commodity servers is preferred over a small number of high-end servers; the cost of supercomputers does not scale linearly.

Some numbers:

- Processing data is quick, I/O is very slow: 1 HDD = 75 MB/sec, but 1,000 HDDs = 75 GB/sec
- Data volumes processed: 80 PB/day at Google; 60 TB/day at Facebook (~2012)

Key idea 4: shared-nothing infrastructure (both hardware and software).

Sharing vs. shared nothing:

- Sharing: manage a common/global state
- Shared nothing: independent entities, no common state

Functional programming as the key enabler:

- No side effects
- Recovery from failures is much easier
- map/reduce is a subset of functional programming

More examples:

Distributed Grep: the map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count of URL Access Frequency: the map function processes logs of web page requests and outputs (URL, 1). The reduce function adds together all values for the same URL and emits a (URL, total count) pair.

Reversed Web-Link Graph: the map function outputs (target, source) pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair (target, list(source)).

Further examples include Term-Vector per Host.
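The word-count program shown earlier can be simulated end-to-end on a single machine; a toy sketch of the model (with a group-by-key "shuffle" between the phases), not the distributed runtime:

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # EmitIntermediate(word, 1) for each word in the document.
    for word in contents.split():
        yield word, 1

def reduce_fn(word, values):
    # Sum the list of ones accumulated for this word.
    return word, sum(values)

def run_mapreduce(docs, map_fn, reduce_fn):
    groups = defaultdict(list)          # the "shuffle": group values by key
    for name, contents in docs.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}
counts = run_mapreduce(docs, map_fn, reduce_fn)
assert counts["the"] == 2 and counts["fox"] == 1
```

Swapping in a different map_fn/reduce_fn pair expresses the other examples (e.g., distributed grep) without changing the runner.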

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System", http://labs.google.com/papers/gfs.html
