MapReduce

Acknowledgements: some slides are from Google University (licensed under the Creative Commons Attribution 2.5 License); others are from Jure Leskovec.

MapReduce: a concept from functional programming, applied to a large class of problems.

Java:
  int fooA(String[] list) { return bar1(list) + bar2(list); }
  int fooB(String[] list) { return bar2(list) + bar1(list); }
Do they give the same result? Not necessarily: bar1() or bar2() may modify list, or some other shared state, as a side effect.

Functional programming:
  fun fooA(l: int list) = bar1(l) + bar2(l)
  fun fooB(l: int list) = bar2(l) + bar1(l)
They do give the same result!

Functional programming: operations do not modify data structures; they always create new ones. The original data still exists, in unmodified form.

Functional updates do not modify structures:
  fun foo(x, lst) = let lst' = reverse lst in reverse ( x :: lst' )
  foo: 'a -> 'a list -> 'a list
The foo() function reverses lst, prepends x, and reverses the result; the net effect is to append x to the end of the list. But it never modifies lst itself!

Functions can be used as arguments:
  fun DoDouble(f, x) = f (f x)
It does not matter what f does to its argument; DoDouble() applies it twice.
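The two ideas above, non-destructive update and higher-order functions, can be sketched in runnable Python; the function names mirror the slides and the data is illustrative:

```python
def foo(x, lst):
    # Non-destructive "append": reverse, prepend x, reverse again.
    # Builds a new list; lst itself is never modified.
    return list(reversed([x] + list(reversed(lst))))

def do_double(f, x):
    # Apply f twice, whatever f is.
    return f(f(x))

nums = [1, 2, 3]
print(foo(0, nums))                    # [1, 2, 3, 0]
print(nums)                            # [1, 2, 3] -- original unmodified
print(do_double(lambda n: n + 5, 10))  # 20
```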
What is the type of this function?
  x: 'a
  f: 'a -> 'a
  DoDouble: ('a -> 'a) -> 'a -> 'a

map (functional programming): creates a new list by applying f to each element of the input list; returns the output in order.

map implementation:
  fun map f []      = []
    | map f (x::xs) = (f x) :: (map f xs)
This implementation moves left-to-right across the list, mapping elements one at a time. But does it need to?

Implicit parallelism in map: in a functional setting, the elements of a list being computed by map cannot see the effects of the computations on other elements.
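Because a pure f has no side effects visible to other elements, the list can be split across workers and the results reassembled in order. A minimal sketch using Python's standard thread pool (the workload is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):
    # A pure function: the result depends only on x, no shared state.
    return x * x

data = [1, 2, 3, 4, 5]
sequential = list(map(f, data))
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map applies f to the elements concurrently,
    # but still yields results in input order.
    parallel = list(pool.map(f, data))
print(sequential == parallel)  # True
```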
If the applications of f to the list elements do not depend on each other (no side effects), we can reorder or parallelize execution.

Reduce: moves across a list, applying f to each element plus an accumulator. f returns the next accumulator value, which is then combined with the next element of the list.

[diagram: f applied repeatedly, threading the accumulator from the initial value through each element to the returned result]

The order of the list elements can be significant. Fold left moves left-to-right across the list. Again, if the operation is associative and commutative, the order is not important.
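Fold left corresponds to functools.reduce in Python. A small sketch showing when element order matters and when it does not (data illustrative):

```python
from functools import reduce

# Fold left computes f(f(f(initial, x0), x1), x2), ...
total_a = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)
total_b = reduce(lambda acc, x: acc + x, [4, 3, 2, 1], 0)
print(total_a, total_b)  # 10 10 -- integer + commutes, so element order is irrelevant

word_a = reduce(lambda acc, x: acc + x, ["m", "a", "p"], "")
word_b = reduce(lambda acc, x: acc + x, ["p", "a", "m"], "")
print(word_a, word_b)    # map pam -- concatenation does not commute, so order matters
```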
MapReduce motivation: large-scale data processing.
Google: 20+ billion web pages x 20 KB = 400+ TB.
One computer reads 30-35 MB/sec from disk: ~4 months just to read the web, and ~1,000 hard drives just to store it. Even more is needed to do something with the data.

Web data sets are massive: tens to hundreds of terabytes. They cannot be mined on a single server.

Standard architecture emerging: commodity clusters. A cluster of commodity Linux nodes with a gigabit Ethernet interconnect. How do we organize computations on this architecture, and mask issues such as hardware failure?

Traditional big-iron box (circa 2003): 8 2-GHz Xeons, 64 GB RAM, 8 TB disk, $758,000 USD.
Prototypical Google rack (circa 2003): 176 2-GHz Xeons, 176 GB RAM, ~7 TB disk, $278,000 USD.
In August 2006 Google had ~450,000 machines.

Prototypical architecture.

The challenge: large-scale, data-intensive computing on commodity hardware; process huge datasets on many computers (e.g., data mining).
Challenges:
How do you distribute the computation? Distributed/parallel programming is hard.
Single-machine performance should not matter; we want incremental scalability.
Machines fail.
MapReduce addresses all of the above.
It is an elegant way to work with big data. Idea: co-locate computation and data (and store files multiple times for reliability).
Need: a programming model (Map-Reduce) and infrastructure: a file system (Google: GFS; Hadoop: HDFS) and a runtime engine.

MapReduce: automatic parallelization & distribution; fault-tolerant; provides status and monitoring tools; a clean abstraction for programmers.

Map(k, v) -> (k', v')*
Reduce(k', v'*) -> v''*
Notation: * denotes a list; e.g., Reduce(k, *) takes a key and a list of values.
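These signatures can be exercised with a tiny in-memory simulation of the map, shuffle/group, and reduce phases. This is a sketch in plain Python, not a real framework; all names and data are illustrative:

```python
from collections import defaultdict

def map_fn(doc_name, doc_text):
    # Emit (word, 1) for every word in the document.
    return [(w, 1) for w in doc_text.split()]

def reduce_fn(word, counts):
    # Sum the list of counts for one word.
    return (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k, v in inputs:                  # map phase
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)        # shuffle: group values by key
    return sorted(reduce_fn(k, vs) for k, vs in groups.items())  # reduce phase

docs = [("d1", "the quick fox"), ("d2", "the lazy dog")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

In a real system the grouping step is distributed across machines; the in-memory dictionary here stands in for that shuffle.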
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(output_key, AsString(result));

Reversed web-link graph: for a list of web pages, produce the set of pages whose links point to each of these pages. Email me your solution (pseudocode) by the end of Thursday
27/02.

Key ideas behind MapReduce
Key idea 1: Separate the what from the how. MapReduce abstracts away the distributed parts of the system; the details are handled by the framework. However, in-depth knowledge of the framework is key for performance:
Custom data readers/writers
Custom data partitioning
Memory utilization

Key idea 2: Move processing to the data.
This is a drastic departure from the high-performance computing (HPC) model. HPC draws a distinction between processing nodes and storage nodes and is designed for CPU-intensive tasks. Data-intensive workloads are generally not processor-demanding; the network and I/O are the bottleneck. MapReduce assumes processing and storage nodes are co-located (data locality); distributed filesystems are therefore necessary.

Key idea 3: Scale out, not up! For data-intensive workloads, a large number of commodity servers is preferred over a small number of high-end servers; the cost of supercomputers does not scale linearly.

Some numbers: processing data is quick, I/O is very slow: 1 HDD = 75 MB/sec; 1,000 HDDs = 75 GB/sec. Data volume processed: 80 PB/day at Google; 60 TB/day at Facebook (~2012).

Key idea 4: Shared-nothing infrastructure (both hardware and software).
Sharing vs. shared nothing:
Sharing: manage a common/global state.
Shared nothing: independent entities, no common state.
Functional programming is the key enabler: no side effects, so recovery from failures is much easier. Map/reduce is a subset of functional programming.

More examples
Distributed grep: the map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
Count of URL access frequency: the map function processes logs of web-page requests and outputs (URL, 1). The reduce function adds together all values for the same URL and emits a (URL, total count) pair.
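The reversed web-link graph (the exercise above) fits the same shape: map emits (target, source) for every outgoing link, and reduce collects the sources per target. A single-machine sketch, with illustrative data:

```python
from collections import defaultdict

def reverse_links(pages):
    # pages: {source_url: [target_url, ...]}
    groups = defaultdict(list)
    for source, targets in pages.items():   # "map": emit (target, source) per link
        for target in targets:
            groups[target].append(source)   # "shuffle": group sources by target
    # "reduce": emit (target, sorted list of sources)
    return {t: sorted(s) for t, s in groups.items()}

pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
print(reverse_links(pages))
# {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```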
Reverse web-link graph: the map function outputs (target, source) pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair (target, list(source)).
Term-vector per host: the map function emits a (hostname, term vector) pair for each input document, where a term vector summarizes the most important words in a document as (word, frequency) pairs. The reduce function adds the term vectors for each host together, discarding infrequent terms.
More info:
"MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat, http://labs.google.com/papers/mapreduce.html
"The Google File System", Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, http://labs.google.com/papers/gfs.html