MapReduce: Simplified Data Processing on Large Clusters
Abstract
- a programming model for processing and generating large data sets
- map: processes a K/V pair -> generates a set of intermediate K/V pairs
- reduce: merges all intermediate values associated with the same intermediate key
Programs:
- automatically parallelized
- executed on a large cluster of machines
Runtime:
- partitioning input data
- scheduling the program's execution across a set of machines
- handling machine failures
- managing the required inter-machine communication
Motivation
- input data is large + computations have to be distributed
New abstraction
- expresses simple computations
- hides the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library
Programming model
- computation takes a set of input K/V pairs -> produces a set of output K/V pairs
- user expresses the computation as two functions: `Map` and `Reduce`
Example
```
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
```
Types
map(k1,v1) -> list(k2,v2)
reduce(k2, list(v2)) -> list(v2)
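The pseudocode above maps directly onto ordinary code. A minimal runnable Python sketch of the same word-count logic (`map_fn`/`reduce_fn` are illustrative names, and generators stand in for `EmitIntermediate`/`Emit`; this is not Google's C++ API):

```python
from collections import defaultdict

def map_fn(key: str, value: str):
    # key: document name; value: document contents
    for w in value.split():
        yield (w, "1")                          # EmitIntermediate(w, "1")

def reduce_fn(key: str, values):
    # key: a word; values: an iterable of counts
    yield str(sum(int(v) for v in values))      # Emit(AsString(result))

# The grouping step below is what the library's shuffle/sort phase does:
# collect all intermediate values for the same key, then reduce each group.
intermediate = defaultdict(list)
for k, v in map_fn("doc1", "the quick fox jumps over the lazy dog the end"):
    intermediate[k].append(v)
for k in sorted(intermediate):
    print(k, next(reduce_fn(k, intermediate[k])))
```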
More examples
- Distributed Grep
- Count of URL Access Frequency
- Reverse Web-Link Graph
- Term-Vector per Host
- Inverted Index
- Distributed Sort
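As one concrete instance from this list, a sketch of the Inverted Index computation (names illustrative): `map` emits a `(word, document ID)` pair for each word, and `reduce` sorts the document IDs to form the posting list.

```python
def index_map(doc_id: str, contents: str):
    # Emit (word, doc_id) for every word occurrence in the document.
    for word in contents.split():
        yield (word, doc_id)

def index_reduce(word: str, doc_ids):
    # Sort and deduplicate the document IDs -> (word, posting list).
    yield (word, sorted(set(doc_ids)))
```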
Implementation
- Many different implementations of the MapReduce interface are possible; the right choice depends on the environment:
- small shared-memory machine
- large NUMA multi-processor
- large collection of networked machines
Execution Overview
- `Map` invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits; the input splits can be processed in parallel by different machines
- `Reduce` invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function, e.g. `hash(key) mod R`
- the number of partitions R and the partitioning function are specified by the user
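A minimal sketch of the default partitioning step in Python (`zlib.crc32` stands in for whatever hash the real implementation uses, since Python's built-in `hash()` is randomized per process):

```python
import zlib

def partition(key: str, R: int) -> int:
    # hash(key) mod R: routes an intermediate key to one of R reduce tasks.
    return zlib.crc32(key.encode("utf-8")) % R
```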
- Sequence:
  1. the MapReduce library in the user program first splits the input files into M pieces, then starts up many copies of the program on a cluster of machines
  2. one copy is special: the `master`; the other copies are `workers`. The `master` has M `map` tasks and R `reduce` tasks to assign; it picks idle workers and assigns each one a `map` task or a `reduce` task
  3. a worker assigned a `map` task reads the contents of the corresponding input split, parses K/V pairs out of it, and passes each pair to the user-defined `Map` function; the intermediate K/V pairs produced by `Map` are buffered in memory
  4. periodically, the buffered pairs are written to local disk, partitioned into R regions; the locations of these buffered pairs on local disk are passed back to the `master`, which forwards the locations to the `reduce` workers
  5. a `reduce` worker notified by the master uses RPC to read the buffered data from the local disks of the map workers; it then sorts all intermediate data by the intermediate keys, since many different keys typically map to the same reduce task; if the data is too large to fit in memory, an external sort is used
  6. the `reduce` worker iterates over the sorted intermediate data; for each unique key, it passes the key and the corresponding set of intermediate values to the user's `Reduce` function; the output of `Reduce` is appended to the final output file for this reduce partition
  7. when all `map` tasks and `reduce` tasks have completed, the `master` wakes up the user program and the MapReduce call returns to user code
- outputs of the execution are available in the R output files -> these can be passed as inputs to another MapReduce call
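A toy, single-process walk through this sequence (assuming the `map_fn`, `reduce_fn`, and `partition` sketches above; there is no master, no workers, no RPC, and no local files, so only the data flow is illustrated):

```python
def run_mapreduce(inputs, map_fn, reduce_fn, R):
    # Steps 3-4: run each map task, buffering pairs into R partitions.
    regions = [{} for _ in range(R)]
    for key, value in inputs:                   # one "input split" per pair
        for k, v in map_fn(key, value):
            regions[partition(k, R)].setdefault(k, []).append(v)
    # Steps 5-6: each reduce task sorts its keys and applies reduce_fn.
    outputs = []
    for r in range(R):                          # one output file per partition
        outputs.append([(k, list(reduce_fn(k, regions[r][k])))
                        for k in sorted(regions[r])])
    return outputs                              # step 7: back to user code

print(run_mapreduce([("doc1", "a b a"), ("doc2", "b c")], map_fn, reduce_fn, R=2))
```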
Master Data structures
- the `master` stores the state (`idle`, `in-progress`, `completed`) of each task + the identity of the worker machine for non-idle tasks
- for each `completed` map task, the master stores the locations and sizes of the R intermediate file regions produced by that task
  - updates to this location/size information arrive as `map` tasks are marked `completed`
  - the information is pushed incrementally to workers that have `in-progress` `reduce` tasks
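A hedged sketch of these master-side structures in Python (field and type names are mine, not the paper's):

```python
from dataclasses import dataclass, field
from enum import Enum

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class MapTask:
    state: State = State.IDLE
    worker: str | None = None      # identity of the assigned worker machine
    # For completed map tasks: (location, size) of each of the R intermediate
    # file regions, pushed incrementally to in-progress reduce workers.
    regions: list[tuple[str, int]] = field(default_factory=list)
```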
Fault Tolerance
Worker Failure
- the `master` pings every worker periodically; no response from a `worker` -> the master marks that `worker` as failed
- `completed` `map` tasks on a failed worker are reset back to `idle` -> eligible for rescheduling on other workers
- any `map` or `reduce` task `in-progress` on the failed worker is likewise reset to `idle` -> rescheduled
- completed `map` tasks must be re-executed on failure because their output is stored on the local disks of the failed machine and is therefore inaccessible
- completed `reduce` tasks do not need to be re-executed since their output is stored in a global file system
- `reduce` tasks that have not yet read the data from the failed worker will read it from the worker that re-executes the map task
- resilient to large-scale worker failures
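A sketch of the rescheduling rule (tasks modeled as plain dicts with `state`/`worker` keys; illustrative only):

```python
def handle_worker_failure(map_tasks, reduce_tasks, failed_worker):
    # Completed map output lives on the failed machine's local disk, so
    # both completed and in-progress map tasks are reset to idle.
    for t in map_tasks:
        if t["worker"] == failed_worker:
            t.update(state="idle", worker=None)
    # Completed reduce output is in the global file system; only
    # in-progress reduce tasks on the failed worker are rescheduled.
    for t in reduce_tasks:
        if t["worker"] == failed_worker and t["state"] == "in-progress":
            t.update(state="idle", worker=None)
```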
Master Failure
- there is only a single master, so its failure is unlikely -> the implementation simply aborts the MapReduce computation if the master fails
Semantics in the Presence of Failures
- Deterministic functions:
  - when `map` and `reduce` are deterministic functions of their inputs -> the distributed execution produces the same output as a non-faulting sequential execution
  - this relies on atomic commits of `map` and `reduce` task outputs
  - Mechanism:
    - each in-progress task writes its output to private temporary files: a `map` task produces R such files (one per reduce task), a `reduce` task produces one
    - when a `map` task completes, the `worker` sends a message to the `master` that includes the names of the R temp files
      - if the master has already received a completion message for that map task -> it ignores the message
      - otherwise, it records the names of the R files in its data structures
    - when a `reduce` task completes, the worker atomically renames its temporary output file to the final output file
    - potential for duplicate messages: a `worker` might complete a map task and send a completion message before the `master` detects that the `worker` has failed; alternatively, a failed `worker` might recover and resend a completion message for a task that has already been re-executed and completed on another worker
- Non-deterministic functions:
  - the output of any individual `reduce` task is equivalent to some single sequential execution of the non-deterministic program
  - however, the outputs of different `reduce` tasks may correspond to different sequential executions
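A sketch of the idempotent commit on the master (hypothetical names; the point is that a second completion message for the same map task is a no-op):

```python
committed = {}   # map task id -> names of its R temp files

def on_map_completion(task_id: int, temp_files: list[str]) -> bool:
    # First completion message wins: a late or resent message from a
    # failed-then-recovered worker, or from a re-executed copy, is ignored.
    if task_id in committed:
        return False
    committed[task_id] = temp_files
    return True
```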
Locality
- the master takes the location information of the input files into account -> schedules a map task on a machine that contains a replica of the corresponding input data (or near one), conserving network bandwidth
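A sketch of the locality preference (hypothetical structures; in the real system GFS supplies the replica locations):

```python
def pick_worker(idle_workers, replica_hosts):
    # Prefer an idle worker that holds a replica of the input split;
    # otherwise fall back to any idle worker. (The paper also tries
    # workers "near" a replica, e.g. same network switch; omitted here.)
    for w in idle_workers:
        if w in replica_hosts:
            return w
    return next(iter(idle_workers))
```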
Backup Tasks
- straggler: a machine that takes an unusually long time to complete one of the last few map or reduce tasks
- solution:
  - when a MapReduce operation is close to completion -> the master schedules backup executions of the remaining `in-progress` tasks
  - a task is marked `completed` whenever either the primary or the backup execution completes
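A thread-based analogue of backup execution (`concurrent.futures` stands in for scheduling the task on two machines; whichever copy finishes first marks the task completed):

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_with_backup(task, *args):
    pool = ThreadPoolExecutor(max_workers=2)
    primary = pool.submit(task, *args)
    backup = pool.submit(task, *args)    # backup execution of the same task
    done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
    result = done.pop().result()         # first finisher wins
    pool.shutdown(wait=False)            # don't wait for the straggler
    return result
```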
Combiner function
- significant repetition in the intermediate keys produced by each map task -> the user may specify an optional `Combiner` function that does partial merging of the data before it is sent over the network
- typically the same code is used for both the `Combiner` and the `Reduce` function; the only difference is where the output goes:
  - `Combiner`: output written to an intermediate file
  - `Reduce`: output written to the final output file
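For word count this means the combiner is the same summing code as reduce, applied to each map task's buffered output before it crosses the network. A sketch:

```python
def combine(buffered_pairs):
    # Partial merge on the map worker: collapse the many ("the", "1")
    # pairs into a single ("the", "<local count>") pair per key.
    counts = {}
    for word, n in buffered_pairs:
        counts[word] = counts.get(word, 0) + int(n)
    return [(word, str(c)) for word, c in counts.items()]
```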
Side-effects
- the library relies on the application writer to make such side effects atomic and idempotent -> typically the task writes to a temporary file and atomically renames it once it is fully generated
- no support for atomic two-phase commit of multiple output files produced by a single task -> tasks that produce multiple output files with cross-file consistency requirements should be deterministic
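A sketch of the write-then-atomic-rename idiom (POSIX rename semantics; `os.replace` is atomic when source and destination are on the same filesystem):

```python
import os, tempfile

def write_atomically(path: str, data: bytes):
    # Write to a private temp file in the same directory, then rename:
    # readers see either no file or the complete file, never a partial one,
    # and a re-executed task that repeats the write is idempotent.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, path)
```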
Skipping Bad Records
- sometimes acceptable to ignore a few records (e.g. statistical analysis on a large data set) -> optional mode that skips records that deterministically crash, in order to make forward progress
- each worker installs a signal handler that catches segmentation violations and bus errors; on a crash it reports the offending record's sequence number to the master, and once the master has seen more than one failure on a record, it tells re-executions of the task to skip that record
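A simplified single-process analogue (exceptions stand in for the segfaults caught by the signal handler, and an in-memory counter for the master's bookkeeping of the reports):

```python
from collections import Counter

failures = Counter()              # master-side: per-record failure count

def run_map_task(records, user_fn):
    for seq, record in enumerate(records):
        if failures[seq] > 1:     # master saw more than one failure here
            continue              # -> skip the record, keep making progress
        try:
            user_fn(record)
        except Exception:         # stands in for the SIGSEGV/bus-error handler
            failures[seq] += 1    # report the offending record's sequence no.
            return False          # the worker "dies"; the task is re-executed
    return True

# Re-execute until the task gets through, skipping repeat offenders:
# while not run_map_task(my_records, my_map_fn): pass
```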
How does the MapReduce implementation at Google achieve scalability and performance on large clusters?
Google's implementation of MapReduce is designed to run on large clusters of commodity PCs. Scalability is achieved through automatic parallelization and distribution of tasks across thousands of machines.
Performance is enhanced by several techniques:
- Data locality optimization: Reading input data from local disks to minimize network overhead.
- Combiner function (optional): Reduces the amount of intermediate data transferred between map and reduce phases.
- Dynamic task assignment: Tasks are assigned to workers dynamically for better load balancing.
Additionally, the system includes:
- Fault tolerance mechanisms to handle machine failures gracefully.
- Backup tasks to mitigate the effects of slow or "straggler" machines.
These features ensure efficient processing of massive datasets with high scalability and performance.