MapReduce: Simplified Data Processing on Large Clusters

Abstract

Motivation

New abstraction

Programming model

Example

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");


reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
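The paper's pseudocode can be translated into a runnable Python sketch. The function and variable names (`map_fn`, `reduce_fn`, `run`) are mine, and the sequential driver merely stands in for the distributed runtime: it runs map over every input, groups intermediate pairs by key, then applies reduce.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: document contents
    for word in value.split():
        yield (word, "1")

def reduce_fn(key, values):
    # key: a word; values: a list of counts (as strings, as in the paper)
    return str(sum(int(v) for v in values))

def run(documents):
    # Sequential stand-in for the MapReduce runtime: map, group by key, reduce.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)
    return {k: reduce_fn(k, vs) for k, vs in sorted(intermediate.items())}

counts = run({"doc1": "the quick fox", "doc2": "the lazy dog"})
# counts["the"] == "2"
```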

Types

map(k1,v1) -> list(k2,v2)
reduce(k2, list(v2)) -> list(v2)
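These signatures can be written out with Python type hints (a sketch; the aliases `MapFn`/`ReduceFn` are illustrative, not from the paper). Note the asymmetry the types capture: map's input keys/values come from a different domain than the intermediate ones, while reduce's output values share the intermediate value type v2.

```python
from typing import Callable, Iterable, TypeVar

K1 = TypeVar("K1"); V1 = TypeVar("V1")
K2 = TypeVar("K2"); V2 = TypeVar("V2")

# map(k1,v1) -> list(k2,v2)
MapFn = Callable[[K1, V1], Iterable[tuple[K2, V2]]]
# reduce(k2, list(v2)) -> list(v2)
ReduceFn = Callable[[K2, list[V2]], list[V2]]

# Word count fits with k1=str (doc name), v1=str (contents),
# k2=str (word), v2=str (count-as-string, as in the paper's example).
wc_map: MapFn[str, str, str, str] = lambda k, v: [(w, "1") for w in v.split()]
wc_reduce: ReduceFn[str, str] = lambda k, vs: [str(sum(int(x) for x in vs))]
```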

More examples

Implementation

Execution Overview

Master Data Structures

Fault Tolerance

Worker Failure

Master Failure

Semantics in the Presence of Failures

Locality

Backup Tasks

Combiner Function

Side-effects

Skipping Bad Records

How does the MapReduce implementation at Google achieve scalability and performance on large clusters?

Google's implementation of MapReduce is designed to run on large clusters of commodity PCs. Scalability is achieved through automatic parallelization and distribution of tasks across thousands of machines.

Performance is enhanced by several techniques:

- Locality: the master schedules map tasks on or near machines that already hold a replica of the input data, conserving network bandwidth.
- Backup tasks: near the end of a job, the master schedules redundant executions of the remaining in-progress tasks, so a slow "straggler" machine cannot delay completion.
- Combiner functions: optional partial merging of intermediate data on the map worker reduces the volume of data sent over the network.

Additionally, the system includes:

- Fault tolerance: tasks on a failed worker are detected via heartbeats and simply re-executed on another machine.
- Skipping bad records: an optional mode detects records that cause deterministic crashes and skips them on re-execution.

Together, these features let the system process very large datasets efficiently across thousands of machines.
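The combiner idea mentioned above can be sketched in a few lines: run a partial merge on the map worker's output before anything crosses the network. This is valid for word count because addition is commutative and associative. The function names here are mine, not part of any MapReduce API.

```python
from collections import Counter

def map_fn(doc):
    # Emits one (word, 1) pair per occurrence.
    return [(w, 1) for w in doc.split()]

def combine(pairs):
    # Combiner: partially merge counts locally, shrinking the
    # intermediate data shipped to the reduce workers.
    c = Counter()
    for w, n in pairs:
        c[w] += n
    return list(c.items())

doc = "the the the quick the"
raw = map_fn(doc)        # 5 intermediate pairs
combined = combine(raw)  # 2 pairs: ("the", 4), ("quick", 1)
```

The final reduce output is unchanged; only the amount of intermediate data differs.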