Anatomy of Map Reduce refers to the fundamental architecture and data flow of the Map Reduce programming model, which is used for processing large data sets in a distributed computing environment. A Map Reduce job breaks a large data processing task into two main phases: the Map Phase and the Reduce Phase. This allows parallel processing across many machines, which is faster and more reliable than processing on a single machine.
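The two phases are often summarized with type signatures of the following form, where K and V stand for key and value types:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)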
Job Tracker (Master Node)
The Job Tracker acts as the master, coordinating all jobs by receiving client requests, dividing them into tasks, assigning tasks to nodes, and monitoring their progress.
Role: A central orchestrator for the entire Map Reduce process.
Responsibilities:
• Receives and accepts Map Reduce job requests from clients.
• Determines data locations by consulting the Name Node.
• Divides the job into smaller, executable map and reduce tasks.
• Schedules these tasks to run on available Task Trackers, prioritizing data locality.
• Monitors the overall progress of the job and the status of individual tasks.
• Handles task failures by re-scheduling failed tasks on different Task Trackers.
• Reports the final status of the job back to the client.
Task Tracker (Slave Nodes)
The Task Tracker, running on each slave/data node, acts as the slave by executing the assigned map and reduce tasks, performing computations on local data, and sending regular status updates and "heartbeats" to the Job Tracker.
Role: The distributed workhorses that execute the individual tasks.
Responsibilities:
• Runs on every data node within the cluster.
• Executes the map and reduce tasks assigned to it by the Job Tracker.
• Performs the actual data processing and computation on the data stored on its node.
• Sends periodic "heartbeat" signals to the Job Tracker to report its status and to confirm that it is still alive and functional.
• Reports progress updates and task completion status back to the Job Tracker.
Input Splits and Input Format:
The input data (text files) is first divided into smaller, manageable units called input splits by the Input Format. Each input split is processed by a single Map task.
Mapping Phase:
Each Map task receives an input split and processes it independently. The Mapper function takes each line from the split, tokenizes it into words, and emits key-value pairs where the key is the word and the value is 1 (representing one occurrence).
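A minimal sketch of such a word-count mapper, written against the Hadoop Java API (the class name WordCountMapper is illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count mapper: the input key is the byte offset of a line,
// the input value is the line itself; the output is (word, 1) for every token.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1) for each occurrence
        }
    }
}

For the input line "hello hadoop hello", this mapper emits (hello, 1), (hadoop, 1) and (hello, 1).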
Shuffling and Sorting Phase:
• This is the intermediate phase between Map and Reduce. The intermediate key-value pairs from all Map tasks are shuffled and sorted, grouping all values associated with the same key together.
• The output from the mappers is partitioned, ensuring that all values for a specific key go to the same Reducer.
• Shuffle: Redistributes data based on the intermediate keys so that all values for the same key go to the same reducer.
• Sort: Groups the values for each unique key. This is handled automatically by the Map Reduce framework.
• After shuffling and sorting, the data for the word-count example looks like (word, [1, 1, 1, ...]) for each distinct word, ready for the Reducers.
Reducing Phase:
• Each Reduce task receives a key and the list of values associated with that key.
• The Reducer function processes this data, typically by aggregating or combining the values.
• Each reducer takes a key and its list of values and processes them to generate the final output, as in the sketch below.
• The final key-value pairs from the reducers are written to HDFS or another storage system.
• The output format can be customized, but output is typically stored as text or sequence files.
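A matching word-count reducer sketch, again using the Hadoop Java API (the class name WordCountReducer is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative word-count reducer: receives a word and the list of counts
// emitted for it, and writes the total count as the final output.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();        // aggregate all occurrences of this word
        }
        total.set(sum);
        context.write(key, total);     // emit (word, total count)
    }
}

Given the grouped input (hello, [1, 1]), this reducer writes (hello, 2) to the final output.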
Steps of Map Reduce Job Execution flow
Input Splits
↓
[Mapper]
↓
Intermediate <k, v> pairs
↓
Shuffle & Sort (Group by Key)
↓
[Reducer]
↓
Final Output <k, v>
1. Input Files
The data for a Map Reduce job is stored in input files, which typically reside in HDFS. The format of the input files is arbitrary; line-based log files and binary formats can also be used.
2. Input Format
The Input Format then defines how to split and read these input files. It selects the files or other objects used for input and creates the Input Splits.
3. Input Splits
An Input Split represents the data that will be processed by an individual Mapper. One map task is created for each split, so the number of map tasks equals the number of Input Splits. The framework divides each split into records, which the mapper processes.
4. Record Reader
The Record Reader communicates with the Input Split and converts the data into key-value pairs suitable for reading by the Mapper. By default the Text Input Format is used, whose record reader assigns the byte offset of each line as the key and the line itself as the value. The Record Reader keeps reading until the entire split has been consumed, and the resulting key-value pairs are sent to the mapper for further processing.
5. Mapper
The Mapper processes each input record produced by the Record Reader and generates intermediate key-value pairs. The intermediate output can be completely different from the input pair, and the output of the mapper is the full collection of these key-value pairs.
The Hadoop framework does not store the mapper's output on HDFS, because the data is temporary and writing it to HDFS would create unnecessary replicated copies. The Mapper instead passes its output to the combiner (if one is configured) for further processing.
6. Combiner
The Combiner is a mini-reducer that performs local aggregation on the mapper's output, minimizing the data transferred between the mapper and the reducer. In the word-count example, a combiner can collapse three (word, 1) pairs emitted on one node into a single (word, 3) pair before anything crosses the network. When the combiner finishes, the framework passes its output to the partitioner for further processing.
7. Partitioner
The Partitioner comes into play when a job uses more than one reducer. It takes the output of the combiner and performs partitioning.
Partitioning is done on the basis of the key: a hash function applied to the key (or a subset of the key) determines the partition, so all records with the same key end up in the same partition. Each partition is then sent to one reducer.
Partitioning allows the map output to be distributed evenly over the reducers.
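The default hash-based partitioning behaves roughly like the sketch below; a custom partitioner can be supplied by extending Partitioner (the class name WordPartitioner is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: mirrors the default hash-based behaviour, sending
// each key to a partition (and therefore a reducer) based on its hash code.
public class WordPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit before taking the modulus so the result is non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}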
8. Shuffling and Sorting
After partitioning, the output is shuffled to the reduce nodes. Shuffling is the physical movement of data over the network. Once all the mappers have finished and their output has been shuffled onto the reducer nodes, the framework merges and sorts this intermediate output and provides it as input to the reduce phase.
9. Reducer
The Reducer takes the set of intermediate key-value pairs produced by the mappers as input and runs a reducer function on each group to generate the output. The output of the reducer is the final output, which the framework stores on HDFS.
10. Record Writer
The Record Writer writes the output key-value pairs from the Reducer phase to the output files.
11. Output Format
The Output Format defines the way the Record Writer writes these output key-value pairs to the output files. The Output Format instances provided by Hadoop write files in HDFS; thus the final output of the reducer ends up on HDFS.
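A minimal driver sketch tying these steps together, assuming the illustrative WordCountMapper, WordCountReducer and WordPartitioner classes sketched earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Illustrative driver that wires the steps above together: input format,
// mapper, combiner, partitioner, reducer and output format.
public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setInputFormatClass(TextInputFormat.class);    // step 2: Input Format
        job.setMapperClass(WordCountMapper.class);         // step 5: Mapper
        job.setCombinerClass(WordCountReducer.class);      // step 6: Combiner (local aggregation)
        job.setPartitionerClass(WordPartitioner.class);    // step 7: Partitioner
        job.setReducerClass(WordCountReducer.class);       // step 9: Reducer
        job.setOutputFormatClass(TextOutputFormat.class);  // step 11: Output Format

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing the reducer class as the combiner works here because summing counts is associative and commutative; for other jobs a separate combiner class may be needed.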
Map Reduce type and format
Data Types used for keys and values (in both the map and reduce phases)
Input and Output Formats that define how data is read and written
For the word-count example (see the sketch after this list):
• Input: Offset (LongWritable), Line (Text)
• Intermediate: Word (Text), Count (IntWritable)
• Output: Word (Text), Total Count (IntWritable)
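A small illustrative helper showing where these types are declared on a job. The input types (LongWritable offset, Text line) come from the Input Format and are not set explicitly; the helper name is hypothetical:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

// Illustrative helper: declares the intermediate and final key/value types
// for the word-count job described above.
public class WordCountTypes {

    public static void declareTypes(Job job) {
        // Intermediate (map output) types: word -> count.
        // Only needed explicitly when they differ from the final output types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Final (reduce output) types: word -> total count.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}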
Input Format
KeyValue Text Input Format:
· Parses each line into a key and a value separated by a delimiter (default tab); see the configuration sketch below.
· Key: Text before the delimiter.
· Value: Text after the delimiter.
Sequence File Input Format:
· For reading Hadoop Sequence Files (binary key-value format).
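A hedged configuration sketch for KeyValue Text Input Format with a non-default delimiter; the property name is the one used by Hadoop 2.x, and the helper class and method names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Illustrative setup of KeyValueTextInputFormat with a comma as the
// key/value delimiter instead of the default tab character.
public class KeyValueInputExample {

    public static Job newJob() throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 2.x property for the key/value separator (default is "\t").
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "key-value input example");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        return job;
    }
}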
Output Format
Text Output Format:
· Writes each key-value pair as a line of text, with key and value separated by a tab character by default.
Sequence File Output Format:
· Writes a Hadoop Sequence File (binary format). Useful for intermediate data or efficient further processing; see the sketch below.
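An illustrative sketch of switching a job to Sequence File Output Format with block compression (the helper class and method names are hypothetical; the Hadoop calls are standard):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Illustrative switch from the default TextOutputFormat to a block-compressed
// SequenceFileOutputFormat for binary key-value output.
public class SequenceOutputExample {

    public static void configureOutput(Job job, Path outputDir) {
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputPath(job, outputDir);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    }
}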
Map Reduce Features
• Scalability: Map Reduce can handle vast amounts of data by distributing processing tasks across numerous servers in a cluster, allowing it to scale horizontally.
• Parallel Processing: The framework processes data in parallel across multiple machines, enabling faster data analysis compared to traditional single-server approaches.
• Fault Tolerance: Built-in features like data replication and automatic task re-execution on different nodes ensure that processing continues even if a node fails.
• Simple Programming Model: Developers can focus on the input and output of their map and reduce functions, with the framework handling the complexities of distributed execution.
• Cost-Effective: By leveraging clusters of commodity servers instead of high-end hardware, Map Reduce provides an affordable solution for big data processing.
• Flexibility: The versatile programming model can be used with various data sources and formats, allowing businesses to work with new data types and sources.
• Data Locality: Map Reduce tries to process data as close to its physical storage location as possible, reducing the need to move large datasets across the network.
• Reliability: Replication of datasets across multiple nodes ensures data availability and prevents data loss, even in the event of a node failure.
Assignment Questions
1) What is the Map Reduce programming model?
2) Explain the roles of the Map, Shuffle and Sort, and Reduce phases in the Map Reduce framework.
3) Describe the data flow in a Map Reduce job.
4) Explain the purpose of the Map and Reduce functions.
5) Explain the difference between the Mapper and Reducer functions in Map Reduce.
6) How does the Shuffle and Sort phase work in Map Reduce?
7) How does fault tolerance work in Map Reduce?
8) Explain the features of Map Reduce.
9) What are the input and output formats of Map Reduce?
10) What are the responsibilities of the Job Tracker and Task Tracker?