Anatomy of Map Reduce refers to the fundamental architecture and data flow of the Map Reduce programming model, which is used for processing large data sets in a distributed computing environment. A Map Reduce job breaks a large data processing task into two main phases: the Map Phase and the Reduce Phase. This allows parallel processing across many machines, which is faster and more reliable than processing on a single machine.
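The two phases are often summarized with type signatures of the following form, where K and V stand for key and value types:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)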
Job Tracker (Master Node)
The Job Tracker acts as the master, coordinating all jobs by receiving client requests, dividing them into tasks, assigning tasks to nodes, and monitoring their progress.
Role: A central orchestrator for the entire Map Reduce process.
Responsibilities:
• Receives and accepts Map Reduce job requests from clients.
• Determines data locations by consulting the Name Node.
• Divides the job into smaller, executable map and reduce tasks.
• Schedules these tasks to run on available Task Trackers, prioritizing data locality.
• Monitors the overall progress of the job and the status of individual tasks.
• Handles task failures by re-scheduling failed tasks on different Task Trackers.
• Reports the final status of the job back to the client.
Task Tracker (Slave Nodes)
The Task Tracker, running on each slave/data node, acts as the slave by executing the assigned map and reduce tasks, performing computations on local data, and sending regular status updates and "heartbeats" to the Job Tracker.
Role: The distributed workhorses that execute the individual tasks.
Responsibilities:
• Runs on every data node within the cluster.
• Executes the map and reduce tasks assigned to it by the Job Tracker.
• Performs the actual data processing and computation on the data stored on its node.
• Sends periodic "heartbeat" signals to the Job Tracker to report its status and to confirm that it is still alive and functional.
• Reports progress updates and task completion status back to the Job Tracker.
Input Splits and Input Format:
The input data (text files) is first divided into smaller, manageable units called input splits by the Input Format. Each input split is processed by a single Map task.
Mapping Phase:
Each Map task receives an input split and processes it independently. The Mapper function takes each line from the split, tokenizes it into words, and emits key-value pairs where the key is the word and the value is 1 (representing one occurrence).
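A minimal sketch of such a word-count mapper, written against the Hadoop Java API (the class name WordCountMapper is illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count mapper: the input key is the byte offset of a line,
// the input value is the line itself; the output is (word, 1) for every token.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1) for each occurrence
        }
    }
}

For the input line "hello hadoop hello", this mapper emits (hello, 1), (hadoop, 1) and (hello, 1).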
Shuffling and Sorting Phase:
• This is the intermediate phase between Map and Reduce. The intermediate key-value pairs from all Map tasks are shuffled and sorted, grouping all values associated with the same key together.
• The output from the mappers is partitioned, ensuring that all values for a specific key go to the same Reducer.
• Shuffle: Redistributes data based on the intermediate keys so that all values for the same key go to the same reducer.
• Sort: Groups the values for each unique key. This is handled automatically by the Map Reduce framework.
• After shuffling and sorting, the data for the word-count example looks like (word, [1, 1, 1, ...]) for each distinct word, ready for the Reducers.
Reducing Phase:
• Each Reduce task receives a key and the list of values associated with that key.
• The Reducer function processes this data, typically by aggregating or combining the values.
• Each reducer takes a key and its list of values and processes them to generate the final output, as in the sketch below.
• The final key-value pairs from the reducers are written to HDFS or another storage system.
• The output format can be customized, but output is typically stored as text or sequence files.
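A matching word-count reducer sketch, again using the Hadoop Java API (the class name WordCountReducer is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative word-count reducer: receives a word and the list of counts
// emitted for it, and writes the total count as the final output.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();        // aggregate all occurrences of this word
        }
        total.set(sum);
        context.write(key, total);     // emit (word, total count)
    }
}

Given the grouped input (hello, [1, 1]), this reducer writes (hello, 2) to the final output.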
Steps of Map Reduce Job Execution flow
Input Splits
↓
[Mapper]
↓
Intermediate <k, v> pairs
↓
Shuffle & Sort (Group by Key)
↓
[Reducer]
↓
Final Output <k, v>
1. Input Files
The data for a Map Reduce job is stored in input files, which typically reside in HDFS. The format of the input files is arbitrary; line-based log files and binary formats can also be used.
2. Input Format
The Input Format then defines how to split and read these input files. It selects the files or other objects used for input and creates the Input Splits.
3. Input Splits
An Input Split represents the data that will be processed by an individual Mapper. One map task is created for each split, so the number of map tasks equals the number of Input Splits. The framework divides each split into records, which the mapper processes.
4. Record Reader
The Record Reader communicates with the Input Split and converts the data into key-value pairs suitable for reading by the Mapper. By default the Text Input Format is used, whose record reader assigns the byte offset of each line as the key and the line itself as the value. The Record Reader keeps reading until the entire split has been consumed, and the resulting key-value pairs are sent to the mapper for further processing.
5. Mapper
The Mapper processes each input record produced by the Record Reader and generates intermediate key-value pairs. The intermediate output can be completely different from the input pair, and the output of the mapper is the full collection of these key-value pairs.
The Hadoop framework does not store the mapper's output on HDFS, because the data is temporary and writing it to HDFS would create unnecessary replicated copies. The Mapper instead passes its output to the combiner (if one is configured) for further processing.
6. Combiner
The Combiner is a mini-reducer that performs local aggregation on the mapper's output, minimizing the data transferred between the mapper and the reducer. In the word-count example, a combiner can collapse three (word, 1) pairs emitted on one node into a single (word, 3) pair before anything crosses the network. When the combiner finishes, the framework passes its output to the partitioner for further processing.
7. Partitioner
The Partitioner comes into play when a job uses more than one reducer. It takes the output of the combiner and performs partitioning.
Partitioning is done on the basis of the key: a hash function applied to the key (or a subset of the key) determines the partition, so all records with the same key end up in the same partition. Each partition is then sent to one reducer.
Partitioning allows the map output to be distributed evenly over the reducers.
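The default hash-based partitioning behaves roughly like the sketch below; a custom partitioner can be supplied by extending Partitioner (the class name WordPartitioner is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: mirrors the default hash-based behaviour, sending
// each key to a partition (and therefore a reducer) based on its hash code.
public class WordPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit before taking the modulus so the result is non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}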
8. Shuffling and Sorting
After partitioning, the output is shuffled to the reduce nodes. Shuffling is the physical movement of data over the network. Once all the mappers have finished and their output has been shuffled onto the reducer nodes, the framework merges and sorts this intermediate output and provides it as input to the reduce phase.
9. Reducer
The Reducer takes the set of intermediate key-value pairs produced by the mappers as input and runs a reducer function on each group to generate the output. The output of the reducer is the final output, which the framework stores on HDFS.
10. Record Writer
The Record Writer writes the output key-value pairs from the Reducer phase to the output files.
11. Output Format
The Output Format defines the way the Record Writer writes these output key-value pairs to the output files. The Output Format instances provided by Hadoop write files in HDFS; thus the final output of the reducer ends up on HDFS.
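A minimal driver sketch tying these steps together, assuming the illustrative WordCountMapper, WordCountReducer and WordPartitioner classes sketched earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Illustrative driver that wires the steps above together: input format,
// mapper, combiner, partitioner, reducer and output format.
public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setInputFormatClass(TextInputFormat.class);    // step 2: Input Format
        job.setMapperClass(WordCountMapper.class);         // step 5: Mapper
        job.setCombinerClass(WordCountReducer.class);      // step 6: Combiner (local aggregation)
        job.setPartitionerClass(WordPartitioner.class);    // step 7: Partitioner
        job.setReducerClass(WordCountReducer.class);       // step 9: Reducer
        job.setOutputFormatClass(TextOutputFormat.class);  // step 11: Output Format

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing the reducer class as the combiner works here because summing counts is associative and commutative; for other jobs a separate combiner class may be needed.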
Map Reduce type and format
Data Types used for keys and values (in both the map and reduce phases)
Input and Output Formats that define how data is read and written
For the word-count example (see the sketch after this list):
• Input: Offset (LongWritable), Line (Text)
• Intermediate: Word (Text), Count (IntWritable)
• Output: Word (Text), Total Count (IntWritable)
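A small illustrative helper showing where these types are declared on a job. The input types (LongWritable offset, Text line) come from the Input Format and are not set explicitly; the helper name is hypothetical:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

// Illustrative helper: declares the intermediate and final key/value types
// for the word-count job described above.
public class WordCountTypes {

    public static void declareTypes(Job job) {
        // Intermediate (map output) types: word -> count.
        // Only needed explicitly when they differ from the final output types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Final (reduce output) types: word -> total count.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}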
Input Format
KeyValue Text Input Format:
· Parses each line into a key and a value separated by a delimiter (default tab); see the configuration sketch below.
· Key: Text before the delimiter.
· Value: Text after the delimiter.
Sequence File Input Format:
· For reading Hadoop Sequence Files (binary key-value format).
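A hedged configuration sketch for KeyValue Text Input Format with a non-default delimiter; the property name is the one used by Hadoop 2.x, and the helper class and method names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Illustrative setup of KeyValueTextInputFormat with a comma as the
// key/value delimiter instead of the default tab character.
public class KeyValueInputExample {

    public static Job newJob() throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 2.x property for the key/value separator (default is "\t").
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "key-value input example");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        return job;
    }
}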
Output Format
Text Output Format:
· Writes each key-value pair as a line of text, with key and value separated by a tab character by default.
Sequence File Output Format:
· Writes a Hadoop Sequence File (binary format). Useful for intermediate data or efficient further processing; see the sketch below.
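An illustrative sketch of switching a job to Sequence File Output Format with block compression (the helper class and method names are hypothetical; the Hadoop calls are standard):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Illustrative switch from the default TextOutputFormat to a block-compressed
// SequenceFileOutputFormat for binary key-value output.
public class SequenceOutputExample {

    public static void configureOutput(Job job, Path outputDir) {
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputPath(job, outputDir);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    }
}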
Map Reduce Features
• Scalability: Map Reduce can handle vast amounts of data by distributing processing tasks across numerous servers in a cluster, allowing it to scale horizontally.
• Parallel Processing: The framework processes data in parallel across multiple machines, enabling faster data analysis compared to traditional single-server approaches.
• Fault Tolerance: Built-in features like data replication and automatic task re-execution on different nodes ensure that processing continues even if a node fails.
• Simple Programming Model: Developers can focus on the input and output of their map and reduce functions, with the framework handling the complexities of distributed execution.
• Cost-Effective: By leveraging clusters of commodity servers instead of high-end hardware, Map Reduce provides an affordable solution for big data processing.
• Flexibility: The versatile programming model can be used with various data sources and formats, allowing businesses to work with new data types and sources.
• Data Locality: Map Reduce tries to process data as close to its physical storage location as possible, reducing the need to move large datasets across the network.
• Reliability: Replication of datasets across multiple nodes ensures data availability and prevents data loss, even in the event of a node failure.
Assignment Questions
1) What is the Map Reduce programming model?
2) Explain the roles of the Map, Shuffle and Sort, and Reduce phases in the Map Reduce framework.
3) Describe the data flow in a Map Reduce job.
4) Explain the purpose of the Map and Reduce functions.
5) Explain the difference between the Mapper and Reducer functions in Map Reduce.
6) How does the Shuffle and Sort phase work in Map Reduce?
7) How does fault tolerance work in Map Reduce?
8) Explain the features of Map Reduce.
9) What are the input and output formats of Map Reduce?
10) What are the responsibilities of the Job Tracker and Task Tracker?