A MapReduce job is a unit of work that
the client wants to be performed: it consists of the input data, the MapReduce
program, and configuration information. Hadoop runs the job by dividing it into
tasks, of which there are two types: map tasks and reduce tasks.
- The Map task takes a set of data and
converts it into another set of data, where individual elements are broken
down into tuples (key-value pairs).
- The Reduce task takes the output of the map tasks as its input and combines those data tuples (key-value pairs) into a smaller set of tuples. The reduce tasks always run after the map tasks.
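For example, in the classic word-count job, the map task turns the line "hello world hello" into the pairs ("hello", 1), ("world", 1), ("hello", 1); after grouping, the reduce task combines ("hello", [1, 1]) and ("world", [1]) into the final pairs ("hello", 2) and ("world", 1).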
Input Phase: Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. A RecordReader then reads each split and breaks it into (K, V) records based on the record separator. With the default TextInputFormat, K is the byte offset at which each line starts and V is the contents of that line.
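As a concrete illustration (assuming the default TextInputFormat), a split containing the two lines

hello world
hello hadoop

would be handed to the mapper as the records (0, "hello world") and (12, "hello hadoop"): each key is the byte offset at which the line begins, and each value is the line itself.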
Map Phase: Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. The number of map tasks therefore equals the number of input splits.
The map function transforms each supplied (K, V) pair into a list of intermediate (K, V) pairs. These intermediate key-value pairs are what get sent on to the reducers.
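A minimal sketch of such a map function for word count, written against the org.apache.hadoop.mapreduce API (the class name WordCountMapper is illustrative, not taken from the post):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Break the line into words and emit one intermediate (word, 1) pair per word.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}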
Before Reduce, Shuffle&Sort will
organise the data to K,List[V] pairs and will be passed to Reducer.
Reducer will apply the further processing logic
and will be deriving the final output.
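The matching reduce side, again only a sketch (WordCountReducer is an illustrative name): it receives each word together with the list of 1s emitted for it and sums them.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum the grouped values for this key and emit the final (word, total) pair.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}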
Partitioner:
When there are multiple reducers, the
map tasks partition their output, each creating one partition for each reduce
task. There can be many keys (and their associated values) in each partition,
but the records for any given key are all in a single partition. The
partitioning can be controlled by a user-defined partitioning function, but
normally the default partitioner—which buckets keys using a hash function—works
very well.
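The default HashPartitioner effectively computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A custom partitioner only has to override getPartition; the sketch below uses a made-up policy (FirstLetterPartitioner is not a real Hadoop class) and would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    // Keys whose lowercased first character is 'm' or earlier go to bucket 0,
    // everything else to bucket 1. Purely illustrative; the default hash
    // partitioner is normally the right choice.
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String k = key.toString();
        int bucket = (!k.isEmpty() && Character.toLowerCase(k.charAt(0)) <= 'm') ? 0 : 1;
        return bucket % numReduceTasks;   // always within [0, numReduceTasks)
    }
}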
Combiner:
Many MapReduce jobs are limited by the
bandwidth available on the cluster, so it pays to minimize the data transferred
between map and reduce tasks. Hadoop allows the user to specify a combiner
function to be run on the map output—the combiner function’s output forms the
input to the reduce function. Since the combiner function is an optimization,
Hadoop does not provide a guarantee of how many times it will call it for a
particular map output record, if at all. In other words, calling the combiner
function zero, one, or many times should produce the same output from the reducer.
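For the word-count sketches above, the reduce function just sums counts, which is both commutative and associative, so the reducer class itself can double as the combiner. Registering it is a single call in the driver (class names are the hypothetical ones used earlier):

// Safe only because summing partial counts and then summing those partial
// sums again gives the same result as summing everything once at the reducer.
job.setCombinerClass(WordCountReducer.class);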
Why is the optimal split size the same as the block size?
It is the largest size of input that
can be guaranteed to be stored on a single node. If the split spanned two
blocks, it would be unlikely that any HDFS node stored both blocks, so some of
the split would have to be transferred across the network to the node running
the map task, which is clearly less efficient than running the whole map task
using local data.
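This also matches how FileInputFormat derives the split size in practice. Paraphrasing the Hadoop 2.x behaviour (property names and defaults quoted from memory, so treat them as approximate):

// mapreduce.input.fileinputformat.split.minsize  (default 1)
// mapreduce.input.fileinputformat.split.maxsize  (default Long.MAX_VALUE)
long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
// With the default min and max, splitSize collapses to the block size.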
Hadoop does its best to run the map
task on a node where the input data resides in HDFS. This is called the data
locality optimization since it doesn’t use valuable cluster bandwidth.
Sometimes, however, all three nodes hosting the HDFS block replicas for a map
task’s input split are running other map tasks so the job scheduler will look
for a free map slot on a node in the same rack as one of the blocks. Very
occasionally even this is not possible, so an off-rack node is used, which
results in an inter-rack network transfer.
Map tasks write their output to the local disk, not
to HDFS. Why is this?
Map output is intermediate output: it’s processed
by reduce tasks to produce the final output, and once the job is complete the
map output can be thrown away. So storing it in HDFS, with replication, would
be overkill. If the node running the map task fails before the map output has
been consumed by the reduce task, then Hadoop will automatically rerun the map
task on another node to re-create the map output.
Reduce tasks don’t have the advantage of data
locality—the input to a single reduce task is
normally the output from all mappers. The sorted map outputs have to be
transferred across the network to the node where the reduce task is running,
where they are merged and then passed to the user-defined reduce function. The
output of the reduce is normally stored in HDFS for reliability. For each HDFS
block of the reduce output, the first replica is stored on the local node, with
other replicas being stored on off-rack nodes. Thus, writing the reduce output
does consume network bandwidth, but only as much as a normal HDFS write
pipeline consumes.
The series of steps a MapReduce job goes through is as follows:
1. A MapReduce job usually splits
the input data-set into independent chunks which are processed by the map
tasks in a completely parallel manner.
2. The framework sorts the outputs of the maps,
which are then input to the reduce tasks.
Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
As seen in earlier posts, the MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks. The slaves execute the tasks as directed by the master.
3. The Hadoop job client then submits the job (jar/executable etc.) and the configuration to the JobTracker, which then assumes responsibility for distributing the software/configuration to the slaves, scheduling and monitoring the tasks, and providing status and diagnostic information to the job client.
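A minimal driver sketch of that submission step, using the org.apache.hadoop.mapreduce API (Job.getInstance is the Hadoop 2.x entry point; older releases construct a Job directly) and the hypothetical WordCountMapper/WordCountReducer classes sketched earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);      // jar shipped to the cluster

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // optional optimization
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait
    }
}

Packaged into a jar, it would typically be launched with something like: hadoop jar wordcount.jar WordCountDriver /input/path /output/path.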