A MapReduce job is a unit of work that
the client wants to be performed: it consists of the input data, the MapReduce
program, and configuration information. Hadoop runs the job by dividing it into
tasks, of which there are two types: map tasks and reduce tasks.
- The Map task takes a set of data and
converts it into another set of data, where individual elements are broken
down into tuples (key-value pairs).
- The Reduce task takes the output
from the Map as its input and combines those data tuples (key-value pairs)
into a smaller set of tuples. The reduce task is always performed after
the map tasks.
Input Phase: Hadoop divides the input to a
MapReduce job into fixed-size pieces called input splits, or just
splits. A RecordReader reads the given split and breaks it into
[K,V] pairs at the record separator: with the default TextInputFormat, K is the byte
offset at which the line starts within the file and V is the contents of the line.
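As a rough illustration (plain Java rather than actual Hadoop classes, and the file contents are made up), the sketch below mimics what the default TextInputFormat/LineRecordReader delivers for a small split: one (start-byte-offset, line-contents) record per line.

// Illustrative only: simulates the records the default LineRecordReader would
// hand to the mapper for a split containing two lines of text.
public class RecordReaderIllustration {
    public static void main(String[] args) {
        String split = "hello world\nhadoop mapreduce\n"; // pretend this is one input split
        long offset = 0;
        for (String line : split.split("\n")) {
            // K = byte offset where the line starts, V = the line itself
            System.out.println("(" + offset + ", \"" + line + "\")");
            offset += line.getBytes().length + 1;        // +1 for the newline separator
        }
    }
}

Running it prints (0, "hello world") and (12, "hadoop mapreduce"), which is the shape of the [K,V] pairs the map phase receives.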
Map Phase: Hadoop creates one map task for
each split, which runs the user-defined map function for each record in the
split. Number of mappers = number of input splits.
Each supplied [K,V] pair is transformed by the map function into a list of
[K,V] pairs (for example, by splitting the value on a separator). These generated
[K,V] pairs are called intermediate key-value pairs and are sent to the reducer.
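A minimal sketch of a map function, using the classic word-count example (the class name and details are illustrative, not something defined in this post): it takes the (byte offset, line) records produced by the input phase and emits a (word, 1) pair for every token; these become the intermediate key-value pairs.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) {
                continue;                 // skip empty tokens from leading whitespace
            }
            word.set(token);
            context.write(word, ONE);     // emit one (word, 1) intermediate pair
        }
    }
}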
Before the reduce phase, shuffle & sort organises
the intermediate data into [K, List[V]] pairs, which are passed to the reducer.
The reducer applies the further processing logic
and derives the final output.
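Continuing the illustrative word-count sketch, a matching reducer receives each word together with the Iterable of counts assembled by shuffle & sort and derives the final total:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();           // combine the List[V] for this key
        }
        total.set(sum);
        context.write(word, total);       // final (word, total) pair
    }
}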
Partitioner:
When there are multiple reducers, the
map tasks partition their output, each creating one partition for each reduce
task. There can be many keys (and their associated values) in each partition,
but the records for any given key are all in a single partition. The
partitioning can be controlled by a user-defined partitioning function, but
normally the default partitioner—which buckets keys using a hash function—works
very well.
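As a sketch of what a user-defined partitioner looks like (the class name and bucketing rule below are invented for illustration; in practice the default HashPartitioner is usually fine), it would be registered on the job with job.setPartitionerClass(...):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Every record for a given key gets the same partition number, so all of that
// key's values end up at the same reduce task.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Bucket words by their first character (assumes non-empty keys).
        return Character.toLowerCase(key.toString().charAt(0)) % numReduceTasks;
    }
}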
Combiner:
Many MapReduce jobs are limited by the
bandwidth available on the cluster, so it pays to minimize the data transferred
between map and reduce tasks. Hadoop allows the user to specify a combiner
function to be run on the map output—the combiner function’s output forms the
input to the reduce function. Since the combiner function is an optimization,
Hadoop does not provide a guarantee of how many times it will call it for a
particular map output record, if at all. In other words, calling the combiner
function zero, one, or many times should produce the same output from the reducer.
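For word count, the reduce function (summing counts) is commutative and associative, so the reducer class from the earlier sketch could double as the combiner. A hypothetical driver fragment (class names carried over from the sketches above):

import org.apache.hadoop.mapreduce.Job;

public class CombinerConfig {
    static void configure(Job job) {
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // optional map-side pre-aggregation
        job.setReducerClass(WordCountReducer.class);   // the real reduce still runs once per key
    }
}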
Why is the optimal split size the same as the block size?
It is the largest size of input that
can be guaranteed to be stored on a single node. If the split spanned two
blocks, it would be unlikely that any HDFS node stored both blocks, so some of
the split would have to be transferred across the network to the node running
the map task, which is clearly less efficient than running the whole map task
using local data.
Hadoop does its best to run the map
task on a node where the input data resides in HDFS. This is called the data
locality optimization since it doesn’t use valuable cluster bandwidth.
Sometimes, however, all three nodes hosting the HDFS block replicas for a map
task’s input split are running other map tasks so the job scheduler will look
for a free map slot on a node in the same rack as one of the blocks. Very
occasionally even this is not possible, so an off-rack node is used, which
results in an inter-rack network transfer.
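For reference, FileInputFormat computes the split size as max(minimumSplitSize, min(maximumSplitSize, blockSize)), so with the default minimum and maximum the split size is simply the HDFS block size. The knobs exist but are normally left alone; a hypothetical fragment:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    static void configure(Job job) {
        // Forcing splits bigger than a block would break data locality, so these
        // overrides are shown only to make the formula above concrete.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);   // 128 MB
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);   // 256 MB
    }
}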
Map tasks write their output to the local disk, not
to HDFS. Why is this?
Map output is intermediate output: it’s processed
by reduce tasks to produce the final output, and once the job is complete the
map output can be thrown away. So storing it in HDFS, with replication, would
be overkill. If the node running the map task fails before the map output has
been consumed by the reduce task, then Hadoop will automatically rerun the map
task on another node to re-create the map output.
Reduce tasks don’t have the advantage of data
locality—the input to a single reduce task is
normally the output from all mappers. The sorted map outputs have to be
transferred across the network to the node where the reduce task is running,
where they are merged and then passed to the user-defined reduce function. The
output of the reduce is normally stored in HDFS for reliability. For each HDFS
block of the reduce output, the first replica is stored on the local node, with
other replicas being stored on off-rack nodes. Thus, writing the reduce output
does consume network bandwidth, but only as much as a normal HDFS write
pipeline consumes.
The series of steps that a MapReduce program
goes through is as follows:
1. A MapReduce job usually splits
the input data-set into independent chunks which are processed by the map
tasks in a completely parallel manner.
2. The framework sorts the outputs of the maps,
which are then input to the reduce tasks.
Typically both the input and the output of the job
are stored in a file system. The framework takes care of scheduling tasks, monitoring
them, and re-executing failed tasks.
As seen in earlier posts, the MapReduce framework
consists of a single master JobTracker and one
slave TaskTracker per cluster node. The master is responsible for
scheduling the jobs' component tasks on the slaves, monitoring them, and
re-executing the failed tasks. The slaves execute the tasks as directed by the
master.
3. The Hadoop job client then
submits the job (jar/executable etc.) and configuration to
the JobTracker, which then assumes the responsibility of distributing
the software/configuration to the slaves, scheduling tasks, monitoring them,
and providing status and diagnostic information to the job client.
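To make step 3 concrete, here is a sketch of a word-count driver (class names carried over from the earlier illustrative sketches): the job client packages this configuration together with the job jar and submits it to the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);     // optional, see the combiner section
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir must not already exist

        // The framework schedules the tasks, monitors them, and re-executes failures.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}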