Map Reduce and HDFS


Background

The Map Reduce framework and HDFS go hand in hand. The large datasets on which analysis is to be performed are usually stored in an HDFS cluster. Each DataNode in the HDFS cluster is also a compute resource, capable of executing Map Reduce tasks managed by YARN (Yet Another Resource Negotiator).

This allows the framework to schedule tasks on the nodes where the data is already present (the DataNodes), resulting in high aggregate bandwidth across the cluster.

Inputs and Outputs

The Map Reduce framework operates exclusively on key-value pairs <K,V>: the framework views the input to a job as a set of <K,V> pairs and produces another set of <K,V> pairs as the output of the job, possibly of different types.
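As a concrete illustration, here is a minimal word-count style sketch using the org.apache.hadoop.mapreduce API. The class names WordCountMapper and WordCountReducer are made up for this example (each class would normally live in its own source file); the types show how the input and output <K,V> pairs can differ.

```java
// WordCountMapper.java
// Input <K,V>: <byte offset in file, line of text>
// Output <K,V>: <word, 1> — intermediate pairs of a different type.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emit intermediate <word, 1>
            }
        }
    }
}

// WordCountReducer.java
// Input <K,V>: <word, list of counts>; Output <K,V>: <word, total count>.
public class WordCountReducer
        extends org.apache.hadoop.mapreduce.Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // aggregate all counts for this key
        }
        context.write(key, new IntWritable(sum));
    }
}
```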

Map Reduce Job

A MapReduce job is a unit of work that the client wants to be performed. It consists of:

  • Input Data
  • Map Reduce Program
  • Configuration Information

Hadoop runs the job by dividing it into tasks.
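A minimal driver sketch, assuming the WordCountMapper and WordCountReducer classes sketched above, shows how the three pieces come together; the job name and the input/output paths taken from the command line are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // configuration information
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);        // the Map Reduce program
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input data (in HDFS)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output written to HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```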

Map Task and Reduce Task

There are two types of tasks:

  • Map Task
  • Reduce Task

The tasks are scheduled using YARN and run on nodes in the cluster. If a task fails, it is automatically rescheduled to run on a different node.

💡
If a task fails it is automatically scheduled to run on a different node

Input Splits

Hadoop divides the input into fixed-size pieces called input splits, or just splits, and creates one map task for each split. The task runs the user-defined map function for each record in the split.

Optimum Split Size

A delicate balance has to be struck when choosing the size of each split. Splits have to be small enough to leverage parallel processing, but not so small that the overhead of managing splits and creating map tasks begins to dominate total execution time. For most jobs a good split size is around the HDFS block size.

💡
For most jobs a good split size is around the same size as that of HDFS block size
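When a job genuinely needs a different granularity, the split size can be influenced through FileInputFormat. The sketch below is illustrative only; the 64 MB and 256 MB bounds are arbitrary example values, not recommendations.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    static void configureSplits(Job job) {
        // Constrain splits to be at least 64 MB and at most 256 MB.
        // By default the split size tracks the HDFS block size.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}
```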

Data Locality Optimization

Hadoop does its best to run a map task on the node where the input data resides, so valuable cluster bandwidth is not used. This is called the data locality optimisation. Occasionally, none of the nodes holding a replica of a split is available to run the map task (they might be busy running other map tasks). In this case the job scheduler looks for a free map slot on a node in the same rack, and in rare cases an off-rack node is used.


Map Task Output

Map tasks write their output to the local disk, not HDFS. Map output is intermediate and transient in nature: it is consumed only by the reduce tasks and discarded after use. If the node running a map task fails before the intermediate output has been consumed by the reduce task, Hadoop automatically reruns the map task on another node to re-create the output.

Reduce Task Output

Reduce tasks don’t have the advantage of data locality: the input to a single reduce task is normally the output from all map tasks. Reduce task output is stored in HDFS for reliability.


Number of Map and Reduce tasks

The number of map tasks is determined by the number of splits: each split has one map task associated with it. The number of reduce tasks is not governed by the size of the input but is specified independently. When there are multiple reducers, the map tasks partition their output, one partition for each reduce task.
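A minimal sketch of setting the reducer count explicitly, assuming a Job instance like the one in the driver above (the value 4 is just an example):

```java
import org.apache.hadoop.mapreduce.Job;

public class ReducerCount {
    static void setReducers(Job job) {
        // Each map task will partition its output into 4 buckets,
        // one for each reduce task.
        job.setNumReduceTasks(4);
    }
}
```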

Partitioning of Map task output

There can be many keys in each partition, but the records for any given key are all in the same partition. Partitioning can be controlled by a user-defined partitioning function, although normally the default partitioner, which buckets keys using a hash function, works very well.
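As an illustration, a hand-written partitioner that mirrors the behaviour of the default hash partitioner might look like the sketch below; the class name WordPartitioner is made up for the example, and the key/value types match the word-count sketch above.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Bucket by the key's hash so all records for a given key land in
        // the same partition; mask the sign bit to keep the result non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It would be registered on the job with job.setPartitionerClass(WordPartitioner.class).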

Combiner Functions

Many Map Reduce jobs are limited by the bandwidth available on the cluster, so it pays to reduce the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function that works on the map task's output; the output of the combiner function then becomes the input to the reduce task. The combiner function is an optimisation: Hadoop provides no guarantee of how many times it will be run, if at all.

💡
Combiner function is an optimisation. Hadoop does not provide a guarantee of how many times it will run.
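A sketch of registering a combiner, assuming the word-count reducer from the earlier example; because summing partial counts is associative and commutative, the reducer can double as the combiner.

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerConfig {
    static void configureCombiner(Job job) {
        // Pre-aggregates <word, 1> pairs on the map side before they are
        // transferred to the reducers. Hadoop may apply it zero, one,
        // or many times per map task's output.
        job.setCombinerClass(WordCountReducer.class);
    }
}
```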