bigdata

Explaining Map Reduce

Hadoop Map Reduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel, on large clusters of commodity hardware (thousands of nodes) in a reliable and fault tolerant manner.

A Map Reduce Job splits the input data-set into independent chunks which are processed by the map tasks in completely parallel manner. The framework sorts the output of map tasks which are then input to reduce tasks.

The Map and Reduce functions can be represented by following:

Explaining Map Reduce

This post attempts to explain map reduce in simple terms.

Problem Statement

It is easiest to understand Map Reduce by walking through an example. Let's consider a deck of 52 playing cards. Each deck has 4 suits: Hearts, Diamonds, Spades and Clubs. Each suit has 13 cards: 2 to 10 numeric cards, Jack, Queen, King and Ace.

The problem statement is to find sum of all numeric cards by suit. Therefore the expected answer is : Hearts (2+3+4+5+6+7+8+9+10) = 54. Similarly Diamonds = 54, Spades = 54 and Clubs = 54.

Intuition

The first step in decomposing a problem statement and evaluating whether Map Reduce paradigm can be applied or not, is to determine what (if any) steps can be parallelized. (Map Reduce is only suitable when a big chunk of data processing can be parallelized)

Considering the above example, since "each" playing card is independent and has no dependency on any "other" playing card, the summation activity can happen in parallel. (Note: it is not important whether the sum needs to be taken per suit or for the entire deck)

Lets imagine we have to solve this problem "manually" and we have 3 people who can do the task. How will we go about it ?

The first step is to randomly divide 52 playing cards into 3 piles. (As there are 3 people. This is the split part in Map Reduce.
Each person will independently calculate sum of all the cards in his/her pile per suit. This is the Map function in Map Reduce and multiple copies of the same function will be executed in parallel.
Once all 3 people are finished, there will be a co-ordination activity where sums per suit of all 3 people are combined together to get ultimate answer. This is the Reduce function in Map Reduce.

Map Function

Definition

Input

For our specific problem statement, in whatever form the input might be, if we transform the input into a csv file with one record per line in the format: <suit>,<card_value> then we can use the csv file as an input to the map function. In this case the key: k1 will be one of 4 options: Spades or Diamonds or Hearts or Clubs and the value: v1 will be the actual card value. For example we can split the 52 playing cards and one of the splits can look like this:

Processing

In this example, the work, the map function will do, is do a summation of the numeric cards per suit that it receives. It will acheive this by first discarding inputs cards with value Jack, Queen, King or Ace and then adding values of remaining numeric cards. For example if the map function received the input represented above, then after summation of numeric cards per suit, it will store the following as an intermediate state:

Intermediate State in Map Task

Output

As per definition the output of map function is list(k2,v2). Therefore in the example, it will be a list of all suits with intermediate summation of the cards it has received.

Output of Map Task

Reduce Function

Defintion

Input

The input to Reduce function is the output of all Map function executions combined together by key: k2. Therefore value: v2 will be a list whose size will be equal to number of map tasks.

Processing

The goal of our program is to do summation per suit. The "per suit" aspect is already handled by key: k2. Therefore all the reduce function has to do is go through the list and perform summation.

Output

The reduce function will output the summation value.

Summary

Map Reduce paradigm takes advantage of clusters to machines to perform analysis of large datasets. The steps of analysis which can be executed in parallel are done so using the Map function. Subsequently the Reduce function combines results of multiple executions of the Map function to provide intended output.

Explaining Map Reduce

Explaining Map Reduce

Problem Statement

Intuition

Map Function

Definition

Input

Processing

Output

Reduce Function

Defintion

Input

Processing

Output

Summary

Read next

Apache Airflow

HDFS Java API

Stream Processing