Introduction to Hadoop


What is Hadoop?

Hadoop is a collection of open-source software utilities that facilitate solving big data problems using a network of computers. It comprises a software framework for distributed storage (HDFS) and a software framework for query and analysis (MapReduce).

Design Principle

💡
Hardware failures are common occurrences and are automatically handled by the framework.

Hadoop Core

The Hadoop Distributed File System (HDFS) splits files into large blocks and distributes them across the nodes of a cluster. The MapReduce programming model then transfers packaged code to those nodes to process the data in parallel. The design paradigm is to move code near data rather than move data near code.

💡
The design paradigm is to move code near data rather than move data near code.
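To make the programming model concrete, here is a minimal sketch of the classic word-count job written against Hadoop's MapReduce Java API. The class and field names are illustrative; the mapper runs on the nodes that hold the HDFS blocks (code moved to the data) and the reducer aggregates the intermediate counts.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Mapper: runs on the node holding the HDFS block and emits a
  // (word, 1) pair for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: receives all the counts emitted for a given word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
```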

Comparison of Hadoop with other systems

Hadoop is not the first distributed system for data storage and analysis, but it has certain unique properties that made it successful compared to other systems. In this section we will compare Hadoop to those systems.


Relational Database Management System

We will start by asking a very simple question: why can't we use databases with lots of disks to do large-scale data analysis? To answer this question we need to understand a trend in disk drives: transfer rates are improving faster than seek times.

Transfer Rate: The rate at which data can be read from or written to the disk (disk bandwidth).
Seek Time: The time taken to move the disk's head to a particular place on the disk to read or write data.

In a Relational Database Management System, the data access pattern is dominated by seek time rather than transfer rate (random reads and writes are more common than sequential access). Therefore, for big data analysis, relational database systems, with their performance bottleneck in seek time, are not a good solution.
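A back-of-the-envelope calculation illustrates why. The figures below (a 1 TB dataset, a 100 MB/s transfer rate, a 10 ms average seek time, 100 KB records) are assumptions chosen purely for illustration; real drives vary.

```java
// Rough comparison of a sequential scan versus seek-dominated access,
// using illustrative numbers (not measurements of any particular drive).
public class DiskAccessEstimate {
  public static void main(String[] args) {
    double datasetMb = 1_000_000;   // 1 TB dataset
    double transferMbPerSec = 100;  // sequential transfer rate
    double seekSec = 0.010;         // average seek time
    double recordMb = 0.1;          // 100 KB records read at random

    // Streaming the whole dataset sequentially (the MapReduce pattern).
    double sequentialSec = datasetMb / transferMbPerSec;

    // Seeking to each record individually (the random-access pattern).
    double randomSec =
        (datasetMb / recordMb) * (seekSec + recordMb / transferMbPerSec);

    System.out.printf("Sequential scan: %.0f s (~%.1f hours)%n",
        sequentialSec, sequentialSec / 3600);
    System.out.printf("Seek per record: %.0f s (~%.1f hours)%n",
        randomSec, randomSec / 3600);
  }
}
```

With these assumed numbers the sequential scan finishes in roughly 3 hours, while the seek-dominated pattern takes on the order of 30 hours, which is why seek time becomes the bottleneck for whole-dataset analysis.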

Comparing Hadoop and RDBMS

In fact, Hadoop and RDBMS are complementary technologies:

An RDBMS works well for random queries or updates performed on a relatively small part of the dataset; Hadoop works well for problems that need to analyze the whole dataset in a batch fashion.
Hadoop works well for datasets that are written once and read many times; an RDBMS works well for datasets that are continually updated.
Hadoop can work on semi-structured or unstructured data; an RDBMS insists on highly structured (relational) data.
Hadoop scales linearly with the size of the data; an RDBMS does not.


Grid Computing

High-Performance Computing (HPC) has been doing large-scale data processing for years. Broadly, the HPC approach is to distribute work across a cluster of machines that share access to a file system hosted by a Storage Area Network (SAN). Grid computing works best for computationally intensive jobs.

💡
Grid computing works best for computationally intensive jobs: it moves data to the code.

Examples of problems where grid computing works best are remote sensing and number crunching (with performance measured in petaflops, i.e. quadrillions of floating-point operations per second).

Grid computing does not work well when large amounts of data need to be accessed: network bandwidth becomes the bottleneck and compute nodes sit idle.

💡
Grid computing does not work well when large amounts of data need to be accessed.

Comparing Hadoop and Grid Computing

Location of Data

In Hadoop the paradigm is to move code towards data. In grid computing the paradigm is to move data towards code.

Programming

The API exposed by grid computing is the Message Passing Interface (MPI). MPI gives programmers great control, but they have to manage the mechanics of data flow themselves, handle partial failures, worry about the order of execution, and so on.

In Hadoop, programmers only have to think in terms of the data model; the actual data flow is handled implicitly by the framework.
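For comparison, a minimal job driver for the word-count sketch shown earlier might look like the following. Note that there is no explicit data-flow, scheduling, or failure-handling code: the framework schedules map tasks near the HDFS blocks, shuffles intermediate pairs to the reducers, and re-runs failed tasks. The input and output paths are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    // Describe the job: which mapper, reducer, and output types to use.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Illustrative HDFS input and output directories.
    FileInputFormat.addInputPath(job, new Path("/data/input"));
    FileOutputFormat.setOutputPath(job, new Path("/data/output"));

    // Submit the job and wait; the framework handles the rest.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```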


History of Hadoop

Hadoop was created by Doug Cutting (the creator of Apache Lucene). Hadoop has its origins in the Apache Nutch project, an open-source web search engine. The core components of Hadoop, HDFS and MapReduce, are based on two papers published by Google describing GFS and MapReduce respectively. Hadoop was initially developed under the aegis of Yahoo! but later moved under Apache. Today many large and small companies use Hadoop for their big data problems.