Introduction to Hadoop
What is Hadoop?
Hadoop is a collection of open source software utilities that make it easier to solve big data problems using a network of many computers. Hadoop comprises a framework for distributed storage, the Hadoop Distributed File System (HDFS), and a framework for distributed processing and analysis, MapReduce.
Design Principle
Hadoop Core
The Hadoop Distributed File System (HDFS) splits files into large blocks and distributes them across the nodes of a cluster. The MapReduce programming model then ships packaged code to those nodes so that each node processes, in parallel, the data it already holds. The guiding design principle is to move computation to the data rather than move data to the computation.
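To illustrate the storage side, here is a minimal sketch that copies a local file into HDFS using Hadoop's Java FileSystem API. The class name and file paths are hypothetical; the point is that block splitting and replication happen inside HDFS, with no extra work in the client code.

    // Minimal sketch: writing a file into HDFS via the Java FileSystem API.
    // Paths are hypothetical; HDFS handles block splitting and replication.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPutExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // handle to the configured file system
            Path local  = new Path("/tmp/weblog.txt");  // hypothetical local file
            Path remote = new Path("/data/weblog.txt"); // destination inside HDFS
            fs.copyFromLocalFile(local, remote);        // HDFS splits into blocks and replicates them
            System.out.println("Default block size: "
                    + fs.getDefaultBlockSize(remote) + " bytes");
            fs.close();
        }
    }

The block size is configurable (commonly 64 MB or 128 MB depending on the Hadoop version), which is why HDFS blocks are described as "large" compared to ordinary file system blocks.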
Comparison of Hadoop with other systems
Hadoop is not the first distributed system for data storage and analysis, but certain properties set it apart and have made it successful where other systems have struggled. In this section we compare Hadoop with those systems.
Relational Database Management System
We will start with a simple question: why can't we use databases with lots of disks to do large-scale data analysis? To answer this question we need to look at a trend in disk technology: transfer rates are improving much faster than seek times.
In a Relational Database Management System, the data access pattern is dominated by seek time rather than transfer rate, because random reads and writes are more common than sequential access. For big data analysis, therefore, relational database systems, whose performance is bottlenecked by seek time, are not a good fit.
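A back-of-envelope calculation makes the point. The figures below (10 ms average seek, 100 MB/s sequential transfer, 1 TB of data in 100-byte records) are illustrative assumptions, not measurements, but the contrast they produce is dramatic.

    // Illustrative (assumed) drive characteristics: not measurements.
    public class SeekVsTransfer {
        public static void main(String[] args) {
            double datasetBytes = 1e12;   // 1 TB to analyse
            double recordBytes  = 100;    // assumed record size
            double transferBps  = 100e6;  // 100 MB/s sequential transfer
            double seekSeconds  = 0.010;  // 10 ms average seek

            double sequentialHours = datasetBytes / transferBps / 3600;
            double randomYears = (datasetBytes / recordBytes) * seekSeconds
                                 / (3600 * 24 * 365);

            System.out.printf("Streaming the whole dataset : %.1f hours%n", sequentialHours); // ~2.8 hours
            System.out.printf("One seek per record         : %.1f years%n", randomYears);     // ~3.2 years
        }
    }

Streaming the entire dataset takes a few hours, while seeking to every record individually would take years. This is why Hadoop is built around whole-dataset, sequential reads.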
Comparing Hadoop and RDBMS
In fact, Hadoop and RDBMSs are complementary technologies. An RDBMS works well for random queries or updates that touch a relatively small portion of the dataset, whereas Hadoop works well for problems that need to analyse the whole dataset in batch fashion. Hadoop suits datasets that are written once and read many times; an RDBMS suits datasets that are continually updated. Hadoop can work on semi-structured or unstructured data, while an RDBMS insists on highly structured (relational) data. Finally, Hadoop scales linearly with the size of the data; an RDBMS does not.
Grid Computing
High-Performance Computing (HPC) and grid computing have been doing large-scale data processing for years. Broadly, the approach is to distribute work across a cluster of machines that access a shared file system hosted by a Storage Area Network (SAN). Grid computing works best for computationally intensive jobs.
Examples of problems where grid computing works best are remote sensing and number crunching, whose throughput is measured in petaflops (10^15 floating-point operations per second).
Grid computing does not work well when large volumes of data need to be accessed: network bandwidth becomes the bottleneck and compute nodes sit idle waiting for data.
Comparing Hadoop and Grid Computing
Location of Data
In Hadoop the paradigm is to move code towards data. In grid computing the paradigm is to move data towards code.
Programming
The API exposed by grid computing is the Message Passing Interface (MPI). MPI gives programmers great control, but they have to manage the mechanics of data flow themselves, handle partial failures, and reason about the order of execution.
In Hadoop, programmers only have to think in terms of the data model (key-value pairs); the actual data flow is implicit and is managed by the framework, as the example below shows.
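The classic word-count example illustrates the contrast: the programmer supplies only a map and a reduce function over key-value pairs, and the framework handles partitioning, shuffling, sorting and retrying failed tasks. This is a minimal sketch against the standard org.apache.hadoop.mapreduce API, with the job-setup boilerplate omitted.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: emit (word, 1) for every word in the input line.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts for each word; the shuffle that groups
    // values by key is performed by the framework, not by this code.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

Nothing in this code says where the input blocks live, how intermediate pairs travel between nodes, or what happens when a node fails; all of that is the framework's responsibility.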
History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, and has its origins in Apache Nutch, an open source web search engine. The core components of Hadoop, HDFS and MapReduce, are based on two papers published by Google describing GFS and MapReduce respectively. Much of Hadoop's early development was backed by Yahoo!, and the project now lives under the Apache Software Foundation. Today companies large and small use Hadoop for their big data problems.