Hive Overview


Apache Hive is a framework for data warehousing on top of Hadoop. Hive was created at Facebook and grew from a need to manage and learn from the huge volumes of data that Facebook was generating every day. Facebook chose to build it on top of Hadoop for storage and processing because Hadoop was cost effective and met its scalability requirements.

SQL Skills

Hive was created to make it possible for analysts with strong SQL skills (but average Java skills) to run queries over the huge volumes of data stored in HDFS.

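A minimal sketch of what such a query might look like; the `page_views` table and its columns are hypothetical:

```sql
-- Hypothetical table of web traffic, stored as files in HDFS.
-- An analyst only needs SQL; Hive handles the distributed execution.
SELECT country, COUNT(*) AS views
FROM page_views
WHERE view_date = '2014-01-01'
GROUP BY country
ORDER BY views DESC
LIMIT 10;
```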

Pros and Cons of using SQL


  • Good for many analyses
  • Advantage of being very well known
  • SQL is the standard in the BI (Business Intelligence) toolkit. Therefore Hive is well positioned to be integrated with BI tools.


SQL isn’t ideal for every problem, however. For example, it is not suitable at all for expressing machine learning algorithms.

Hive Architecture

Hive Components

  • Hadoop Core Components (HDFS and Map Reduce)
  • Metastore
  • Driver
  • Hive Clients

Under the hood

When we load data into a Hive table, the data is actually stored in HDFS. When we run a HiveQL query, Hive internally converts the query into a MapReduce job and executes that job on the data stored in HDFS.
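A sketch of that flow; the table name and file path below are made up for illustration:

```sql
-- Define a table; its data will live as files under Hive's
-- warehouse directory in HDFS.
CREATE TABLE logs (ts STRING, level STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Loading is essentially a file copy/move into HDFS.
LOAD DATA LOCAL INPATH '/tmp/sample.log' INTO TABLE logs;

-- EXPLAIN shows the execution plan (the MapReduce stages)
-- that Hive generates for a query.
EXPLAIN SELECT level, COUNT(*) FROM logs GROUP BY level;
```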


Metastore

As the name suggests, the metastore stores metadata about Hive tables: columns, partitions, locations and so on. The metastore itself is usually kept in an RDBMS. In the standalone version of Hive it is stored in an embedded Derby database; in production a MySQL database is commonly used.
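A sketch of the relevant `hive-site.xml` settings for a MySQL-backed metastore; the host, database name and credentials are placeholders:

```xml
<!-- hive-site.xml: point the metastore at a MySQL database.
     Host, database name and credentials are placeholders. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
```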


Driver

The driver is the component that parses the query, performs semantic analysis on the different query blocks and expressions, and eventually generates an execution plan with the help of table and partition metadata from the metastore.

Hive Clients

There are two different types of Hive Clients.

  • Command Line Interface: Hive Shell, Beeline
  • Web Interface: Hue, Ambari etc...

Clients connect to the metastore database for metadata, as well as to HDFS for the actual data.
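For illustration, here is roughly how the two command line clients are invoked; the HiveServer2 host, port and username are placeholders:

```shell
# Hive shell (the original CLI; talks to the metastore and HDFS directly)
hive -e 'SHOW TABLES;'

# Beeline (connects to HiveServer2 over JDBC)
beeline -u jdbc:hive2://hiveserver:10000 -n analyst -e 'SHOW TABLES;'
```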

Comparison with RDBMS

Hive resembles a traditional database (RDBMS) in many ways; for example, it supports SQL.

It differs mainly because of its HDFS and MapReduce underpinnings. A number of architectural differences have influenced which features Hive supports. With each passing year, Hive looks and feels more like an RDBMS as limitations are removed.

Schema Enforcement: Read vs. Write Time

In an RDBMS, the schema is enforced at data load time: an insert into… statement rejects data that does not adhere to the schema. This design is called schema on write because the data is checked against the schema when it is written.

Hive does not verify data when it is loaded; the data is checked when a query is issued. This design is called schema on read.

Trade-Offs: Schema on Read

The initial load is very fast, since the data does not have to be read, parsed and stored in the database's internal format; the load operation is just a file copy. It is also flexible: we can have two schemas for the same underlying data.
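The "two schemas over the same data" idea can be sketched with external tables; the directory and column names here are hypothetical:

```sql
-- Two external tables over the same HDFS directory: the data is
-- not copied or validated at creation, only described differently.
CREATE EXTERNAL TABLE events_raw (line STRING)
LOCATION '/data/events';

CREATE EXTERNAL TABLE events_parsed (ts STRING, user_id STRING, action STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/events';
```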

Trade-Offs: Schema on Write

Query-time performance is faster, because the system can index columns and compress the data. The drawbacks are that it takes longer to load data into the database, and it cannot handle scenarios where the schema is not known at load time; in those cases there are no indexes to apply either, because the queries have not been formulated yet.

Database Core Features in Hive

Updates, transactions and indexes are core features of traditional database management systems, but until recently they were not considered for Hive. Hive was built to operate over HDFS data using MapReduce, and philosophically HDFS and MapReduce are data analysis tools: whatever data transformation is required results in new files in HDFS rather than in-place changes.

Updates and Transactions

Hive has long supported inserting new rows in bulk. It is now also possible to update and delete rows. Since HDFS does not support in-place file updates, changes resulting from inserts, updates and deletes are stored in separate "delta files". Delta files are periodically merged into the base table files by MapReduce jobs that the metastore runs in the background. These features work in the context of transactions, so tables that use them must have transactions enabled. Queries reading such a table are guaranteed a consistent snapshot.
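A sketch of a transactional table; the table and column names are made up, and note that transactional tables must be stored as ORC (and, in older Hive versions, bucketed):

```sql
-- Transactions are enabled per table via a table property.
CREATE TABLE accounts (id INT, balance DECIMAL(10,2))
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level changes are written as delta files under the hood.
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
DELETE FROM accounts WHERE id = 2;
```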


Indexes

Hive indexes can speed up query performance. There are two types of index in Hive: compact and bitmap.

  • Compact Indexes: Store the HDFS block numbers for each value, rather than each file offset. Compact indexes don't take up much disk space and are effective, especially where values are clustered together in nearby rows.
  • Bitmap Indexes: Use compressed bitsets to efficiently store the rows that a particular value appears in. They are usually appropriate for low-cardinality columns such as country, blood group, gender etc...
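For illustration, the index DDL looks roughly like this (the `users` table and column names are hypothetical; note that index support was later removed in Hive 3.0):

```sql
-- Compact index on a column whose values tend to cluster in nearby rows.
CREATE INDEX idx_country ON TABLE users (country)
AS 'COMPACT' WITH DEFERRED REBUILD;

-- Bitmap index on a low-cardinality column.
CREATE INDEX idx_gender ON TABLE users (gender)
AS 'BITMAP' WITH DEFERRED REBUILD;

-- DEFERRED REBUILD means the index is only populated when rebuilt:
ALTER INDEX idx_country ON users REBUILD;
```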