
HBase Overview

Introduction

HBase is a distributed, column-oriented database built on top of HDFS. Although HBase uses terms such as "table", "row", and "column", it should not be compared with a Relational Database Management System (RDBMS) like Oracle or MySQL.

It is easiest to think of HBase as a persistent hashtable or map data structure. HBase is used when real-time, random read/write access to a very large dataset is required.

Indexed By

Row Key
Column Key
Timestamp

Thinking about HBase

The HBase data model is a sparse, distributed, persistent, multi-dimensional, sorted map.

Map

In HBase, a "table" is a "hashtable" or a "map": an HBase table is made up of keys and values.

Sparse

Not all values need to be present (there can be holes).

Distributed

One "table" is usually spread across several nodes of an HDFS (Hadoop) cluster.

Persistent

Data is persisted (i.e. it is not an in-memory data structure)

Multi-Dimensional

In an HBase table, the value itself can be a map or hashtable.

Sorted

Rows are stored in increasing order of row key, compared byte by byte. This makes it easy to fetch multiple adjacent rows in a region.
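The properties above can be pictured with plain nested dictionaries. The following is only a toy sketch, not the HBase API: the row keys, family and qualifier names, timestamps, and the helpers rows_in_key_order and lookup are all made up for illustration.

```python
# Toy model of an HBase-style table as nested maps (not the HBase API).
# Layout: row key -> column family -> column qualifier -> timestamp -> value.
# Every key and value below is a made-up example.
table = {
    b"row-2": {b"info": {b"name": {1000: b"Bob"}}},
    b"row-10": {b"info": {b"name": {1000: b"Carol"}}},
    b"row-1": {
        b"info": {
            b"name": {1000: b"Alice"},
            b"email": {1000: b"alice@example.com"},
        }
    },
}


def rows_in_key_order(tbl):
    """Return row keys in increasing byte order, the order a scan would see."""
    return sorted(tbl)


def lookup(tbl, row, family, qualifier, timestamp):
    """Follow the nested maps; return None where the table has a hole."""
    versions = tbl.get(row, {}).get(family, {}).get(qualifier, {})
    return versions.get(timestamp)


print(rows_in_key_order(table))
print(lookup(table, b"row-1", b"info", b"email", 1000))
print(lookup(table, b"row-2", b"info", b"email", 1000))  # a hole: nothing stored
```

Because keys are compared as bytes, b"row-10" sorts before b"row-2". This is why numeric parts of row keys are often zero-padded.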

Namespaces

A namespace is a logical grouping of tables, analogous to a database in an RDBMS. HBase has planned a set of features around namespaces:

Quota Management: Restrict the amount of resources (regions, tables) a namespace can consume
Namespace Security: Allow or deny access to a namespace for certain users
Region Server Groups: Pin a namespace to a subset of region servers to provide coarse-grained isolation

Predefined Namespaces

There are two predefined namespaces.

hbase - system namespace, used to contain internal HBase tables
default - tables without explicit namespace will fall into this namespace
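A table in a namespace is written as namespace:table, and a name without a namespace part belongs to default. The sketch below shows that rule; the function name split_table_name is hypothetical and is not part of any HBase client.

```python
DEFAULT_NAMESPACE = "default"


def split_table_name(name):
    """Split 'namespace:table' into (namespace, table).

    A name with no ':' is assumed to live in the default namespace.
    Illustrative helper only; not an HBase client function.
    """
    namespace, sep, qualifier = name.partition(":")
    if not sep:
        return DEFAULT_NAMESPACE, name
    return namespace, qualifier


print(split_table_name("users"))
print(split_table_name("hbase:meta"))
```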

Definitions

Table

HBase organizes data into tables. Table names are strings made up of characters that are safe to store in a file system.

Row

Within a table, data is stored according to its row. A row is uniquely identified by its row key. A row key has no data type and is always treated as a byte[].

Column Family

Column Family is a concept unique to HBase. Column family names are strings made up of characters that are safe to store in a file system. Column families affect the physical arrangement of data, so they have to be defined up front and cannot be easily modified. Every row has the same set of column families, although a row need not have data in all of them.

Column Qualifier (Column)

Data within a column family is identified by a column, or column qualifier. Column qualifiers need not be listed in advance and need not be consistent across rows: not every row has to have every column qualifier within a column family. Like row keys, column qualifiers have no data type and are treated as byte[].

Cell

A cell is identified by a unique combination of row key, column family, and column qualifier. The data stored in a cell is that cell’s “value”. A cell’s value has no data type; it is treated as a byte[].

Versioning (Timestamp)

Values within cells are versioned. Each version is identified by a version number, which by default is the timestamp at which the cell was written. The number of versions HBase keeps is configurable per column family (the default was 3 in older releases and has been 1 since HBase 0.96).
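Versioning can be sketched as a small map from timestamp to value that keeps only the newest few entries. This is a toy model under stated assumptions: the class VersionedCell and its methods are hypothetical, and it prunes old versions eagerly on every write for simplicity, which is not necessarily when HBase itself discards them.

```python
import time


class VersionedCell:
    """Toy cell that keeps at most max_versions timestamped values."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self._versions = {}  # timestamp (ms) -> value

    def put(self, value, timestamp=None):
        # Default version number: current time in milliseconds.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        self._versions[timestamp] = value
        # Drop everything older than the newest max_versions entries.
        for old in sorted(self._versions, reverse=True)[self.max_versions:]:
            del self._versions[old]

    def get(self, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        return sorted(self._versions.items(), reverse=True)[:versions]


cell = VersionedCell(max_versions=2)
cell.put(b"v1", timestamp=100)
cell.put(b"v2", timestamp=200)
cell.put(b"v3", timestamp=300)
print(cell.get())            # newest version only
print(cell.get(versions=5))  # at most max_versions survive
```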

Data Model Operations

HBase supports four data model operations.

Put
Get
Scan
Delete

Put

Adds a new row to a table if the row key is new, or updates an existing row if the row key already exists.

Get

Returns attributes for a specific row

Scan

Iterates over multiple rows for the specified attributes.

Delete

Removes a row (or specific cells of a row) from a table.
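The four operations can be illustrated with an in-memory stand-in. This is a sketch only: ToyTable and its method signatures are hypothetical and much simpler than the real client API. Columns are written in family:qualifier form, and the delete here removes a whole row.

```python
class ToyTable:
    """In-memory stand-in for a table; not the HBase client API."""

    def __init__(self):
        self._rows = {}  # row key (bytes) -> {column (bytes): value (bytes)}

    def put(self, row_key, columns):
        # Creates the row if the key is new, otherwise updates it in place.
        self._rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key, columns=None):
        # Returns the requested columns of one row (all columns if None).
        row = self._rows.get(row_key, {})
        if columns is None:
            return dict(row)
        return {c: v for c, v in row.items() if c in columns}

    def scan(self, start_row=b"", stop_row=None, columns=None):
        # Iterates rows in key order from start_row (inclusive)
        # up to stop_row (exclusive).
        for key in sorted(self._rows):
            if key < start_row:
                continue
            if stop_row is not None and key >= stop_row:
                break
            yield key, self.get(key, columns)

    def delete(self, row_key):
        # Removes the whole row; deleting a missing row is a no-op.
        self._rows.pop(row_key, None)


t = ToyTable()
t.put(b"u1", {b"info:name": b"Alice"})
t.put(b"u2", {b"info:name": b"Bob"})
t.put(b"u1", {b"info:email": b"alice@example.com"})  # updates existing row
print(t.get(b"u1"))
print(list(t.scan(start_row=b"u1", stop_row=b"u2")))
t.delete(b"u2")
print(t.get(b"u2"))
```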

NoSQL

HBase is a type of distributed NoSQL database (more accurately, a data store). It lacks RDBMS features such as typed columns, secondary indexes, and triggers.

Storage Mechanism

Physically, all members of a column family are stored together in the file system. Tuning and storage specifications are set at the column family level. It is therefore advised that all column qualifiers within a column family have the same size characteristics and the same access patterns. For example, photos (binary data) are stored in a different column family than personal information (text).
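As a rough picture of this grouping, the sketch below splits a list of cells into one sorted store per column family. The function group_by_family and the sample cells are invented for illustration; the actual on-disk format is far more involved.

```python
from collections import defaultdict


def group_by_family(cells):
    """Split (row, family, qualifier, value) cells into one sorted store per family."""
    stores = defaultdict(list)
    for row, family, qualifier, value in cells:
        stores[family].append((row, qualifier, value))
    # Within each family, keep entries ordered by row key, then qualifier.
    return {family: sorted(entries) for family, entries in stores.items()}


# Made-up sample data: text in "info", binary data in "photo".
cells = [
    (b"user-2", b"info", b"name", b"Bob"),
    (b"user-1", b"photo", b"avatar", b"<binary image bytes>"),
    (b"user-1", b"info", b"name", b"Alice"),
]
stores = group_by_family(cells)
for family, entries in stores.items():
    print(family, entries)
```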

Regions

Tables are partitioned automatically into regions. A region comprises a subset of a table’s rows and is uniquely identified by the table name, its first row (inclusive), and its last row (exclusive). A table starts with a single region. When a region grows past a configurable threshold, it is split at a row boundary into two regions of approximately equal size. As the table grows, so does the number of regions. Regions are distributed across the cluster, and that is how the HBase data model is distributed.
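Region splitting and row lookup can be sketched as follows. This is a toy model with hypothetical names (ToyRegion, ToyRegionedTable): the split threshold here counts rows, whereas the text above describes a threshold on how large a region has grown, and everything lives in one process instead of being spread over a cluster.

```python
import bisect


class ToyRegion:
    def __init__(self, start_key, end_key):
        self.start_key = start_key  # inclusive; b"" means start of table
        self.end_key = end_key      # exclusive; None means end of table
        self.rows = {}


class ToyRegionedTable:
    def __init__(self, split_threshold):
        self.split_threshold = split_threshold  # max rows per region (toy rule)
        self.regions = [ToyRegion(b"", None)]   # a table starts with one region

    def locate(self, row_key):
        """Index of the region whose [start_key, end_key) range holds row_key."""
        starts = [r.start_key for r in self.regions]
        return bisect.bisect_right(starts, row_key) - 1

    def put(self, row_key, value):
        idx = self.locate(row_key)
        region = self.regions[idx]
        region.rows[row_key] = value
        if len(region.rows) > self.split_threshold:
            self._split(idx)

    def _split(self, idx):
        # Split at the middle row key into two daughters of similar size.
        parent = self.regions[idx]
        keys = sorted(parent.rows)
        mid = keys[len(keys) // 2]
        left = ToyRegion(parent.start_key, mid)
        right = ToyRegion(mid, parent.end_key)
        for key, value in parent.rows.items():
            (left if key < mid else right).rows[key] = value
        self.regions[idx:idx + 1] = [left, right]


table = ToyRegionedTable(split_threshold=4)
for i in range(10):
    table.put(b"row-%d" % i, b"value")
boundaries = [(r.start_key, r.end_key) for r in table.regions]
print(boundaries)
```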

Atomicity

Atomicity is guaranteed only at a row level. There is no atomicity guarantee across rows, which means that there are no multi-row transactions.

HBase Architecture

HDFS is made up of clients, workers (Data Nodes), and a coordinating master (Name Node). Similarly, HBase is made up of an HBase Master node and one or more Region Server workers.

HBase comprises three major components:

HBase Master
Region Server
ZooKeeper

HBase Master

HBase Master is responsible for bootstrapping a new install. It also assigns regions to registered Region Servers, detects Region Server failures, and recovers from them. In addition, HBase Master handles DDL operations, such as creating and deleting tables, and other metadata operations.

Region Servers

Region Servers carry zero or more regions. They field client read and write requests and manage region splits. A Region Server manages a split by informing the HBase Master about the new daughter regions, so that the Master can take the parent region offline and assign the daughters that replace it.

Region Server Components

Handling user requests (reads and writes) is a complex task. A Region Server is therefore further divided into four components:

Block Cache: A read cache that holds recently read data
Memstore: A write cache that holds data not yet written to disk
WAL (Write Ahead Log): Attached to every Region Server, the WAL records edits that have not yet been committed to disk. It serves as a backup for the Memstore in case the Region Server crashes before the Memstore is flushed to an HFile
HFile: Stores the actual data persistently once it has been committed by the system
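How the four components cooperate can be sketched with a toy write and read path. Everything here is a simplifying assumption: ToyRegionServer and its methods are hypothetical, the flush threshold counts rows, each "HFile" is just a dictionary, and the block cache holds single rows instead of blocks.

```python
class ToyRegionServer:
    """Toy write/read path; all names and thresholds are illustrative."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.wal = []          # edits logged but not yet flushed
        self.memstore = {}     # write cache
        self.hfiles = []       # flushed, immutable files (oldest first)
        self.block_cache = {}  # read cache

    def put(self, row_key, value):
        self.wal.append((row_key, value))    # 1. log the edit first
        self.memstore[row_key] = value       # 2. then buffer it in memory
        self.block_cache.pop(row_key, None)  # drop any stale cached value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the memstore as a new sorted file; the logged edits
        # are now durable elsewhere, so the WAL can be cleared.
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore.clear()
        self.wal.clear()

    def get(self, row_key):
        if row_key in self.memstore:      # newest data lives here
            return self.memstore[row_key]
        if row_key in self.block_cache:   # then recently read data
            return self.block_cache[row_key]
        for hfile in reversed(self.hfiles):  # newest file first
            if row_key in hfile:
                self.block_cache[row_key] = hfile[row_key]
                return hfile[row_key]
        return None

    def crash(self):
        # Simulated crash: everything held only in memory is lost.
        self.memstore = {}
        self.block_cache = {}

    def recover(self):
        # Replay the WAL in order to rebuild the lost memstore.
        for row_key, value in self.wal:
            self.memstore[row_key] = value


rs = ToyRegionServer(flush_threshold=2)
rs.put(b"a", b"1")
rs.crash()
rs.recover()
print(rs.get(b"a"))  # recovered from the WAL
```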

ZooKeeper

ZooKeeper acts as a bridge for communication between the various HBase components. It is responsible for keeping track of all Region Servers and regions, and for monitoring which HBase Master and Region Server nodes are active and which have failed.