HBase (NoSQL) basics in the Hadoop Ecosystem
Relational databases (RDBMS) are very good for analytical queries. But consider a very simple query like “what are the last orders of this customer?”. On a typical relational database, there will be a customer table and an order table, which need to be joined on a particular key to get the relevant result. With larger volumes of data, this operation is not quick in a relational database.
HBase is built on top of HDFS to improve query performance on the big data that resides in a Hadoop cluster, and it can also be exposed to any website or API. Random access to planet-size data, as they say in the book HBase: The Definitive Guide. In real-life scenarios, HBase is a system that does GET/PUT for a given key on horizontally partitioned data. To answer a very complex query, Hive or Spark can be used. But for high scale and a very high transaction rate, you can use HBase to handle that part of the problem.
As you can see in this architecture, which is very simple to understand, the client sends a request with the key, and the relevant result is found in one of the shards. For example, if the data is for a website, export it to a NoSQL store and retrieve it at any time by providing the key. So, is it just a relational database like MySQL? And what is HBase then? Let’s find out.
A non-relational, scalable database is essentially just an API: you provide a key and GET the values for it, or you store (PUT) values for that key. HBase is based on Google’s Bigtable. Let’s find out what operations it can perform:
CRUD:
1. Create
2. Read
3. Update
4. Delete
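To make the four operations concrete, here is a minimal sketch of a key-based store in plain Python. It is only an in-memory stand-in: the class name TinyKeyValueStore and its methods are hypothetical and are not part of the HBase API.

```python
class TinyKeyValueStore:
    """Hypothetical in-memory stand-in for a key-based store (not the HBase API)."""

    def __init__(self):
        self._rows = {}

    def put(self, key, values):
        # Create or update: merge the new columns into the row for this key.
        self._rows.setdefault(key, {}).update(values)

    def get(self, key):
        # Read: return a copy of the row, or None if the key is unknown.
        row = self._rows.get(key)
        return dict(row) if row is not None else None

    def delete(self, key):
        # Delete: remove the whole row and report whether it existed.
        return self._rows.pop(key, None) is not None


store = TinyKeyValueStore()
store.put("customer-42", {"name": "Asha", "city": "Pune"})  # create
store.put("customer-42", {"city": "Mumbai"})                # update
print(store.get("customer-42"))                             # read
store.delete("customer-42")                                 # delete
print(store.get("customer-42"))
```

Notice that every call starts from a key. There is no way to ask "which customers live in Mumbai?" without scanning, which is the trade-off a key-based API makes.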
There is no query language; it’s just the API. Let’s find out what the architecture looks like:
Region Server: these are a cluster of machines, automatically kept in sync, that keep track of the keys (information) and do the reads and writes on the HDFS file system.
HDFS: Hadoop’s distributed file system is the underlying storage, which holds the larger files that these Region Servers manage.
HMaster: like a teacher with an attendance sheet, HMaster keeps track of each Region Server’s information and metadata. It assigns regions (ranges of keys) to Region Servers, so that a request can be directed to the right Region Server based on the requested key.
ZooKeeper: someone has to watch over the teacher (HMaster). If the HMaster ever goes down, ZooKeeper helps hand control over to another healthy HMaster instance.
The client sends a request with a key to get the value for it from HBase. Based on the region assignments that the HMaster keeps track of, the request is routed to the region server that holds that key, and the region server works with HDFS to provide the information back to the client through an API response.
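The routing step can be pictured as a lookup over sorted key ranges. Below is a minimal sketch under stated assumptions: the start keys and the server names (rs1, rs2, rs3) are made up, and plain string comparison stands in for the byte-wise ordering HBase uses for row keys.

```python
from bisect import bisect_right

# Hypothetical region layout: each region starts at a row key and runs
# up to (but not including) the next start key. "" means "from the beginning".
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]


def locate_region_server(row_key):
    # Find the last region whose start key is <= row_key.
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_SERVERS[index]


for key in ["alice", "mike", "zed"]:
    print(key, "->", locate_region_server(key))
```

Because the start keys are sorted, the lookup is a binary search, so finding the right server stays cheap even with many regions.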
HBase Data Model
Provided below is an example of an HBase data model for a customer and product dataset.
HBase provides fast access to a given row.
A row is referenced by a unique key (Customer ID).
Each row has a number of column families (Customers/Products).
Each column family contains a set of columns (for the Customers column family, the columns are Customer Name, City and Country). A column family can hold a large number of columns.
Each cell has versions, differentiated by a timestamp based on the modified time.
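The points above can be sketched with nested Python dictionaries: row key, then column family, then column, then timestamp. This is only an illustration of the shape of the data, not how HBase stores it; the helper names and the sample values are assumptions.

```python
# row key -> column family -> column -> {timestamp: value}
table = {}


def put_cell(row_key, family, column, value, timestamp):
    # Each write adds a new version of the cell under its timestamp.
    cell = table.setdefault(row_key, {}).setdefault(family, {}).setdefault(column, {})
    cell[timestamp] = value


def get_latest(row_key, family, column):
    # The newest version is the one with the highest timestamp.
    versions = table[row_key][family][column]
    return versions[max(versions)]


put_cell("cust-1", "customers", "city", "Pune", timestamp=100)
put_cell("cust-1", "customers", "city", "Mumbai", timestamp=200)
put_cell("cust-1", "products", "product_name", "Laptop", timestamp=150)
print(get_latest("cust-1", "customers", "city"))
```

A row only carries the columns that were actually written to it, which is why this model suits sparse data.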
Sharing an example from Google’s Bigtable paper:
Ways to access HBase:
HBase shell: an interactive command line to perform certain operations.
Java API: HBase is written in Java, and there are wrappers for Python, Scala and others. You can perform HBase operations from these programming languages.
Spark/Hive/Pig: connectors are available to access HBase, create or transform certain views, and store them back in HBase.
REST service: you can communicate with HBase through a RESTful interface using HTTP requests.
Protocol buffers, Thrift and Avro services: to store things very quickly using these frameworks.
Jump to Practicals :
We are going to create an HBase table using Python via REST. You may follow these steps to get some quick hands-on experience. Use case: create an HBase table for movie ratings by customers, then query it to look at all the movie ratings of a specific customer. This will be a good example of sparse data.
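Before touching HBase, it can help to see why this data is sparse. The sketch below groups a few rating lines by user, using the four-field layout (userID, movieID, rating, timestamp) that the code later in this article splits each line into. The sample lines are made-up values, and build_rows is a hypothetical helper.

```python
from collections import defaultdict

# Made-up sample lines in the u.data layout: userID, movieID, rating, timestamp.
SAMPLE_LINES = [
    "1\t50\t5\t881250949",
    "1\t172\t4\t881250949",
    "2\t50\t3\t891717742",
]


def build_rows(lines):
    # One row per user; one column per movie that user actually rated.
    rows = defaultdict(dict)
    for line in lines:
        user_id, movie_id, rating, _timestamp = line.split()
        rows[user_id][movie_id] = rating
    return dict(rows)


rows = build_rows(SAMPLE_LINES)
print(rows["1"])
print(rows["2"])
```

Each user row holds only the movies that user rated, so no space is spent on the many movies a user never saw.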
Environment: Hortonworks Sandbox with HDP (for installation instructions: https://www.cloudera.com/tutorials.html)
The HBase REST service runs on port 8000 on my local machine (127.0.0.1) once I have started it.
To set up port forwarding in the virtual machine: right-click on the sandbox -> Network -> Port Forwarding -> add ‘HBase REST’ with port 8000.
>> /usr/hdp/current/hbase-master/bin/hbase-daemon.sh start rest -p 8000 --infoport 8001
Data file: I have kept the u.data file, which contains the customers' movie ratings, on my local machine, for example: C://Users//vijay//u.data
IDE: IntelliJ with pip installed.
Package: the starbase package for the HBase REST API in Python, installed with:
>> pip install starbase
Code :
from starbase import Connection
#create a connection for HBase REST API
c = Connection("127.0.0.1", "8000")
#Creating a table instance
ratings = c.table('ratings')
#check whether a table with the same name exists, and drop it if so.
if ratings.exists():
    print("Dropping existing ratings table\n")
    ratings.drop()
#Create the table schema (column family)
ratings.create('rating')
print("Read the input data\n")
ratingFile = open("C://Users//u.data", "r")
#create a batch, so multiple records are sent together instead of one at a time
batch = ratings.batch()
#store the data with 'rating' as the column family,
#movieID as the column and the rating as the cell value.
for line in ratingFile:
    (userID, movieID, rating, timestamp) = line.split()
    batch.update(userID, {'rating': {movieID: rating}})
ratingFile.close()
print("Committing ratings data to HBase via REST service\n")
batch.commit(finalize=True)
print ("Get back ratings for some users...\n")
print ("Ratings for user ID 1:\n")
print (ratings.fetch("1"))
#Drops the table instance.
ratings.drop()
Result :
Conclusion:
As we discussed earlier, HBase performs CRUD operations based on a key. Here we took userID as the key and retrieved the column family and its columns to find the movie ratings. We created the table, updated the records, read them back by providing a key, and printed them. I hope this article is useful for newcomers.
References for the code & data file :