Paging is a very common use-case for web sites and many other applications. In relational databases, this is easily implemented with LIMIT and OFFSET, or by selecting the row number in the query and adding conditionals based on it’s value. HBase 0.19.x, on the other hand, does not provide any queries or filters that support paging directly. After a quick example using SQL, I will show how to implement the same functionality in HBase.
Let’s assume that we have a large number of users. Each user has performed a number of actions. Each action has a unique identifier, a timestamp, and a name.
This is how you might get the third page of an individual users’ actions using SQL:
This utilizes secondary indexes on both userid and stamp, meaning to accomplish this query you need at least three indexes on this table as id is the primary key. Though a simple query to write, you will run into problems as the actions table grows to millions of rows and beyond. Insertions would look like:
HBase has no real indexes. Rows are stored in sorted order, and columns in a family are sorted. For more information, read the page on the .
Very conscious of the primary queries we will run on user-actions, we will design an HBase table to support paging queries on per-user, time-ordered lists of actions.
We will use the Java Client API for HBase, specifically the HTable class. What we are looking for are two methods:
Please note, I am using a custom object, Action, for simplicity. It is a client-side holder for the four action fields (id, userid, name, stamp).
There are a number of ways to store your data in HBase that will allow the getUserActions query, but in this case we will go with a very tall table design (lots of rows with few columns in them) rather thanwide (lots of columns in each row). Specifically, the difference here would be whether you have a row-per-action or a row-per-user. We will do a row-per-action, but will be designing our row key (the primary key) to be a composite key to allow for grouping and sorting of actions, rather than just the action id. This means we will not have random-access to an action by it’s id, so rather than defining this as the actions table (which might also exist if you needed actionid random access) we will define it as the useractions table, and we will only store a single column in a single family,content:name.
The row key that we will use in our HBase useractions table is:
It’s very important that each of these fields is fixed-length and binary so that the lexicographical/ascending byte-ordering of HBase will properly sort our rows.
The userid field will be a 4 byte, big endian integer. reverse_order_stamp is an 8 byte, big endian long with a value of (Long.MAX_VALUE - epoch). This is so the most recent stamp is at the top rather than the bottom. actionid is another 4 byte, big endian integer. Thankfully, HBase provides helpful utilties in the org.apache.hadoop.hbase.util.Bytes class to deal with this (unfortunately it lacked some key features in 0.19, so the code below makes use of the Bytes class available in 0.20/TRUNK). Before we get into HBase code, let’s define the helper methods makeActionRow and readActionRow to deal with the composite key:
Now that we can deal with the composite keys, insertion is very straightforward:
We just serialize the fields into the composite row, and write the single column to HBase in a BatchUpdate. Reading will deal with unserializing the fields and Scanners. In addition to matching for the content:name column, we will also specify a startRow and stopRow so that the Scanner only returns results from the user we are looking at. This way we do not have to worry about jumping to the next user in our code, the Scanner will just stop.
The storage of your data must be tied to how you need to query it. Without a sophisticated query engine or indexing capabilities, you must design to take advantage of sorted rows and columns, potentially designing a table per query type. Denormalization is okay!
In my next posts, I will show more interesting ways to use HBase for persisted dictionary/keyval/Object storage and directly address secondary indexing with HBase.
Disclaimer: The code in these examples is designed to illustrate the practical use of HBase. While the design is sound, the code itself may not optimized for performance.