HBase Schema Highlights

13 Aug

0 When to use HBase

  • need to store large amounts of data (hundreds of TBs)
  • need high write throughput
  • need efficient random access (key lookups)
  • need to scale gracefully with data
  • have structured or semi-structured data
  • don’t need full RDBMS capabilities (cross-row/cross-table transactions, joins, etc.)

1 Every region is served by one and only one region server

2 An ideal cell size would probably be the size of a block, so 64KB including the keys.

NOTE: This gives a hint about how to organize your data across the (row, column, version) dimensions.
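To make "including the keys" concrete, here is a rough sketch that estimates how many bytes one cell takes once the row key, family, qualifier and timestamp are counted along with the value. The fixed overhead assumes the classic KeyValue layout (length fields, timestamp, key type); that layout and the helper names are assumptions for illustration, and real sizes vary with HBase version, encoding and compression.

```python
# Assumed fixed per-cell overhead in the classic KeyValue layout:
#   4 (key length) + 4 (value length) + 2 (row length) + 1 (family length)
#   + 8 (timestamp) + 1 (key type)
KEYVALUE_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1

BLOCK_SIZE = 64 * 1024  # the 64KB block size mentioned above


def estimated_cell_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Rough size of one cell in bytes, keys included, before compression."""
    return KEYVALUE_OVERHEAD + len(row) + len(family) + len(qualifier) + len(value)


def cells_per_block(cell_size: int, block_size: int = BLOCK_SIZE) -> int:
    """How many cells of this size fit in one block (never less than 1)."""
    return max(1, block_size // cell_size)


# A tiny value: the key parts outweigh the 5-byte value several times over.
small = estimated_cell_size(b"user#0001", b"d", b"name", b"Alice")
print(small, cells_per_block(small))
```

With small values most of each cell is key bytes, which is one reason short row keys, family names and qualifiers are commonly recommended.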

3 Context on HBase http://ctolist.com/2012/05/context-on-hbase/

The post mentions many companies that are using HBase.

http://java.dzone.com/videos/hbase-schema-design-things-you

4 HBase schema design from Sematext

http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
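The problem named in that post's title is that sequential row keys (timestamps, counters) all land in the same region, so one region server takes every write. A common remedy is to prefix each key with a small bucket number. The sketch below shows only the general idea; it is not HBaseWD's API, and `NUM_BUCKETS`, `salted_key`, `strip_salt` and `scan_ranges` are hypothetical names.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical bucket count, chosen for illustration


def salted_key(key: bytes, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix the key with a one-byte bucket derived from a hash of the key.

    The prefix is deterministic, so a point lookup can recompute it.
    """
    bucket = zlib.crc32(key) % num_buckets
    return bytes([bucket]) + key


def strip_salt(salted: bytes) -> bytes:
    """Recover the original key by dropping the one-byte bucket prefix."""
    return salted[1:]


def scan_ranges(start: bytes, stop: bytes, num_buckets: int = NUM_BUCKETS):
    """A range scan over original keys turns into one scan per bucket."""
    return [(bytes([b]) + start, bytes([b]) + stop) for b in range(num_buckets)]


# Sequential keys now start with different prefixes and spread over regions.
keys = [b"20120813-%06d" % i for i in range(5)]
for k in keys:
    print(salted_key(k))
```

The cost is on the read side: a scan over a key range has to be issued once per bucket and the results merged by the client.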

5 HBase ecosystem http://nosql.mypopescu.com/post/1541593207/quick-reference-hadoop-tools-ecosystem

6 A reasonable explanation of column families in HBase

George, here’s a presentation I gave about understanding HBase schemas from HBaseCon 2012:

http://www.hbasecon.com/sessions/hbase-schema-design-2/

In short, each row in HBase is actually a key/value map, where you can have any number of columns (keys), each of which has a value. (And, technically, each of which can have multiple values with different timestamps).
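As a mental model only (not how HBase stores data internally), the row described above can be sketched as a map from column name to a map of timestamped values. The `Row` class and its method names are made up for this illustration.

```python
from collections import defaultdict


class Row:
    """One HBase-style row: column name -> {timestamp: value}."""

    def __init__(self):
        self.columns = defaultdict(dict)

    def put(self, column: str, value: bytes, timestamp: int) -> None:
        """Store a value for a column at the given timestamp."""
        self.columns[column][timestamp] = value

    def get(self, column: str):
        """Return the value with the newest timestamp, or None if absent."""
        versions = self.columns.get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def get_versions(self, column: str):
        """Return (timestamp, value) pairs, newest first."""
        versions = self.columns.get(column, {})
        return sorted(versions.items(), reverse=True)


row = Row()
row.put("email", b"old@example.com", timestamp=1)
row.put("email", b"new@example.com", timestamp=2)
row.put("name", b"George", timestamp=1)
print(row.get("email"))
print(row.get_versions("email"))
```

Any number of columns can be added to a row without declaring them first, and a plain read returns the newest version of each.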

Additionally, “column families” allow you to host multiple key/value maps in the same row, in different physical (on disk) files. This helps optimize in situations where you have sets of values that are usually accessed disjointly from other sets (so you have less stuff to read off disk). The trade-off is that, of course, it’s more work to read all the values in a row if you separate columns into two column families, because twice as many disk accesses are needed.
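The trade-off can be sketched with a toy model in which each column family is its own map, standing in for its own set of files, and a counter records how many of those stores a read has to touch. `FamilyRow` and `store_reads` are hypothetical names for this illustration, not part of any HBase client.

```python
class FamilyRow:
    """One row split across column families.

    Each family is a separate key/value map, standing in for a separate
    set of on-disk files.
    """

    def __init__(self, families):
        # Families are fixed when the row (table) is created.
        self.stores = {family: {} for family in families}
        self.store_reads = 0  # how many family stores reads have touched

    def put(self, family, qualifier, value):
        # Raises KeyError if the family was not declared up front.
        self.stores[family][qualifier] = value

    def read(self, families=None):
        """Read the given families (default: all of them)."""
        wanted = list(self.stores) if families is None else list(families)
        result = {}
        for family in wanted:
            self.store_reads += 1
            for qualifier, value in self.stores[family].items():
                result[f"{family}:{qualifier}"] = value
        return result


row = FamilyRow(["meta", "content"])
row.put("meta", "title", b"HBase Schema Highlights")
row.put("content", "body", b"...a large blob...")

print(row.read(["meta"]), row.store_reads)  # touches one store
print(row.read(), row.store_reads)          # a full-row read touches both
```

Reading only `meta` skips the large `content` store entirely, while a full-row read pays for both, which is the cost the answer describes.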

Unlike more standard “column oriented” databases, I’ve never heard of anyone creating an HBase table that had a column family for every logical column. There’s overhead associated with column families, and the general advice is usually to have no more than 3 or 4 of them. Column families are “design time” information, meaning you must specify them at the time you create (or alter) the table.

Generally, I find column families to be an advanced design option that you’d only use once you have a deep understanding of HBase’s architecture and can show that it would be a net benefit.

So overall, while it’s true that HBase can act in a “column oriented” way, it’s not the default nor the most common design pattern in HBase. It’s better to think of it as a row store with key/value maps.

Posted on August 13, 2012 in HBase

 
