
Category Archives: Cloud Computing

Distributed System and Network

HBase and Hadoop Family & Terminology

1) Lily: a data management platform built on HBase and Solr ("Smart data, at scale, made easy") http://www.lilyproject.org/lily/index.html

2) Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.

3) MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The “Map” function divides a query into multiple parts and processes data at the node level. The “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query.
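
To make the two phases concrete, here is a minimal word-count sketch written as a Hadoop Streaming job in Python (a sketch only; the file name and the streaming invocation are illustrative and depend on your installation):

#!/usr/bin/env python
# wordcount.py - a minimal Hadoop Streaming sketch (illustrative only).
# Run as mapper:  python wordcount.py map
# Run as reducer: python wordcount.py reduce
# Hadoop Streaming pipes lines through stdin/stdout and sorts by key between the phases.
import sys

def mapper():
    # "Map": split each input line into words and emit (word, 1) pairs.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

def reducer():
    # "Reduce": input arrives sorted by key, so counts for the same word are adjacent.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

Such a job would typically be submitted with the hadoop-streaming jar, e.g. hadoop jar hadoop-streaming.jar -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" -input in -output out (the jar name and paths depend on your setup).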

4) Hive: Hive is a Hadoop-based data warehouse developed by Facebook. It allows users to write queries in a SQL-like language (HiveQL), which are then converted into MapReduce jobs. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.
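
As a rough sketch of "SQL in, MapReduce out" (this assumes a HiveServer2 endpoint and the third-party PyHive package, neither of which is part of Hive itself, and the weblogs table is made up):

# Illustrative sketch only: assumes HiveServer2 on localhost:10000 and the PyHive package.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)  # HiveServer2 default port
cur = conn.cursor()
# An ordinary SQL query; Hive compiles it into one or more MapReduce jobs.
cur.execute("SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")
for page, hits in cur.fetchall():
    print(page, hits)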

5) Pig: Pig is a Hadoop-based platform developed by Yahoo; its language, Pig Latin, is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).

6) HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. eBay and Facebook use HBase heavily.
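
A small sketch of those low-latency lookups, inserts and deletes, using the happybase Python client mentioned later in this blog (illustrative only: it assumes an HBase Thrift gateway on localhost and a made-up "urls" table with an "info" column family):

# Illustrative sketch: point writes, reads and deletes through happybase.
import happybase

connection = happybase.Connection("localhost")   # talks to the HBase Thrift server
table = connection.table("urls")                 # hypothetical table

table.put(b"row-001", {b"info:url": b"http://example.org", b"info:status": b"200"})  # insert/update
print(table.row(b"row-001"))                          # low-latency point lookup by row key
table.delete(b"row-001", columns=[b"info:status"])    # delete a single column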

7) Flume: Flume is a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.

8) Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig and Hive – and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.

9) Whirr: Whirr is a set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace or other virtual infrastructure; it supports all major virtualized infrastructure vendors on the market.

10) Avro: Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.
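
A tiny serialization sketch (using the third-party fastavro package rather than the official Avro bindings; the schema, records and file name are made up):

# Illustrative sketch: the schema travels with the data file.
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "hits", "type": "long"},
    ],
})

records = [{"url": "http://example.org", "hits": 42}]

with open("pageviews.avro", "wb") as out:
    writer(out, schema, records)      # the schema is embedded in the file header

with open("pageviews.avro", "rb") as fo:
    for record in reader(fo):         # the reader recovers the schema from the file
        print(record)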

11) Mahout: Mahout is a data mining library. It takes the most popular data mining algorithms for clustering, regression and statistical modeling and implements them using the MapReduce model.

12) Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.

13) BigTop: BigTop is an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop’s sub-projects and related components, with the goal of improving the Hadoop platform as a whole.

14) Spark: Spark is a scalable data analytics platform that incorporates primitives for in-memory computing, which gives it performance advantages over Hadoop’s disk-based cluster storage approach. It is implemented in Scala, which provides a unique environment for data processing. It suits applications that iterate over a set of read-only datasets many times, such as iterative machine learning jobs: the first iteration may be no faster than Hadoop, but because intermediate results are cached and reused, subsequent iterations are much faster. Its three characteristic features are resilient distributed datasets (RDDs), broadcast variables and accumulators. It relies on Mesos (a cluster manager) for resource sharing and isolation.
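
The three features above can be sketched in a few lines of PySpark (illustrative only; the data, the loop and the local master setting are made up):

# Illustrative PySpark sketch of RDDs, broadcast variables and accumulators.
from pyspark import SparkContext

sc = SparkContext("local[2]", "spark-sketch")

data = sc.parallelize(range(1, 1001))   # an RDD: a resilient distributed dataset
data.cache()                            # keep it in memory so repeated passes are fast

factor = sc.broadcast(3)                # broadcast variable: shipped once to every worker
flagged = sc.accumulator(0)             # accumulator: workers add to it, the driver reads it

def transform(x):
    if x % 100 == 0:
        flagged.add(1)                  # count "interesting" records on the workers
    return x * factor.value             # read the broadcast value

for i in range(5):                      # iterative job: each pass reuses the cached RDD
    total = data.map(transform).reduce(lambda a, b: a + b)
    print("iteration", i, "total", total)

print("flagged records:", flagged.value)
sc.stop()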

Reference

http://wikibon.org/wiki/v/HBase,_Sqoop,_Flume_and_More:_Apache_Hadoop_Defined

http://www.ibm.com/developerworks/library/os-spark/

http://spark-project.org/documentation/

 

 

 

Posted on September 30, 2012 in HBase

 

How to upload data into HBase efficiently

1 This blog post introduces how to import large amounts of data into HBase quickly: http://nosql.mypopescu.com/post/1679204468/hbase-importing-large-data

  • Combine disabling the WAL (write-ahead log), MapReduce, and the latest HBase updates (see the sketch below)
  • Tune a few specific configuration parameters

An explanation of how to make insertions faster: http://people.apache.org/~jdcryans/HUG8/HUG8-rawson.pdf
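
A rough sketch of the "batching + no WAL" idea using the happybase client (illustrative only: disabling the write-ahead log trades durability for speed, and the table and column names are made up):

# Illustrative sketch: buffered client-side writes with the WAL disabled.
import happybase

connection = happybase.Connection("localhost")
table = connection.table("events")   # hypothetical table with column family 'd'

# batch() buffers mutations and flushes them in bulk; wal=False skips the
# write-ahead log on the server, so data can be lost if a region server dies.
with table.batch(batch_size=1000, wal=False) as batch:
    for i in range(100000):
        batch.put(("row-%010d" % i).encode(), {b"d:value": str(i).encode()})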

2 Bulk Importing Data into HBase

http://nosql.mypopescu.com/post/9034681586/bulk-importing-data-into-hbase

This is the approach suggested in the HBase wiki: http://archive.cloudera.com/cdh/3/hbase/bulk-loads.html

 

Posted on September 4, 2012 in HBase

 

How to design table schema in HBase for various applications

This is the problem I am working on during my Master's study.

My supervisor and I published a paper on storing large-scale time-series data in HBase. It proposes using HBase's third dimension, the timestamp (normally used for concurrency control), to store one dimension of the dataset. Our experiments with different schemas demonstrate that the choice of schema has a significant impact on performance (query execution time). The paper also discusses how to design the row key, column and version dimensions.
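
To make the idea concrete, here is a minimal happybase sketch (illustrative only, not the schema from the paper) that stores one dimension of a time series in the version/timestamp slot of a cell:

# Illustrative sketch: using the version (timestamp) dimension to hold the time axis.
# The 'v' column family must be created with enough versions to keep the history.
import happybase

connection = happybase.Connection("localhost")
table = connection.table("timeseries")   # hypothetical table

# Store several readings for the same (row, column) cell, keyed by their own timestamps.
for ts, reading in [(1346716800, b"20.5"), (1346716860, b"20.7"), (1346716920, b"21.0")]:
    table.put(b"sensor-42", {b"v:temp": reading}, timestamp=ts)

# Read back multiple versions of the cell in one lookup (newest first).
print(table.cells(b"sensor-42", b"v:temp", versions=3, include_timestamp=True))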

Interestingly, today I found a slide deck from Facebook that also mentions alternative uses of the third dimension of HBase. Check it out if you are interested: http://qconlondon.com/dl/qcon-london-2011/slides/KannanMuthukkaruppan_HBaseFacebook.pdf

Now I am moving on to location/spatial data, investigating how to store large-scale spatial data in HBase and how to design HBase schemas for spatio-temporal datasets and applications.

I will update the results here soon…

 

Posted on September 4, 2012 in Cloud Computing, HBase

 


Various HBase Clients

Another HBase client library is HappyBase, a Python library for HBase: http://happybase.readthedocs.org/en/latest/index.html

REST – Stargate

1 Start: hbase rest start -p <port number>   Stop: hbase rest stop
2 Get software version:
curl http://localhost:8000/version
curl -H "Accept: application/json" http://localhost:8000/version
3 Get storage cluster version:
curl http://localhost:8000/version/cluster
4 Get storage cluster status:
curl -H "Accept: text/xml" http://localhost:8000/status/cluster
The result includes: number of requests, number of regions, average load, and the region names on each node.
5 Get table list:
curl -H "Accept: application/json" http://localhost:8000/
The result looks like: {"Table":[{"name":"content"},{"name":"urls"}]}
6 Get table schema:
curl http://localhost:8000/<table>/schema
7 Get table metadata (regions):
curl -H "Accept: application/json" http://localhost:8000/content/regions

 

Posted on September 3, 2012 in HBase

 


HBase Regions and Region Servers

Common question: how do you determine the right number of regions per region server?

1 What is a region? How are regions assigned to region servers? http://hbase.apache.org/book/regions.arch.html

2 Bigger or Smaller Region? http://hbase.apache.org/book/important_configurations.html#bigger.regions

Can the data organization be affected by the distribution of regions?

Monitor the regions.

 

Posted on August 19, 2012 in HBase

 

HBase Monitoring

1 HBase Monitoring in the book http://hbase.apache.org/book/ops.monitoring.html

2 HBase provides a JMX interface for monitoring, and there is a tool that can be used with it: http://blog.monitis.com/index.php/2012/03/28/monitoring-hbase/

http://my.safaribooksonline.com/book/databases/database-design/9781449314682/10dot-cluster-monitoring/id3289013

There is a toolkit for accessing MBeans: https://github.com/larsgeorge/jmxtoolkit

The remote service URL format for jconsole or JMXToolkit: service:jmx:rmi:///jndi/rmi://127.0.0.1:10101/jmxrmi

A JMX command-line client: http://crawler.archive.org/cmdline-jmxclient/downloads.html
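
Besides jconsole and JMXToolkit, recent HBase daemons can also expose their MBeans as JSON over HTTP; a hedged sketch (the /jmx path and the default master info port 60010 are assumptions about your HBase version and configuration):

# Illustrative sketch: reading HBase metrics from the /jmx JSON servlet on the master web UI.
import requests

resp = requests.get("http://localhost:60010/jmx")     # master info port, if /jmx is enabled
for bean in resp.json().get("beans", []):
    name = bean.get("name", "")
    if "Master" in name or "RegionServer" in name:
        print(name)   # each bean dict also carries the individual metric values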

3 Visualizing HBase logs can also help diagnose problems and find root causes. (This is what an intern in our lab did in the summer of 2012 using Timeline. Since it still has some performance issues, I will continue working on making it usable.)

4 Visualizing the Hadoop file system might also help with performance tuning.

http://engineering.twitter.com/2012/08/visualizing-hadoop-with-hdfs-du.html

5 How to monitor Hadoop with Cacti:

http://www.jointhegrid.com/development/

 

Posted on August 19, 2012 in HBase

 


HBase Schema Highlights

0 When to use HBase

  • Storing large amounts of data (100s of TBs)
  • need high write throughput
  • need efficient random access (key lookups)
  • need to scale gracefully with data
  • for structured and semi-structured data
  • don’t need full RDBMS capabilities (cross-row/cross-table transactions, joins, etc.)

1 Every region is served by one and only one region server.

2 An ideal cell size would probably be the size of a block, so 64KB including the keys.

NOTE: This gives a hint about how to organize your data via the (row, column, version) design.

3 Context on HBase http://ctolist.com/2012/05/context-on-hbase/

Many companies that use HBase are mentioned there.

http://java.dzone.com/videos/hbase-schema-design-things-you

4 HBase schema design tips from Sematext (avoiding region server hotspotting with sequential keys):

http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/

5 HBase ecosystem http://nosql.mypopescu.com/post/1541593207/quick-reference-hadoop-tools-ecosystem

6 A reasonable explanation of column families in HBase

George, here’s a presentation I gave about understanding HBase schemas from HBaseCon 2012:

http://www.hbasecon.com/sessions/hbase-schema-design-2/

In short, each row in HBase is actually a key/value map, where you can have any number of columns (keys), each of which has a value. (And, technically, each of which can have multiple values with different timestamps).

Additionally, “column families” allow you to host multiple key/value maps in the same row, in different physical (on disk) files. This helps optimize in situations where you have sets of values that are usually accessed disjointly from other sets (so you have less stuff to read off disk). The trade off is that, of course, it’s more work to read all the values in a row if you separate columns into two column families, because there are 2x the number of disk accesses needed.

Unlike more standard “column oriented” databases, I’ve never heard of anyone creating an HBase table that had a column family for every logical column. There’s overhead associated with column families, and the general advice is usually to have no more than 3 or 4 of them. Column families are “design time” information, meaning you must specify them at the time you create (or alter) the table.

Generally, I find column families to be an advanced design option that you’d only use once you have a deep understanding of HBase’s architecture and can show that it would be a net benefit.

So overall, while it’s true that HBase can act in a “column oriented” way, it’s not the default nor the most common design pattern in HBase. It’s better to think of it as a row store with key/value maps.
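
As a small illustration of the "fixed at design time, keep it to a few families" advice, here is how a table with two column families might be created with the happybase client linked earlier (a sketch only; the table name and the per-family options are made up):

# Illustrative sketch: column families are declared when the table is created.
import happybase

connection = happybase.Connection("localhost")

connection.create_table(
    "users",
    {
        "profile": dict(max_versions=1),    # values usually read together
        "activity": dict(max_versions=3),   # values usually accessed separately
    },
)

table = connection.table("users")
table.put(b"user-1", {b"profile:name": b"Ada", b"activity:last_login": b"2012-08-13"})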

 

Posted on August 13, 2012 in HBase

 
