Monthly Archives: September 2012

HBase and Hadoop Family & Terminology

1) Lily: Smart data, at scale, made easy

2) Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.

3) MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The “Map” function divides a query into multiple parts and processes data at the node level. The “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query.

4) Hive: Hive is a Hadoop-based data warehouse developed by Facebook. It allows users to write queries in a SQL-like language (HiveQL), which are then converted to MapReduce jobs. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.

5) Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).

6) HBase: HBase is a non-relational database that allows for low-latency lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to perform updates, inserts and deletes. eBay and Facebook use HBase heavily.

7) Flume: Flume is a framework for populating Hadoop with data. Agents are deployed throughout one’s IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.

8) Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig and Hive – then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is to be initiated only after the previous jobs on which it relies for data have completed.

9) Whirr: Whirr is a set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace or any virtual infrastructure. It supports all major virtualized infrastructure vendors on the market.

10) Avro: Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.

11) Mahout: Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression and statistical modeling and implements them using the MapReduce model.

12) Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.

13) BigTop: BigTop is an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop’s sub-projects and related components, with the goal of improving the Hadoop platform as a whole.

14) Spark: Spark is a scalable data analytics platform that incorporates primitives for in-memory computing and therefore offers some performance advantages over Hadoop’s cluster storage approach. It is implemented in Scala, which provides a unique environment for data processing. It is used by applications that need to iterate over a set of read-only datasets many times, such as interactive machine learning jobs. The first iteration is slower than Hadoop, but because the intermediate results are reused, the following iterations perform much better. Its three typical features are resilient distributed datasets (RDDs), broadcast variables and accumulators. It relies on Mesos (a cluster manager) for resource sharing and isolation.
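
To make the “Map” and “Reduce” parts from item 3 a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration and does not use the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) and the word-count task are my own assumptions, and everything runs in one process instead of on a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of input splits (hypothetical helper)."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    splits = ["HBase stores data", "Hive queries data"]
    print(word_count(splits))
```

In a real cluster each split would be mapped on the node that holds it, and the grouping step would move data across the network before the reducers run.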





Posted on September 30, 2012 in HBase


Investigation about Spatial data in NoSQL databases

1 Neo4j is a graph database. You can add different indexing schemes dynamically as you go. Here is the blog for Neo4j

2 CouchDB has a spatial extension

GeoCouch Talk:

GeoCouch Talk Audio:

3 Cassandra has an extension


Posted on September 28, 2012 in Uncategorized


Book: Mindfulness for Dummies

This book is very useful to me. Thanks to Fan Xie for the recommendation. I will continue this post as my reading progresses.

Here I just want to excerpt some points that I think are very important to me:

1 Mindfulness can improve productivity.

To be mindful, you usually need to do one thing at a time. When walking, you just walk. When listening, you just listen. When writing, you just write. By practicing formal and informal meditation, you are training your brain. You are training it to pay attention with mindful attitudes like kindness, curiosity and acknowledgement.

You can train attention, just as you can train your biceps in a gym. Meditation is gym for the mind.

2 Overcoming Challenges

  You are running to achieve goals so that you can be peaceful and happy, but actually you are running away from the peace and happiness. Mindfulness is an invitation to stop running and rest.

3 Calming the Mind

  If you are anxious, you may just block your understanding. A martial arts student went to his teacher and said earnestly, “I’m devoted to studying your martial system. How long will it take me to master it?” The teacher’s reply was casual, “Ten years.” Impatiently, the student answered, “But I want to master it faster than that. I’ll work very hard. I’ll practise every day, ten or more hours a day if I have to. How long will it take then?” The teacher thought for a moment and replied, “Twenty years.”


Posted on September 28, 2012 in book, Uncategorized


People judge you by the words you use

People judge you by the words you use. It makes me scared, worried, and inspired. My posts here bear witness to my English skill.


Tips for how to be Articulate

  • Keep up to date with current events and know your history. This is not mandatory, but assists in intelligent conversation. What use is your speaking ability if you have nothing to speak about?
  • Know the difference between sounding articulate and just trying to sound educated. Using big words = educated. Using words that everyone understands = articulate. Adding unassociated statistics = educated. Knowing the small details of your position = articulate.
  • If you struggle with any aspect of articulation (you can’t eliminate verbal pauses, can’t think before you speak, have a weak vocabulary, can’t speak without slang or vulgarities, etc.), DO NOT DESPAIR! By simply reading aloud any professional writing such as a book, newspaper, or article, you can instantaneously possess all the aforementioned qualities an articulate speaker should have!
    • The key to becoming an independent articulate speaker, however, is to look up words and pronunciations you are unfamiliar with, and to refine your pronunciation as you read aloud more often. Just like physical exercise, you’ll notice your voice gaining strength, and through practice your brain will become accustomed to speaking articulately. Training your voice can be fun, just as any artist can develop and hone a unique style, but know that knowledge and consistency are king. By reading aloud you’ll strengthen your voice and gain knowledge at the same time! There are great role models, but in the end you have to put in the effort! Either through conversation or by reading aloud, PRACTICE.
  • If you find it difficult to stop saying ‘um’ etc out loud, try thinking the word instead.

Posted on September 23, 2012 in Life


Tons of Papers

This post records the papers I like:

1) Spanner: Google’s Globally-Distributed Database

2) Designs, Lessons and Advice from Building Large Distributed Systems

3) HBase on Facebook

4) Parallel Database Systems: The Future of High Performance Database Systems

5) CMU Parallel Data Lab Publication List

6) Above the Clouds: A Berkeley View of Cloud Computing


Posted on September 18, 2012 in Resource, Uncategorized


My Publications

[1] A Three-Dimensional Data Model in HBase for Large Time-Series Dataset. Paper: mesoca-han-dan-hbase-model. Slides: A 3-Dimensional Data Model in HBase for Large Time-Series Dataset

[2] Understanding Android Fragmentation with Topic Analysis of Vendor-Specific Bugs


Posted on September 15, 2012 in Uncategorized


How to prepare a presentation effectively

I am not good at giving presentations, and I always get nervous from the moment I am told that I have to give one. Whenever I am asked to present something, the whole week before the presentation gets messy. Obviously, this is not right. I absolutely have to sit down and think about it.

So this time, I decided to find the root cause. I spent a whole week preparing a presentation, and I observed myself. I realized that it took me four days to prepare the slides. The first version of the slides was finished at midnight on the fourth day, the deadline I had set for myself.

Is this procrastination? I don’t think so, because I did not put off the work. The problem is that I could not do it efficiently. I started to recall what I had done in the previous three days.

I started the preparation one week earlier. I first listed a lot of questions about the content I was going to present, and then I started to look for the answers. I googled them with several combinations of keywords, read lots of articles, and downloaded and read several papers. This step took me three days. Why did I want to do this? Because I wanted to be able to explain everything I was going to say. Because I was worried that I could not cover all possible cases. Because I was afraid that I could not totally convince everyone. The inertia of debating led me to collect as much information as possible. I suffered badly from collecting too much evidence.

Basically, I made the mistake of perfectionism. I wanted to make perfect slides, so I explored every question I could think of and tried to answer them all. At the last minute, Reference [2] saved me and pulled me back: “Accept that there’s no such thing as a perfect presentation.”

So I learned my lesson. There is no perfect presentation. No one can make a perfect presentation in the first version. You cannot convince everyone with your small piece of work. What you have to do is explain it as clearly as possible in the presentation. Your presentation can be improved through practice. Nervousness can be overcome by good preparation. Notes and practice can also help you trim your ideas. So the first version of the slides should not take more than two hours. Most of the time should be spent iterating between practicing and revising the slides.

1 How to make an effective PowerPoint Presentation

2 How to save time preparing a presentation

3 Prepare Presentation Advice


Posted on September 14, 2012 in Life