RSS

HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget

31 Jul

This is one topic in HBase Conference 2012. The talk is from Yapmap. In this talk, Jacques Nadeau, CTO from yapmap talked about lots of details of HBase.

This is the video link: http://www.cloudera.com/resource/hbasecon-2012-building-a-large-search-platform-on-a-shoestring-budget-video/

Abstract and Description of this talk

YapMap is a new kind of search platform that does multi-quanta search to better understand threaded discussions. This talk will cover how HBase made it possible for two self-funded guys to build a new kind of search platform. We will discuss our data model and how we use row based atomicity to manage parallel data integration problems. We’ll also talk about where we don’t use HBase and instead use a traditional SQL based infrastructure. We’ll cover the benefits of using MapReduce and HBase for index generation. Then we’ll cover our migration of some tasks from a message based queue to the Coprocessor framework as well as our future Coprocessor use cases. Finally, we’ll talk briefly about our operational experience with HBase, our hardware choices and challenges we’ve had.

He developed the talk with the following six aspects: what is Yapmap, interfacing with Data, Using Hbase as a data processing pipeline, NoSQL Schemas: Adjusting and Migrating, Index Construction, and HBase operations

1 what is YapMap?

It is a visual search technology. It is to show the threaded conversations in forums with better context and ranking. It is built on Hadoop & Hbase for massive scale. Motoyap.com is a largest implementation at 650mm automobile docs.

The motivation of the product is because the discussion forums and mailing list primary home for many hobbies. Another reason is that threaded search sucks because there is no context in the middle of the  conversation.

1.2 Conceptual data model

  • Entire Thread is MainDocGroup
  • For long threads, a single group may have multiple MainDocs
  • Each individual post is a DetailDoc

1.3 General Architecture

1.4 Match the tool to the user case

2  Interfacing with data

2.1 HBase client is a power user interface

Problems: 1) Hbase client interface is low-level, it is similar to JDBC/SQL

2) Most people start use it with Bytes.to(String…) It is called Spaghetti data layer

3) New developers have to learn benches of new concepts

4) Mistakes are easy to make

2.2 We built a DTO layer to simplify dev

1) Data Transfer Objects & data access layer provide single point for code changes and data migration

2) First-class row key object

3) Centralized type serialization: standard data types & complex object serialization layer via protobuf

4) Provide optimistic locking (It is not clear to me.)

5) Asynchronous operation

6) newer tools to ease this burden

2.3 Examples for the DTO data model and the primary key design

3 Using HBase as a data processing pipeline

3.1 Processing pipeline is built on HBase

3.2 Migrating from messaging to coprocessors

Big challenges: 1) mixing the system code and application code ; 2)memory impact: We have a GC stable state

4 NoSQL Schemas: Adjusting and Migrating

4.1 Learn to leverage NoSQL strenghths 

This part is to describe how they change the data schema in NoSQL database

4.2 Schema Migration Steps

As they have to migrate the data from schema1 to schema2, they need to migrate the data

5 Index Construction

5.1 Index Shards loosely based on HBase Regions

1) Indexing is split between major (batch) indices and minor (real time)

2) Primary key order is same as index order

3) Shards are based on snapshots of splits

4) IndexedTableSplit allows cross-region shard splits to be integrated at index load time

5.2 Batch Indices are memory based, stored on DFS  

1) Total of all shards about 1TB

2) Each shard about 5GB in size to improve parallelism on search

3) Each shard is composed of multiple map and reduce parts and MapRd statistics from HBase

6 HBase Operation

1) Getting GC right

2) Region size right

3) Cautious about multiple Column Familiies

4) Consider Asynchbase client

5) Upgrade of HBase

6) MapR’s M3 distribution of Hadoop

6 Questions

1) why not lucence/solr/elasticSearch/etc

2) Why not store indices directly in HBase?

3) Why use Riak for presentation side?

4) Why did you switch to MapR?

Conclusion:

What I learnt: 1) what most interests me is the NoSQL schema adusting and migration and the index construction. 2) I learnt about the Hbase operation, new tools for data layer access (kundera and Gora)

Still questions: I still don’t quite understand why not store the indices directly in HBase.

Advertisements
 
Leave a comment

Posted by on July 31, 2012 in Cloud Computing

 

Tags:

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: