hbase-user mailing list archives

From Ian Varley <ivar...@salesforce.com>
Subject Re: how to store 100billion short text messages with hbase
Date Thu, 06 Dec 2012 03:00:42 GMT

The best way to think about how to structure your data in HBase is to ask the question: "How
will I access it?". Perhaps you could reply with the sorts of queries you expect to be able
to do over this data? For example, retrieve any single conversation between two people in
< 10 ms; or show all conversations that happened in a single hour, regardless of participants.
HBase only gives you fast GET/SCAN access along a single "primary" key (the row key) so you
must choose it carefully, or else duplicate & denormalize your data for fast access.
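To make that concrete, here is a minimal, purely illustrative sketch (plain Java, no HBase client involved) of a row key that would serve the first example query, "retrieve any single conversation between two people". The user-id/reversed-timestamp layout is an assumption for illustration, not something from this thread:

```java
// Hypothetical row-key layout for "fetch a conversation between two users,
// newest messages first". All names and formats here are assumptions.
public class RowKeySketch {
    // Put the smaller user id first so (A,B) and (B,A) share one key prefix.
    static String conversationPrefix(String userA, String userB) {
        return userA.compareTo(userB) <= 0
                ? userA + "|" + userB
                : userB + "|" + userA;
    }

    // Reverse the timestamp so a forward SCAN returns newest messages first;
    // zero-pad to a fixed width so keys sort correctly as bytes.
    static String rowKey(String userA, String userB, long epochMillis) {
        long reversed = Long.MAX_VALUE - epochMillis;
        return conversationPrefix(userA, userB) + "|" + String.format("%019d", reversed);
    }

    public static void main(String[] args) {
        String earlier = rowKey("bob", "alice", 1354762800000L);
        String later   = rowKey("alice", "bob", 1354766400000L); // one hour later
        System.out.println(earlier);
        System.out.println(later);
        // The later message sorts *before* the earlier one under the same prefix:
        System.out.println(later.compareTo(earlier) < 0); // prints "true"
    }
}
```

With a key like this, one conversation is a single contiguous SCAN over a row-key prefix, which is exactly the access pattern HBase is fast at.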

Your data size seems reasonable (but not overwhelming) for HBase. 100B messages x 1K bytes
per message on average comes out to 100TB. That, plus 3x replication in HDFS, means you need
roughly 300TB of space. If you have 13 nodes (taking out 2 for redundant master services)
that's a requirement for about 23 TB of space per server. That's a lot, even these days. Did
I get all that math right?
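For what it's worth, the arithmetic above checks out. Here it is as a small sketch (plain Java, decimal units, ignoring compression and HBase's per-cell key overhead):

```java
// Re-checking the sizing math from the paragraph above: pure arithmetic,
// no HBase involved.
public class SizingMath {
    static long perNodeTB(long messages, long bytesPerMsg, int replication, int nodes) {
        // 100e9 msgs * 1000 bytes = 100 TB raw; x3 HDFS replication = 300 TB;
        // spread over the data-serving nodes.
        return messages * bytesPerMsg * replication / nodes / 1_000_000_000_000L;
    }

    public static void main(String[] args) {
        long messages = 100_000_000_000L; // 100 billion
        long bytesPerMessage = 1_000L;    // ~1 KB average
        int replication = 3;              // HDFS default
        int dataNodes = 13;               // 15 nodes minus 2 for master services
        System.out.println(perNodeTB(messages, bytesPerMessage, replication, dataNodes)
                + " TB per node"); // prints "23 TB per node"
    }
}
```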

On your question about multiple tables: a table in HBase is only a namespace for rowkeys,
and a container for a set of regions. If it's a homogeneous data set, there's no advantage
to breaking the table into multiple tables; that's what regions within the table are for.
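If you want the per-day partitioning that 365 tables would have given you, one option is a single table pre-split on day boundaries. A sketch of generating those split points (the day-bucketed key layout is an assumption; note also that a purely time-leading key concentrates new writes on one region at a time):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Sketch: instead of 365 tables, one table whose row keys begin with a day
// bucket. The split boundaries below could be passed to table creation so
// each day's data lands in its own region from the start.
public class SplitPoints {
    static List<String> dailySplits(int year) {
        List<String> splits = new ArrayList<>();
        LocalDate d = LocalDate.of(year, 1, 1);
        while (d.getYear() == year) {
            splits.add(d.toString()); // e.g. "2011-01-01" as a split boundary
            d = d.plusDays(1);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> splits = dailySplits(2011);
        System.out.println(splits.size());  // prints "365"
        System.out.println(splits.get(0));  // prints "2011-01-01"
    }
}
```

Regions then give you the same "one partition per day" layout the multi-table scheme was after, without 365 separate namespaces to manage.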


ps - Please don't cross post to both dev@ and user@.

On Dec 5, 2012, at 8:51 PM, tgh wrote:

> Hi
> 	I am trying to use HBase to store 100 billion short text messages; each
> message has fewer than 1000 characters plus some other items, that is,
> each message has fewer than 10 items.
> 	The whole dataset is a stream spanning about one year, and I want to
> create multiple tables to store it. I have two ideas: one is to store each
> hour's data in its own table, so for one year there would be 365*24 tables;
> the other is to store each day's data in one table, so for one year there
> would be 365 tables.
> 	And I have about 15 computer nodes to handle this data, and I want
> to know how to deal with it: the 365*24-table design, the 365-table design,
> or some better idea.
> 	I am really confused about HBase; it is powerful yet a bit complex
> for me, isn't it?
> 	Could you give me some advice on HBase data schema and other matters?
> 	Could you help me?
> Thank you
> ---------------------------------
> Tian Guanhua
