hbase-user mailing list archives

From "Jonathan Gray" <jl...@streamy.com>
Subject Re: Cassandra vs HBase
Date Thu, 03 Sep 2009 18:04:20 GMT
There are lots of interesting ways you can design your tables and keys to
avoid the single-regionserver hotspotting.

A long time ago I did an experimental design that prepended a random value to
every row key, where the value was taken modulo the number of regionservers
(or anywhere between 1/2 and 2x the number of regionservers), so for a given
timestamp there would be that many potential regions the row could go into.
This design doesn't make time-range MR jobs very efficient, though, because a
single time range is spread out across the entire table... But I'm not sure
you can avoid that if you want good distribution; the two requirements are at
odds.
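
For illustration, a rough sketch of what that salting could look like.  The
bucket count, key layout, and helper names here are just illustrative
assumptions, not the original experiment:

// Sketch of a salted row key: prepend a random bucket so rows for the same
// time range spread across roughly one region per bucket.  Bucket count,
// delimiter and names are placeholders.
import java.util.Random;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKeys {
  // Somewhere between 1/2x and 2x the number of regionservers.
  private static final int NUM_BUCKETS = 16;
  private static final Random RAND = new Random();

  /** Row key of the form "<bucket>|<timestamp>|<eventId>". */
  public static byte[] saltedKey(long timestamp, String eventId) {
    int bucket = RAND.nextInt(NUM_BUCKETS);
    return Bytes.toBytes(String.format("%02d|%013d|%s", bucket, timestamp, eventId));
  }

  /** A time-range scan now has to fan out: one scan per bucket over the same range. */
  public static byte[][] bucketStartKeys(long startTimestamp) {
    byte[][] starts = new byte[NUM_BUCKETS][];
    for (int b = 0; b < NUM_BUCKETS; b++) {
      starts[b] = Bytes.toBytes(String.format("%02d|%013d|", b, startTimestamp));
    }
    return starts;
  }
}

The fan-out in bucketStartKeys() is exactly why the time-range MR jobs get
less efficient.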

You say 2TB of data a day on 10-20 nodes?  What kinds of nodes are you
expecting to use?

In a month, that's 60TB of data, so 3-6TB per node?  And that's
pre-replication, so you're talking 9-18TB per node?  And you want full
random access to that, while running batch MR jobs, while continuously
importing more?  Seems that's a tall order.  You'd be adding >1000 regions
a day... and on 10-20 nodes?
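
Back-of-the-envelope, assuming 3x HDFS replication and roughly 1-2 GB per
region (the region size is my assumption):

  2 TB/day x 30 days              = ~60 TB/month raw
  60 TB / 10-20 nodes             = 3-6 TB per node, pre-replication
  3-6 TB x 3 (HDFS replication)   = 9-18 TB per node on disk
  2 TB/day / ~1-2 GB per region   = ~1000-2000 new regions per day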

Do you really need full random access to the entire raw dataset?  Could
you load into HDFS, run batch jobs against HDFS, but also have some jobs
that take HDFS data, run some aggregations/filters/etc, and then put
_that_ data into HBase?
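
Something along these lines, roughly.  The table name, column family, input
format, and the parsing/aggregation are all placeholders for whatever you
actually need:

// Sketch of a batch MR job: read raw data from HDFS, aggregate, and write
// only the aggregated rows into an HBase "summary" table (name assumed).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class AggregateToHBase {

  // Map: parse a raw line into (key, 1).  Field layout is an assumption.
  static class ParseMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    protected void map(LongWritable offset, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      ctx.write(new Text(fields[0]), ONE);   // fields[0] assumed to be the aggregation key
    }
  }

  // Reduce: sum the counts and emit a Put into the summary table.
  static class SumReducer extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
        throws java.io.IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) sum += v.get();
      Put put = new Put(Bytes.toBytes(key.toString()));
      put.add(Bytes.toBytes("agg"), Bytes.toBytes("count"), Bytes.toBytes(sum));
      ctx.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "aggregate-to-hbase");
    job.setJarByClass(AggregateToHBase.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // raw data already in HDFS
    job.setMapperClass(ParseMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);
    TableMapReduceUtil.initTableReducerJob("summary", SumReducer.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

That way the full raw dataset lives in HDFS where the batch jobs are cheap,
and HBase only has to serve random access against the much smaller aggregated
view.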

You also say you're going to delete data.  What's the time window you want
to keep?
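
If the expiry is purely time-based, you may not even need explicit deletes; a
TTL on the column family will age data out at compaction time.  A minimal
sketch (table name, family name, and the 30-day window are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateEventsTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTableDescriptor table = new HTableDescriptor("events");   // table name is a placeholder
    HColumnDescriptor family = new HColumnDescriptor("d");     // so is the family name
    // Cells older than ~30 days get dropped during major compactions.
    family.setTimeToLive(30 * 24 * 60 * 60);
    table.addFamily(family);
    new HBaseAdmin(conf).createTable(table);
  }
}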

HBase is capable of handling lots of stuff but you seem to want to process
very large datasets (and the trifecta: heavy writes, batch/scanning reads,
and random reads) on a very small cluster.  10 nodes is really the bare
minimum for any production system serving any reasonably sized dataset
(>1TB), unless the individual nodes are very powerful.

JG

On Thu, September 3, 2009 12:15 am, stack wrote:
> On Wed, Sep 2, 2009 at 11:37 PM, Schubert Zhang <zsongbo@gmail.com> wrote:
>
>>> Do you need to keep it all?  Does some data expire (or can it be
>>> moved offline)?
>>
>> Yes, we need to remove old data when it expires.
>
> When does data expire?  Or, how many billions of rows should your cluster
> of 10-20 nodes carry at a time?
>
>> The data will arrive with a delay of minutes.
>>
>> Usually, we need to write/ingest tens of thousands of new rows.  Many rows
>> with the same timestamp.
>
> Will the many rows of same timestamp all go into the one timestamp row or
> will the key have a further qualifier such as event type to distinguish
> amongst the updates that arrive at the same timestamp?
>
> What do you see as the approximate write rate and what do you think its
> spread across timestamps will be?  E.g. 10000 updates a second and all of
> the updates fit within a ten second window?
>
> Sorry for all the questions.
>
> St.Ack

