hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Cluster Size/Node Density
Date Fri, 17 Dec 2010 19:46:46 GMT
Hi Wayne,

This question has such a large scope but is applicable to such a tiny
subset of workloads (eg yours) that fielding all the questions in
detail would probably just waste everyone's cycles. So first
I'd like to clear up some confusion.

> We would like some help with cluster sizing estimates. We have 15TB of
> currently relational data we want to store in hbase.

There's the 3x replication factor, but you also have to account for
the fact that each value is stored with its row key, family name,
qualifier and timestamp. That can be a lot more data to store, but at
the same time you can use LZO compression to bring it back down ~4x.
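To make that arithmetic concrete, here's a hedged back-of-the-envelope sketch; the key/value overhead factor and the LZO ratio below are assumptions for illustration, not measurements:

```python
# Rough sizing sketch: every HBase cell carries its row key, family,
# qualifier and timestamp alongside the value, so the raw payload
# inflates before replication; compression then shrinks it again.
payload_tb = 15.0    # the relational data to migrate
kv_overhead = 2.0    # ASSUMED blow-up from per-cell key/family/qualifier/timestamp
replication = 3      # HDFS replication factor
lzo_ratio = 4.0      # ASSUMED LZO compression ratio (~4x, as above)

on_disk_tb = payload_tb * kv_overhead * replication / lzo_ratio
print(on_disk_tb)  # → 22.5
```

Plug in your own measured overhead and compression numbers; the point is that the two factors partially cancel, so the final footprint can land surprisingly close to 3x the raw payload.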

> How many nodes, regions, etc. are we going to need?

You don't really have control over regions; they are created for you
as your data grows.
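What you can tune is the split threshold, i.e. how big a region gets before it splits. A sketch of the relevant hbase-site.xml knob (the 1 GB value is illustrative, not a recommendation):

```xml
<!-- hbase-site.xml: a region splits once its biggest store file
     exceeds this size. You tune the threshold; HBase decides
     when and where regions are created. -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value>
</property>
```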

> What will our read latency be for 30 vs. 100? Sure we can pack 20 nodes with 3TB
> of data each but will it take 1+s for every get?

I'm not sure what kind of back-of-the-envelope calculation led you to
1 s, but latency is determined mostly by concurrency and actual
machine load. Even if you packed 20TB into one node but only used a
tiny portion of it, you would still get sub-100ms latencies.
Conversely, if that node holds only 10GB but is getting hammered by
10,000 clients, you should expect much higher latencies.

> Will compaction run for 3 days?

Which compactions? Major ones? If you don't insert new data into a
region, it won't be major-compacted. Also, with that much data I would
set the interval between major compactions to more than one day. In
fact, since you are doing time series, you'll never delete anything,
right? So you might as well disable them.
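If you do go the disabling route, the relevant hbase-site.xml setting looks like this (0 turns off the time-based trigger; you can still kick off major compactions manually when it suits you):

```xml
<!-- hbase-site.xml: interval between automatic major compactions,
     in milliseconds. 0 disables the timer entirely; manual major
     compactions remain available. -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>
```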

And now for the meaty part...

The size of your dataset is only one part of the equation; the other
is the traffic you would be pushing to the cluster, which I don't
think was covered at all in your email. Like I said previously, you
can pack a lot of data into a single node and retrieve it really fast
as long as concurrency is low. Another factor is how random your read
pattern is... can you leverage the block cache at all? If yes, you can
accept more concurrency; if not, hitting HDFS is a lot slower (and
HDFS is still not very good at handling many concurrent clients).
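For reference, the fraction of the region server heap given to the block cache is a single hbase-site.xml setting (0.4 here is an illustrative value; raising it only pays off if your read pattern actually hits the cache):

```xml
<!-- hbase-site.xml: fraction of the region server heap reserved
     for the block cache. Only worth increasing when reads have
     enough locality to get cache hits. -->
<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value>
</property>
```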

Unfortunately, even if you told us exactly how many queries per second
you want to do, we'd have a hard time recommending a specific number
of nodes.

What I would recommend instead is to benchmark it. Grab 5-6 machines,
load a subset of the data, generate representative traffic, and see
how it behaves.
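HBase ships with a benchmarking tool you could start from; a sketch of how a run might look (the client counts are illustrative, and of course these commands need a live test cluster):

```
# PerformanceEvaluation ships with HBase; point it at your test cluster.
# Write a test dataset with 10 concurrent client threads:
hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 10
# Then measure random-read latency at the same concurrency:
hbase org.apache.hadoop.hbase.PerformanceEvaluation randomRead 10
```

That gives you a synthetic baseline; loading a real subset of your own data and replaying your own query mix will always be more representative.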

Hope that helps,

J-D

On Fri, Dec 17, 2010 at 9:09 AM, Wayne <wav100@gmail.com> wrote:
> We would like some help with cluster sizing estimates. We have 15TB of
> currently relational data we want to store in hbase. Once that is replicated
> to a factor of 3 and stored with secondary indexes etc. we assume will have
> 50TB+ of data. The data is basically data warehouse style time series data
> where much of it is cold; however, we want good read latency to get access to
> all of it. Not in-memory latency, but < 25ms latency for small chunks of
> data.
>
> How many nodes, regions, etc. are we going to need? Assuming a typical 6
> disk, 24GB ram, 16 core data node, how many of these do we need to
> sufficiently manage this volume of data? Obviously there are a million "it
> depends", but the bigger drivers are how much data can a node handle? How
> long will compaction take? How many regions can a node handle and how big
> can those regions get? Can we really have 1.5TB of data on a single node in
> 6,000 regions? What are the true drivers between more nodes vs. bigger
> nodes? Do we need 30 nodes to handle our 50TB of data or 100 nodes? What
> will our read latency be for 30 vs. 100? Sure we can pack 20 nodes with 3TB
> of data each but will it take 1+s for every get? Will compaction run for 3
> days? How much data is really "too much" on an hbase data node?
>
> Any help or advice would be greatly appreciated.
>
> Thanks
>
> Wayne
>
