hbase-user mailing list archives

From Wayne <wav...@gmail.com>
Subject Re: Cluster Size/Node Density
Date Fri, 17 Dec 2010 20:29:24 GMT
Sorry, I am sure my questions were far too broad to answer.

Let me *try* to ask more specific questions. Assuming all data requests
are cold (random read pattern) and everything comes from the disks (no
block cache), what level of concurrency can HDFS handle? Almost all of
the load is controlled data processing, but we have to do a lot of work
at night during the batch window, so something in the 15-20,000 QPS
range would meet current worst-case requirements. How many nodes would
be required to effectively return data against a 50TB aggregate data
store? Disk I/O presumably starts to break down at a certain point in
terms of concurrent readers per node per disk. We have control over how
many total concurrent readers there are, so if we can get 10ms response
times with 100 readers, that might be better than 100ms responses from
1,000 concurrent readers.
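
To put rough numbers on that trade-off, assuming each reader keeps a
single get outstanding at a time:

  100 readers   x (1 get / 0.010 s) = ~10,000 gets/sec
  1,000 readers x (1 get / 0.100 s) = ~10,000 gets/sec

So the aggregate throughput works out about the same either way; the
real question is which level of concurrency the disks can sustain
without latency degrading further.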

Thanks.


On Fri, Dec 17, 2010 at 2:46 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> Hi Wayne,
>
> This question has such a large scope, but is applicable to such a tiny
> subset of workloads (e.g. yours), that fielding all the questions in
> detail would probably end up just wasting everyone's cycles. So first
> I'd like to clear up some confusion.
>
> > We would like some help with cluster sizing estimates. We have 15TB of
> > currently relational data we want to store in hbase.
>
> There's the 3x replication factor, but you also have to account for the
> fact that each value is stored with its row key, family name, qualifier
> and timestamp. That can be a lot more data to store, but at the same
> time you can use LZO compression to bring it back down ~4x.
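>
> As a very rough sketch of the overhead (all the sizes below are made-up
> placeholders, plug in your own schema):
>
>   bytes per cell on disk ~= row key + family + qualifier + value
>                             + ~20 bytes of KeyValue framing
>                               (lengths, timestamp, type)
>   e.g. 16 + 1 + 4 + 50 + 20 = ~91 bytes for a 50-byte value
>
> Multiply that by 3 for HDFS replication, then divide by whatever LZO
> buys you on your data (often in the 3-4x range) to get a ballpark
> on-disk footprint.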
>
> > How many nodes, regions, etc. are we going to need?
>
> You don't really have control over regions; they are created for you as
> your data grows.
>
> > What will our read latency be for 30 vs. 100? Sure we can pack 20 nodes
> > with 3TB of data each but will it take 1+s for every get?
>
> I'm not sure what kind of back-of-the-envelope calculation took you to
> 1 sec, but latency will mostly be determined by concurrency and actual
> machine load. Even if you were able to pack 20TB into one node but only
> used a tiny portion of it, you would still get sub-100ms latencies. On
> the other hand, if you have only 10GB on that node but it's getting
> hammered by 10,000 clients, then you should expect much higher
> latencies.
>
> > Will compaction run for 3 days?
>
> Which compactions? Major ones? If you don't insert new data into a
> region, it won't be major-compacted. Also, if you have that much data,
> I would set the time between major compactions to something bigger than
> 1 day. Heck, since you are doing time series, this means you'll never
> delete anything, right? So you might as well disable them.
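>
> For example, in hbase-site.xml (the property name is real; whether you
> disable major compactions entirely or just space them out is up to you):
>
>   <property>
>     <name>hbase.hregion.majorcompaction</name>
>     <!-- time between automatic major compactions, in ms; the default
>          is 86400000 (1 day), and 0 turns the periodic ones off so you
>          can trigger major compactions manually when it suits you -->
>     <value>0</value>
>   </property>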
>
> And now for the meaty part...
>
> The size of your dataset is only one part of the equation, the other
> being the traffic you would be pushing to the cluster, which I think
> wasn't covered at all in your email. Like I said previously, you can
> pack a lot of data into a single node and retrieve it really fast as
> long as concurrency is low. Another thing is how random your reading
> pattern is... can you even leverage the block cache at all? If yes,
> then you can accept more concurrency; if not, then hitting HDFS is a
> lot slower (and it's still not very good at handling many clients).
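>
> If your reads really are that cold, one thing I'd try (sketch below,
> 0.90-era client API; the names "t1" and "d" are placeholders) is to
> turn the block cache off for the cold family so it doesn't just churn:
>
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.HColumnDescriptor;
>   import org.apache.hadoop.hbase.HTableDescriptor;
>   import org.apache.hadoop.hbase.client.HBaseAdmin;
>
>   public class CreateColdTable {
>     public static void main(String[] args) throws Exception {
>       HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
>       HTableDescriptor desc = new HTableDescriptor("t1");
>       HColumnDescriptor cold = new HColumnDescriptor("d");
>       // don't pollute the block cache with blocks that won't be re-read
>       cold.setBlockCacheEnabled(false);
>       desc.addFamily(cold);
>       admin.createTable(desc);
>     }
>   }
>
> and keep the cache enabled for whatever families actually get re-read.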
>
> Unfortunately, even if you told us exactly how many queries per second
> you want to do, we'd have a hard time recommending a number of nodes.
>
> What I would recommend, then, is to benchmark it. Try to grab 5-6
> machines, load a subset of the data, generate traffic, and see how it
> behaves.
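>
> Something as dumb as this (a sketch against the 0.90-era client API;
> the table name, key format and counts are placeholders for whatever
> your real schema looks like), run from as many client processes as you
> expect concurrent readers, will tell you more than any estimate we can
> give you:
>
>   import java.util.Random;
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.client.Get;
>   import org.apache.hadoop.hbase.client.HTable;
>   import org.apache.hadoop.hbase.util.Bytes;
>
>   public class RandomReadBench {
>     public static void main(String[] args) throws Exception {
>       HTable table = new HTable(HBaseConfiguration.create(), "mytable");
>       Random rnd = new Random();
>       int gets = 10000;
>       long start = System.currentTimeMillis();
>       for (int i = 0; i < gets; i++) {
>         // assumes row keys look like "row-0" .. "row-999999"
>         table.get(new Get(Bytes.toBytes("row-" + rnd.nextInt(1000000))));
>       }
>       long ms = System.currentTimeMillis() - start;
>       System.out.println(gets + " gets in " + ms + " ms, avg " +
>           ((double) ms / gets) + " ms/get");
>       table.close();
>     }
>   }
>
> (The PerformanceEvaluation tool that ships with HBase, or YCSB, can do
> a fancier version of the same thing.)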
>
> Hope that helps,
>
> J-D
>
> On Fri, Dec 17, 2010 at 9:09 AM, Wayne <wav100@gmail.com> wrote:
> > We would like some help with cluster sizing estimates. We have 15TB of
> > currently relational data we want to store in hbase. Once that is
> > replicated to a factor of 3 and stored with secondary indexes etc., we
> > assume we will have 50TB+ of data. The data is basically data
> > warehouse-style time series data where much of it is cold, but we want
> > good read latency for access to all of it. Not memory-based latency,
> > but < 25ms latency for small chunks of data.
> >
> > How many nodes, regions, etc. are we going to need? Assuming a typical
> > 6 disk, 24GB RAM, 16 core data node, how many of these do we need to
> > sufficiently manage this volume of data? Obviously there are a million
> > "it depends", but the bigger drivers are: how much data can a node
> > handle? How long will compaction take? How many regions can a node
> > handle, and how big can those regions get? Can we really have 1.5TB of
> > data on a single node in 6,000 regions? What are the true drivers
> > between more nodes vs. bigger nodes? Do we need 30 nodes to handle our
> > 50TB of data, or 100 nodes? What will our read latency be for 30 vs.
> > 100? Sure we can pack 20 nodes with 3TB of data each, but will it take
> > 1+s for every get? Will compaction run for 3 days? How much data is
> > really "too much" on an hbase data node?
> >
> > Any help or advice would be greatly appreciated.
> >
> > Thanks
> >
> > Wayne
> >
>
