hbase-user mailing list archives

From Andrew Purtell <apurt...@yahoo.com>
Subject Re: Question on cluster capacity planning
Date Thu, 15 Jan 2009 23:01:19 GMT
Hi Michael,

> From: Ryan Rawson
> > Michael Dagaev wrote:
> > 
> >   How did you plan your Hbase cluster capacity ?
> > Currently we run a cluster of 4 large EC2 instances
> I have found with my tests that 3 nodes is wholly
> insufficient.

I second that, but how much you need is load-dependent, and
there are no clear formulas that I am aware of for plugging
in your estimated load and getting back a suitable
configuration estimate. Perhaps someday. I think not enough
operational experience is available at this stage. So for
now it's trial and error. I would start with 4 nodes, then
increase as necessary (by 2 or 4 datanode/regionserver
pairs each step) to spread load if you encounter DFS
errors or other errors related to loading. Such errors are
pretty easy to spot: look for errors regarding blocks not
found, replication failures, heartbeat timeouts, lease
expiration, and the like. Generally these have thread
starvation from overloading as a root cause. One telltale
sign, from HBase at least, is messages of the form "We slept
XXXXXX ms, ten times longer than expected". I give this
advice assuming that you will be running DFS, HBase, and
task trackers (and therefore mapreduce mappers and reducers)
concurrently side by side.
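As an aside (not from the original message), those telltale warnings are
easy to count with a quick grep over the region server logs. The log
directory and filename pattern below are assumptions that depend on your
install; the sample log line is fabricated purely for illustration:

```shell
# Hypothetical log location -- adjust HBASE_LOG_DIR to your installation.
HBASE_LOG_DIR=/tmp/hbase-logs-demo
mkdir -p "$HBASE_LOG_DIR"

# Fabricated sample line mimicking the warning format described above.
echo 'WARN org.apache.hadoop.hbase.Chore: We slept 65000ms, ten times longer than expected' \
  > "$HBASE_LOG_DIR/hbase-demo-regionserver.log"

# Count the thread-starvation warnings across all region server logs.
grep -c "slept" "$HBASE_LOG_DIR"/hbase-*-regionserver*.log
```

A count that climbs under load is a good hint that the node is overloaded
and it is time to add datanode/regionserver pairs.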

The extra large instance type is required for running HDFS
and HBase daemons side by side. Both are heap intensive
and require about 1G RAM per daemon just to start.
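For reference (my addition, not Andy's), the per-daemon heap is set in each
project's env script; 1000 MB was the usual default in Hadoop and HBase of
this era, so the lines below mostly make the implicit sizing explicit:

```shell
# In conf/hadoop-env.sh -- heap (in MB) given to each Hadoop daemon
export HADOOP_HEAPSIZE=1000

# In conf/hbase-env.sh -- heap (in MB) given to each HBase daemon
export HBASE_HEAPSIZE=1000
```

With HDFS and HBase daemons side by side, budget at least 1 GB of RAM per
daemon on the node, which is why the extra large instance type is needed.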

Also, regarding EC2, it probably goes without saying, but
DO NOT use the S3 filesystem to back your HBase tables.
Use local HDFS + HBase on the nodes, and use Hadoop distcp
to back up to and restore from S3 if you need it.
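A sketch of that backup path (my illustration, not from the thread): the
namenode address, port, bucket name, and the /hbase root directory are all
assumptions to adapt; s3n:// was the customary S3 scheme in Hadoop at the
time.

```shell
# Export: copy the HBase root directory out of local HDFS into S3.
hadoop distcp hdfs://namenode:9000/hbase s3n://my-backup-bucket/hbase-backup

# Import: copy a backup from S3 back into local HDFS.
hadoop distcp s3n://my-backup-bucket/hbase-backup hdfs://namenode:9000/hbase
```

Note that copying the files under a running HBase can capture an
inconsistent snapshot; disabling the tables (or stopping HBase) before the
export is the safer route.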
   - Andy

