hbase-user mailing list archives

From Sean <seanatpur...@hotmail.com>
Subject Several questions about running HBase on EC2
Date Wed, 21 Apr 2010 05:58:41 GMT

Hi folks,
I am thinking of building a test environment for an HBase cluster on EC2, for the following
reasons:
1) To get reference throughput and read-latency numbers for HBase clusters of different sizes.
2) To test various schema designs and their performance implications for scan and M/R
operations.
-- Once we have results from 1 and 2, we can decide how to build the actual physical cluster.
We don't want to build a physical cluster first because, as I understand it, a 4-node cluster
does not make much sense for a real load test (we do have a rough estimate of how big our
data will be).
-- At the same time, I hope to learn enough about high-availability solutions while
experimenting on 1 and 2.
With that motivation in mind, I'd like to ask several questions:
a) After reading http://aws.amazon.com/ec2/instance-types/, I believe I should select "Standard
Instances: Extra Large Instance" as my instance type. It might seem that I should pick the
"High-Memory Instances" family, since HBase is a memory-hungry application, but "High-Memory
Instances" probably do not fit my testing environment: the disk space does not look adequate.
Note: after testing in this environment, I will need to use the benchmark numbers as a
reference for building my actual cluster.

b) I understand Cloudera provides an AMI, but can I build my own? If so, can someone give me
a pointer? I have successfully built an HBase server on a 4-machine cluster; how much further
effort (please give me an estimate if you would) would I need to put in to achieve this?

c) Here is my testing environment:
   -- I build an HBase cluster for serving
   -- then I build several clients for issuing workload ops
How can I learn the high-availability lessons around this? (I know most of the high-level
ideas, but as we all know, the subtle issues come from implementation details, especially in
a distributed system.)

Thanks for any suggestions!
