hadoop-general mailing list archives

From: Jagane Sundar <jag...@apache.org>
Subject: Which proposed distro of Hadoop, 0.20.206 or 0.22, will be better for HBase?
Date: Sun, 02 Oct 2011 23:15:49 GMT

Hello Hadoop experts,

I would like to solicit your input in answering this question. Which 
proposed distro of Hadoop, 0.20.206 or 0.22, is likely to be the better 
platform for hosting HBase?

My requirements are as follows:

1. The Hadoop release must support both HBase and MR jobs in the same 
cluster. At the very least, MR should be stable and usable for data 
extraction and transformation from external sources. Ideally, there 
should be no limits on the types of MR jobs that can be run on the HBase 
cluster. To the best of my understanding, this implies robust and stable 
append and hflush in HDFS, correct?

2. I want to scale storage independently from compute. For example, if 
my dataset is 1 PB, I expect to build a three-replica HDFS cluster of 
~150 machines with 24 TB each. As for MR and HBase compute, I may want to run 
anywhere from 50 to 200 machines. Perhaps even scaled on demand, i.e. 
bring up more machines into the MR cluster when there is more work to be 
done, and bring down some machines when there is less demand. I think 
that the MR1 Jobtracker can deal with machines coming in and going out 
well, but I am not too sure of how HBase works under such dynamic 
conditions. This example also indicates the scale that I am most 
interested in - 1 to 2 PB of data, with a dynamically varying compute 
requirement. Will my choice of 0.20.206 or 0.22 affect any of this?
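
The storage sizing in requirement 2 can be sanity-checked with a little 
arithmetic. The following is a minimal sketch, assuming decimal units 
(1 PB = 1000 TB) and ignoring non-HDFS disk overhead such as the OS, 
logs, and MR scratch space. The function names are illustrative only and 
are not part of any Hadoop API.

```python
import math


def raw_needed_tb(dataset_tb: float, replication: int) -> float:
    """Raw disk consumed by a dataset stored with the given replication."""
    return dataset_tb * replication


def cluster_raw_tb(machines: int, tb_per_machine: float) -> float:
    """Total raw disk available across the storage cluster."""
    return machines * tb_per_machine


def utilization(dataset_tb: float, replication: int,
                machines: int, tb_per_machine: float) -> float:
    """Fraction of raw cluster capacity used by the replicated dataset."""
    return (raw_needed_tb(dataset_tb, replication)
            / cluster_raw_tb(machines, tb_per_machine))


def machines_needed(dataset_tb: float, replication: int,
                    tb_per_machine: float) -> int:
    """Smallest machine count whose raw disk holds the replicated dataset."""
    return math.ceil(raw_needed_tb(dataset_tb, replication) / tb_per_machine)


if __name__ == "__main__":
    # Figures from requirement 2: 1 PB dataset, 3 replicas, 150 x 24 TB.
    print(raw_needed_tb(1000, 3))             # raw TB required
    print(cluster_raw_tb(150, 24))            # raw TB available
    print(round(utilization(1000, 3, 150, 24), 3))
    # Upper end of the stated range: 2 PB at the same replication.
    print(machines_needed(2000, 3, 24))
```

Under these assumptions, 1 PB at three replicas needs 3000 TB of raw 
disk against 3600 TB available, so the ~150 machine estimate leaves 
roughly 17% headroom. The 2 PB end of the range would need about 250 
such machines.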

3. Cloud (EC2 or some similar homebrew) friendly: I am talking about 
hosting HBase on HDFS backed by EBS volumes, not HBase on S3 accessed 
using the s3n protocol, or HBase on HDFS with blocks stored in S3 and 
accessed using the s3 protocol. There are two vectors to this: the 
storage itself, i.e. storage performance and efficiency, and the 
deployment mechanism, i.e. Whirr, Ambari, or pre-built AMIs with scripts 
cobbled together. Which release is likely to have out-of-the-box support 
for HBase on HDFS backed by EBS volumes, and for Whirr/Ambari/AMIs?

4. Support for data efficiency improvements such as Erasure Coding 

Keeping 3 replicas of big data feels like an expensive proposition. Will 
0.20.206 or 0.22 include the above patch as part of the base distro, or 
at least as an easy-to-add binary module of some kind?

5. Compatibility with future versions of Hadoop: If I make the (tenuous) 
argument that data locality does not matter much, that I have 4 Gbps 
from each node, that I have 40 Gbps up from each rack, can I separate 
the storage from the compute? What I mean is this: I may want to upgrade 
HDFS less frequently than MR or HBase. So, is there a snowball's chance 
in hell of running HDFS 0.20.206 or 0.22 against MR 0.23 and 

Thanks in advance, and cheers to a vibrant healthy Hadoop community,
