hadoop-general mailing list archives

From Jagane Sundar <jag...@apache.org>
Subject Which proposed distro of Hadoop, 0.20.206 or 0.22, will be better for HBase?
Date Sun, 02 Oct 2011 23:57:34 GMT
Hello Hadoop experts,

I would like to solicit your input in answering this question. Which
proposed distro of Hadoop, 0.20.206 or 0.22, is likely to be the better
platform for hosting HBase?

My requirements are as follows:

1. The Hadoop release must support both HBase and MR jobs in the same cluster. At
the very least, MR should be stable and usable for data extraction and
transformation from external sources. Ideally, there should be no limits on
the types of MR jobs that can be run on the HBase cluster. To the best of my
understanding, this implies robust and stable Append and Hflush in HDFS,
correct?

2. I want to scale storage independently from compute. For example, if my
dataset is 1PB, I expect to make a three replica HDFS cluster of ~150
machines with 24TB each. As for MR and HBase compute, I may want to run
anywhere from 50 to 200 machines. Perhaps even scaled on demand, i.e. bring
up more machines into the MR cluster when there is more work to be done, and
bring down some machines when there is less demand. I think that the MR1
Jobtracker can deal with machines coming in and going out well, but I am not
too sure of how HBase works under such dynamic conditions. This example also
indicates the scale that I am most interested in - 1 to 2 PB of data, with a
dynamically varying compute requirement. Will my choice of 0.20.206 or 0.22
affect any of this?

3. Cloud (EC2 or some similar homebrew) friendly: I am talking about hosting
HBase in HDFS on EBS volumes, not HBase on s3 accessed using the s3n
protocol, or HBase on HDFS with blocks stored in S3 and accessed using the
s3 protocol. There are two vectors to this - the storage itself, i.e.
storage performance and efficiency, and the deployment mechanism - whirr or
Ambari or pre-built AMIs with scripts cobbled together. Which release is
likely to have out-of-the-box support for HBase on HDFS in EBS volumes, and
for whirr/Ambari/AMIs?

4. Support for data efficiency improvements such as Erasure Coding -
https://issues.apache.org/jira/browse/HDFS-503.
Keeping 3 replicas of big data feels like an expensive proposition. Will
0.20.206 or 0.22 include the above patch as part of the base distro, or at
least as an easy to add binary module of some kind?

5. Compatibility with future versions of Hadoop: If I make the (tenuous)
argument that data locality does not matter much, that I have 4 Gbps from
each node, that I have 40 Gbps up from each rack, can I separate the storage
from the compute? What I mean is this: I may want to upgrade HDFS less
frequently than MR or HBase. So, is there a snowball's chance in hell of
running HDFS 0.20.206 or 0.22 against MR 0.23 and
HBase-whatever-comes-next-year?
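
As a rough check of the sizing in item 2, here is a back-of-the-envelope sketch. It assumes decimal units (1 PB = 1000 TB) and ignores space needed for MR intermediate data, HBase compactions, and OS overhead, so treat the results as a lower bound. The function names are purely illustrative.

```python
# Back-of-the-envelope HDFS sizing for the figures quoted in item 2.
# Assumption: decimal units, 1 PB = 1000 TB. No allowance is made for
# MR scratch space, HBase compaction overhead, or OS/reserved space.
TB_PER_PB = 1000


def raw_storage_tb(dataset_pb, replicas):
    """Raw disk needed to hold the dataset at the given replication factor."""
    return dataset_pb * TB_PER_PB * replicas


def cluster_capacity_tb(machines, tb_per_machine):
    """Total raw disk across the storage cluster."""
    return machines * tb_per_machine


def min_machines(dataset_pb, replicas, tb_per_machine):
    """Smallest machine count whose raw disk covers the replicated dataset."""
    needed = raw_storage_tb(dataset_pb, replicas)
    return -(-needed // tb_per_machine)  # integer ceiling division


if __name__ == "__main__":
    needed = raw_storage_tb(1, 3)             # 3000 TB for 1 PB x 3 replicas
    capacity = cluster_capacity_tb(150, 24)   # 3600 TB for 150 x 24 TB
    print("needed TB:", needed)
    print("capacity TB:", capacity)
    print("headroom TB:", capacity - needed)
    print("min machines for 1 PB:", min_machines(1, 3, 24))  # 125
    print("min machines for 2 PB:", min_machines(2, 3, 24))  # 250
```

With these assumptions, ~150 machines leave about 600 TB of headroom for 1 PB, while the 2 PB end of the range would need at least 250 machines of the same size.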

Thanks in advance, and cheers to a vibrant healthy Hadoop community,
Jagane
