hadoop-general mailing list archives

From <Milind.Bhandar...@emc.com>
Subject Re: Which proposed distro of Hadoop, 0.20.206 or 0.22, will be better for HBase?
Date Wed, 05 Oct 2011 22:55:03 GMT

I think you have forgotten one major deciding factor:

Which version is *your* vendor committed to support?

If you are at the same place where you were the last time we met, you have
no other choice but to go with 0.20.206. It's in the contract! :-)

- Milind

Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and
do not necessarily represent the views of any organization, past or
present, the author might be affiliated with.)

On 10/2/11 4:57 PM, "Jagane Sundar" <jagane@apache.org> wrote:

>Hello Hadoop experts,
>I would like to solicit your input in answering this question. Which
>proposed distro of Hadoop, 0.20.206 or 0.22, is likely to be the better
>platform for hosting HBase?
>My requirements are as follows:
>1. The Hadoop must support both HBase and MR jobs in the same cluster. At
>the very least, MR should be stable and usable for data extraction and
>transformation from external sources. Ideally, there should be no limits on
>the types of MR jobs that can be run on the HBase cluster. To the best of my
>understanding, this implies robust and stable Append and Hflush in HDFS.
>2. I want to scale storage independently from compute. For example, if my
>dataset is 1PB, I expect to make a three replica HDFS cluster of ~150
>machines with 24TB each. As for MR and HBase compute, I may want to run
>anywhere from 50 to 200 machines. Perhaps even scaled on demand, i.e. bring
>up more machines into the MR cluster when there is more work to be done,
>bring down some machines when there is less demand. I think that the MR1
>JobTracker can deal with machines coming in and going out well, but I am not
>too sure of how HBase works under such dynamic conditions. This example
>indicates the scale that I am most interested in - 1 to 2 PB of data, with a
>dynamically varying compute requirement. Will my choice of 0.20.206 or 0.22
>affect any of this?
>3. Cloud (EC2 or some similar homebrew) friendly: I am talking about
>HBase in HDFS on EBS volumes, not HBase on S3 accessed using the s3n
>protocol, or HBase on HDFS with blocks stored in S3 and accessed using the
>s3 protocol. There are two vectors to this - the storage itself, i.e.
>storage performance and efficiency, and the deployment mechanism - whirr,
>Ambari or pre-built AMIs with scripts cobbled together. Which release is
>likely to have out-of-the-box support for HBase on HDFS in EBS volumes,
>for whirr/Ambari/AMIs?
>4. Support for data efficiency improvements such as Erasure Coding -
>Keeping 3 replicas of big data feels like an expensive proposition. Will
>0.20.206 or 0.22 include the above patch as part of the base distro, or at
>least as an easy to add binary module of some kind?
>5. Compatibility with future versions of Hadoop: If I make the (tenuous)
>argument that data locality does not matter much, that I have 4 Gbps from
>each node, that I have 40 Gbps up from each rack, can I separate the storage
>from the compute? What I mean is this: I may want to upgrade HDFS less
>frequently than MR or HBase. So, is there a snowball's chance in hell of
>running HDFS 0.20.206 or 0.22 against MR 0.23 and
>Thanks in advance, and cheers to a vibrant healthy Hadoop community,
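
For readers checking the figures in requirements 2 and 5 of the quoted message, the arithmetic can be sketched as below. This is only a back-of-the-envelope illustration, not part of either author's message. It assumes decimal units (1 PB = 1000 TB) and that every listed terabyte is usable by HDFS; the function names are hypothetical.

```python
# Sizing sketch for the figures quoted in requirements 2 and 5.
# Assumptions (not in the original message): decimal units (1 PB = 1000 TB),
# and all listed disk capacity is usable by HDFS.

def raw_storage_needed_tb(dataset_tb, replication):
    """Raw disk needed to hold the dataset at the given replication factor."""
    return dataset_tb * replication

def cluster_raw_capacity_tb(machines, tb_per_machine):
    """Total raw disk across the storage cluster."""
    return machines * tb_per_machine

def nodes_at_full_rate(rack_uplink_gbps, node_link_gbps):
    """Number of nodes that can send off-rack at full node speed at once."""
    return rack_uplink_gbps / node_link_gbps

# Figures taken from the quoted message.
needed_tb = raw_storage_needed_tb(dataset_tb=1000, replication=3)
capacity_tb = cluster_raw_capacity_tb(machines=150, tb_per_machine=24)
headroom_tb = capacity_tb - needed_tb
fill_ratio = needed_tb / capacity_tb
full_rate_nodes = nodes_at_full_rate(rack_uplink_gbps=40, node_link_gbps=4)

print(f"needed: {needed_tb} TB, capacity: {capacity_tb} TB")
print(f"headroom: {headroom_tb} TB, fill ratio: {fill_ratio:.2f}")
print(f"nodes per rack at full uplink rate: {full_rate_nodes:.0f}")
```

Under these assumptions, a 1 PB dataset at three replicas needs 3000 TB of raw disk against 3600 TB available on 150 machines, so the proposed cluster would be about 83% full. The 40 Gbps rack uplink matches ten nodes sending at 4 Gbps each; a rack with more nodes than that would be oversubscribed for off-rack traffic.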
