hadoop-general mailing list archives

From Jagane Sundar <jag...@apache.org>
Subject Re: Which proposed distro of Hadoop, 0.20.206 or 0.22, will be better for HBase?
Date Wed, 05 Oct 2011 23:20:35 GMT
Hello Milind,

A large part of why I sent this email out was to initiate a discussion of
the priority of specific features in a Hadoop distro.

For example, if we had a distro with support for the following features:

1. HBase support, i.e. working, scale-tested append and hflush in HDFS
2. Built-in support for the cloud. (Whirr is interesting, Ambari more so,
but both fall short.)
3. Assumption that 10GbE is around the corner (really, this time), and hence
storage locality is irrelevant
4. Storage efficiency is important. Alternatives to three-replica HDFS, such
as erasure coding, should be first-class citizens in this distro.
5. H/A for the NN

Such a distro would be an outstanding thing for the Hadoop community. I
think 0.20.20x is the closest to this, but I am not sure.

My hope is that this discussion will get some input from users of Hadoop. I
may be wrong, as this may be the wrong forum for this discussion. (The only
thing I really accomplished was to evoke a hurried and semi-infuriated
Sunday afternoon private email response from some key players in the Hadoop
community).

My ultimate goal is to influence the product managers at Hadoop startups and
established companies to assign high priorities to these items.

In short, I don't own the whip, the buggy, or the horse ... but I am trying
to crack the whip. :-)

Milind - I do look forward to your input as to the importance of these
features, and whether these are feasible in one of the source branches in
the near future.

Cheers,
Jagane

On Wed, Oct 5, 2011 at 3:55 PM, <Milind.Bhandarkar@emc.com> wrote:

> Jagane,
>
> I think you have forgotten one major deciding factor:
>
> Which version is *your* vendor committed to support ?
>
> If you are at the same place where you were the last time we met, you have
> no other choice but to go with 0.20.206. It's in the contract ! :-)
>
> - Milind
>
> ---
> Milind Bhandarkar
> Greenplum Labs, EMC
> (Disclaimer: Opinions expressed in this email are those of the author, and
> do not necessarily represent the views of any organization, past or
> present, the author might be affiliated with.)
>
>
>
> On 10/2/11 4:57 PM, "Jagane Sundar" <jagane@apache.org> wrote:
>
> >Hello Hadoop experts,
> >
> >I would like to solicit your input in answering this question. Which
> >proposed distro of Hadoop, 0.20.206 or 0.22, is likely to be the better
> >platform for hosting HBase?
> >
> >My requirements are as follows:
> >
> >1. The Hadoop distro must support both HBase and MR jobs in the same
> >cluster. At the very least, MR should be stable and usable for data
> >extraction and transformation from external sources. Ideally, there
> >should be no limits on the types of MR jobs that can be run on the HBase
> >cluster. To the best of my understanding, this implies robust and stable
> >append and hflush in HDFS, correct?
> >
> >2. I want to scale storage independently from compute. For example, if my
> >dataset is 1 PB, I expect to build a three-replica HDFS cluster of ~150
> >machines with 24 TB each. As for MR and HBase compute, I may want to run
> >anywhere from 50 to 200 machines, perhaps even scaled on demand, i.e.
> >bring up more machines into the MR cluster when there is more work to be
> >done, and bring down some machines when there is less demand. I think
> >that the MR1 JobTracker can deal well with machines coming and going, but
> >I am not too sure how HBase works under such dynamic conditions. This
> >example also indicates the scale that I am most interested in: 1 to 2 PB
> >of data, with a dynamically varying compute requirement. Will my choice
> >of 0.20.206 or 0.22 affect any of this?
> >
> >3. Cloud-friendly (EC2 or some similar homebrew): I am talking about
> >hosting HBase on HDFS on EBS volumes, not HBase on S3 accessed using the
> >s3n protocol, or HBase on HDFS with blocks stored in S3 and accessed
> >using the s3 protocol. There are two vectors to this: the storage itself,
> >i.e. storage performance and efficiency, and the deployment mechanism
> >(Whirr, Ambari, or pre-built AMIs with scripts cobbled together). Which
> >release is likely to have out-of-the-box support for HBase on HDFS on EBS
> >volumes, and for Whirr/Ambari/AMIs?
> >
> >4. Support for data efficiency improvements such as erasure coding
> >(https://issues.apache.org/jira/browse/HDFS-503). Keeping 3 replicas of
> >big data feels like an expensive proposition. Will 0.20.206 or 0.22
> >include the above patch as part of the base distro, or at least as an
> >easy-to-add binary module of some kind?
> >
> >5. Compatibility with future versions of Hadoop: If I make the (tenuous)
> >argument that data locality does not matter much, that I have 4 Gbps from
> >each node and 40 Gbps up from each rack, can I separate the storage from
> >the compute? What I mean is this: I may want to upgrade HDFS less
> >frequently than MR or HBase. So, is there a snowball's chance in hell of
> >running HDFS 0.20.206 or 0.22 against MR 0.23 and
> >HBase-whatever-comes-next-year?
> >
> >Thanks in advance, and cheers to a vibrant healthy Hadoop community,
> >Jagane
>
>
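
The storage sizing in requirement 2 of the original message (1 PB of data, three replicas, ~150 machines with 24 TB each) and the cost concern in requirement 4 can be checked with a little arithmetic. The sketch below is only an illustration: it assumes decimal units (1 PB = 1000 TB), ignores filesystem and temporary-space overhead, and uses a hypothetical erasure code with 10 data blocks and 4 parity blocks, which is not a layout stated anywhere in this thread. The function names are made up for the example.

```python
import math

def raw_tb_replicated(logical_tb, replicas):
    """Raw disk (TB) needed to store logical_tb with plain replication."""
    return logical_tb * replicas

def raw_tb_erasure_coded(logical_tb, data_blocks, parity_blocks):
    """Raw disk (TB) needed when every data_blocks blocks carry parity_blocks parity blocks."""
    return logical_tb * (data_blocks + parity_blocks) / data_blocks

def nodes_needed(raw_tb, tb_per_node):
    """Smallest number of machines whose combined disk holds raw_tb."""
    return math.ceil(raw_tb / tb_per_node)

LOGICAL_TB = 1000    # 1 PB dataset, decimal units (assumption)
TB_PER_NODE = 24     # from the original message
PLANNED_NODES = 150  # from the original message

# Three-replica HDFS, as described in the original message.
replicated_raw = raw_tb_replicated(LOGICAL_TB, 3)
replicated_nodes = nodes_needed(replicated_raw, TB_PER_NODE)
headroom_tb = PLANNED_NODES * TB_PER_NODE - replicated_raw

# Hypothetical (10 data + 4 parity) erasure code, for comparison only.
ec_raw = raw_tb_erasure_coded(LOGICAL_TB, 10, 4)
ec_nodes = nodes_needed(ec_raw, TB_PER_NODE)

print(f"3 replicas:   {replicated_raw} TB raw, {replicated_nodes} nodes minimum, "
      f"{headroom_tb} TB headroom on {PLANNED_NODES} nodes")
print(f"10+4 erasure: {ec_raw} TB raw, {ec_nodes} nodes minimum")
```

Under these assumptions the ~150-machine figure leaves some headroom over the bare minimum for three replicas, and an erasure-coded layout would need well under half the raw disk, which is the motivation behind requirement 4.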
