hadoop-general mailing list archives

From Eric Baldeschwieler <eri...@yahoo-inc.com>
Subject Re: [VOTE] Shall we adopt the "Defining Hadoop" page
Date Wed, 22 Jun 2011 15:41:33 GMT
I agree with this.

We need to find a middle ground that achieves three aims:

1) Makes it clear that an ASF release of Hadoop is THE APACHE HADOOP.  Jeff's manpower argument
actually reinforces this.  We need a very testable definition of what is an Apache Hadoop
release, or enforcement will be impossible because each test of the policy might require a
visit to the Supreme Court.  "Its MD5 matches the MD5 of an Apache release" is a clear definition.

2) We need a proposal for derived products that vendors feel is branding friendly.  It
should be clear enough that users understand the difference between a product that packages
Apache Hadoop (MD5 test), one that is completely open source under the Apache license (easy
to test), and one that simply uses some subset of the code under a more restrictive license
or closed source.

3) Compatibility: I think it would be great to harness all this energy around compatibility
to start a compatibility suite inside the Apache Hadoop project.  Then we could define "compatible
with Apache Hadoop" in a clear way controlled by the Apache Hadoop PMC.  With luck vendors
on both sides of the debate will be incentivized to contribute to the project this way.  Such
a suite would also prove useful to the developers of Apache Hadoop.
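
For what it's worth, the MD5 test in (1) could be as small as the sketch below.  This is only
an illustration in Python: the names md5_of_file and matches_published_md5 are hypothetical,
and it assumes the checksum published with the Apache release is available as a plain hex string.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large release tarballs need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published_md5):
    """True if the artifact's MD5 equals the published checksum.

    Assumption: `published_md5` is a hex string; whitespace and letter case
    are ignored so that differently formatted checksums still compare equal.
    """
    normalized = "".join(published_md5.split()).lower()
    return md5_of_file(path) == normalized
```

A product would pass the "packages Apache Hadoop" test only if matches_published_md5 returns
True for the artifact it ships; anything else falls into one of the other categories in (2).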


On Jun 20, 2011, at 10:09 AM, Ted Dunning wrote:

> Great summary Andrew.
> I would add one more precipitating factor here.  That is the arrival of a
> number of products which are very close to the Apache version of Hadoop but
> for which there is no good and widely accepted terminology that gives proper
> credit to their lineage while making clear the distinction from bit-for-bit
> copies of official Apache releases.
> Some products are analogous to hive, pig or hbase in that they are
> independent systems that run ON hadoop (or close equivalents).  These have
> no terminology problem because these products aren't hadoop, but rather use
> hadoop.
> Other products contain Hadoop internally as a critical component but do not
> necessarily expose Hadoop capabilities to the end user (I can't name these
> products, but they exist).  These products have little nomenclatural
> difficulty because the powered-by-Hadoop description fits very well.
> The products with the terminology problem are the ones that add either
> curation and packaging (Cloudera) or substantial additional
> performance-enhancing components (MapR).  These products are upwardly compatible with
> Apache Hadoop in that programs that run on Hadoop will very probably run on
> these Hadoop-like systems.  The problem is that there is no good term for
> these products.  They may even contain components that are bit-for-bit
> identical to the same components for Apache releases.  It is fair to say
> that these are not Apache released software, but it is also fair to say that
> there ought to be a better name for the class of these products.
> On Mon, Jun 20, 2011 at 4:39 PM, Andrew Purtell <apurtell@apache.org> wrote:
>> Hadoop I think needs to be more careful. What triggered this discussion is
>> the arrival of new players releasing products they call Hadoop but
>> containing severe changes the community, by way of the ASF umbrella we all
>> work under, had nothing to do with designing or developing. And some of
>> these are being open sourced as a Hadoop. There is no Linus here. Which of
>> these is _the_ Hadoop? As a would-be contributor, which should I select?
