Subject: Re: Defining Hadoop Compatibility -revisiting-
From: "M. C. Srivas" <mcsrivas@gmail.com>
To: general@hadoop.apache.org
Date: Wed, 11 May 2011 19:26:03 -0700

While the HCK is a great idea for quickly checking whether an implementation is "compliant", we still need a written specification to define what compliance means, something akin to a set of RFCs, or a set of documents like the IEEE POSIX specifications.

For example, the POSIX.1c pthreads API has a written document that specifies all the function calls, input parameters, return values, and error codes. It clearly indicates what any POSIX-compliant threads package needs to support, and which vendor-specific, non-portable extensions one can use at one's own risk.

Currently we have two sets of APIs in the DFS and Map/Reduce layers, and the specification can be extracted only by reading the code, or (where the code is non-trivial) by writing really bizarre test programs to probe the corner cases.

Further, the interaction between a mix of the old and new APIs is not specified anywhere. Such specifications are vitally important when implementing libraries like Cascading, Mahout, etc. For example, an application might open a file using the new API and pass that stream into a library that manipulates it using some of the old API. What, then, is the expected state of the stream when the library call returns?
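To make that concrete, here is a minimal sketch, assuming a release that has the new FileContext API; the path is hypothetical, and the "library" is reduced to a single positioned read:

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileContext;
    import org.apache.hadoop.fs.Path;

    public class StreamStateQuestion {
      public static void main(String[] args) throws Exception {
        // The application opens the file through the new API...
        FileContext fc = FileContext.getFileContext();
        FSDataInputStream in = fc.open(new Path("/data/input.seq")); // hypothetical path

        in.seek(4096); // ...and positions the stream.

        // A library then issues an old-style positioned read
        // (PositionedReadable) against the same stream object.
        byte[] header = new byte[16];
        in.readFully(0, header); // read 16 bytes at offset 0

        // Underspecified: is getPos() still 4096 here, or did the
        // positioned read move the seek pointer? Nothing written down
        // pins this behaviour; a spec would.
        System.out.println("pos after library call = " + in.getPos());

        in.close();
      }
    }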
Sanjay Radia @ Y! has already started specifying some of the DFS APIs to nail such things down. There's a similar good effort in the Map/Reduce and Avro spaces, but it seems to have stalled somewhat. We should continue it.

Doing such specs would be a great service to the community and the users of Hadoop. It provides them with (a) clear-cut docs on how to use the Hadoop APIs, and (b) a wider choice of Hadoop implementations, by freeing them from vendor lock-in.

Once we have such a specification, the HCK becomes meaningful (since the HCK itself will be buggy initially).

On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar wrote:

> I think it's time to separate out the functional tests as a "Hadoop
> Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under ASL
> 2.0. Then "certification" would mean "passes 100% of the HCK test suite."
>
> - milind
> --
> Milind Bhandarkar
> mbhandarkar@linkedin.com
>
> On 5/11/11 2:24 PM, "Eric Baldeschwieler" wrote:
>
> >This is a really interesting topic! I completely agree that we need to
> >get ahead of this.
> >
> >I would be really interested in learning of any experience other Apache
> >projects, such as the HTTP Server or Tomcat, have with these issues.
> >
> >---
> >E14 - typing on glass
> >
> >On May 10, 2011, at 6:31 AM, "Steve Loughran" wrote:
> >
> >> Back in Jan 2011, I started a discussion about how to define Apache
> >> Hadoop compatibility:
> >>
> >> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%3C4D46B6AD.2020802@apache.org%3E
> >>
> >> I am now reading the EMC HD "Enterprise Ready" Apache Hadoop datasheet:
> >>
> >> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
> >>
> >> It claims that their implementations are 100% compatible, even though
> >> the Enterprise edition uses a C filesystem. It also claims that both
> >> their software releases contain "Certified Stacks", without defining
> >> what Certified means or who does the certification, only that it is an
> >> improvement.
> >>
> >> I think we should revisit this issue before people with their own
> >> agendas define for us what compatibility with Apache Hadoop means.
> >>
> >> Licensing
> >> -Use of the Hadoop codebase must follow the Apache License:
> >> http://www.apache.org/licenses/LICENSE-2.0
> >> -Plug-in components that are dynamically linked to (filesystems and
> >> schedulers) don't appear to be derivative works, on my reading of this.
> >>
> >> Naming
> >> -This is something for branding@apache; they will have their opinions.
> >> The key one is that the name "Apache Hadoop" must get used, and it's
> >> important to make clear it is a derivative work.
> >> -I don't think you can claim to have a distribution/fork/version of
> >> Apache Hadoop if you swap out big chunks of it for alternate
> >> filesystems, MR engines, etc.
> >> Some description of this is needed, e.g.
> >> "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ".
> >>
> >> Compatibility
> >> -The definition of the Hadoop interfaces and classes is the Apache
> >> source tree.
> >> -The definition of the semantics of the Hadoop interfaces and classes
> >> is the Apache source tree, including the test classes.
> >> -The verification that the actual semantics of an Apache Hadoop
> >> release match the expected semantics is that current and future tests
> >> pass.
> >> -Bug reports can highlight incompatibility with the expectations of
> >> community users, and once incorporated into tests they form part of
> >> the compatibility testing.
> >> -Vendors can claim, and even certify, that their derivative works are
> >> compatible with other versions of their derivative works, but they
> >> cannot claim compatibility with Apache Hadoop unless their code passes
> >> the tests and is consistent with the bug reports marked as "by
> >> design". Perhaps we should have tests that verify each of these "by
> >> design" bug reports, to make them more formal.
> >>
> >> Certification
> >> -I have no idea what this means in EMC's case; they just say
> >> "Certified".
> >> -As we don't do any certification ourselves, it would seem impossible
> >> for us to certify that any derivative work is compatible.
> >> -It may be best to state that nobody can certify their derivative as
> >> "compatible with Apache Hadoop" unless it passes all the current test
> >> suites.
> >> -And to require that anyone who declares compatibility define what
> >> they mean by it.
> >>
> >> This is a good argument for getting more functional tests out there:
> >> whoever has more functional tests needs to get them into a test module
> >> that can be used to test real deployments.
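For a concrete sense of what an HCK-style functional test could look like, here is a hypothetical JUnit sketch of a contract test that pins down the positioned-read behaviour from the example above. The class name, the fixture path, and the asserted behaviour are all assumptions; deciding what the correct behaviour actually is would be the job of the written spec, not of the test.

    import static org.junit.Assert.assertEquals;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.junit.Test;

    // Hypothetical HCK contract test: any implementation claiming
    // compatibility must preserve the stream position across a
    // positioned read (the asserted behaviour is an assumption here).
    public class TestPositionedReadContract {

      @Test
      public void positionedReadDoesNotMoveSeekPointer() throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path("/hck/contract-data")); // hypothetical fixture
        try {
          in.seek(4096);
          byte[] buf = new byte[16];
          in.readFully(0, buf); // positioned read at offset 0
          assertEquals("positioned read must not move the seek pointer",
                       4096, in.getPos());
        } finally {
          in.close();
        }
      }
    }

A suite of such tests, run against a real deployment rather than an in-process mini cluster, is the kind of test module Steve's last point asks for.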