hadoop-general mailing list archives

From "M. C. Srivas" <mcsri...@gmail.com>
Subject Re: Defining Hadoop Compatibility -revisiting-
Date Thu, 12 May 2011 02:26:03 GMT
While the HCK is a great idea for quickly checking whether an implementation
is "compliant", we still need a written specification to define what
compliance means, something akin to a set of RFCs, or a set of docs like the
IEEE POSIX specifications.

For example, the POSIX.1c pthreads API has a written document that specifies
all the function calls, input params, return values, and error codes. It
clearly indicates what any POSIX-compliant threads package needs to support,
and which vendor-specific, non-portable extensions one can use at one's own
risk.
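
To make this concrete, here is roughly what such a spec entry could look
like for a single Hadoop call, written man-page style against the existing
Seekable.seek() signature. The "specified behaviour" below is a strawman of
mine, not anything agreed:

    /**
     * Seek to the given offset from the start of the file.
     *
     * Specified behaviour (strawman):
     *  - on success, getPos() returns pos and the next read() proceeds
     *    from that offset;
     *  - a seek past end-of-file fails with an IOException (or should
     *    it be EOFException? exactly the corner a spec must pin down);
     *  - the position after a failed seek is undefined unless the spec
     *    says otherwise.
     *
     * @param pos byte offset from the start of the file, pos >= 0
     * @throws IOException if the seek cannot be satisfied
     */
    public void seek(long pos) throws IOException;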

Currently we have two sets of APIs in the DFS and Map/Reduce layers, and the
specification can be extracted only by reading the code, or (where the code
is non-trivial) by writing really bizarre test programs to examine corner
cases. Further, the interaction between a mix of the old and new APIs is not
specified anywhere. Such specifications are vitally important when
implementing libraries like Cascading, Mahout, etc. For example, an
application might open a file using the new API, and pass that stream into a
library that manipulates the stream using some of the old API ... what then
is the expected state of the stream when the library call returns?
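
A minimal sketch of the mixed-API scenario I mean, with the path and the
library routine made up for illustration (only the Hadoop classes and calls
are real):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileContext;
    import org.apache.hadoop.fs.Path;

    public class MixedApiQuestion {
      public static void main(String[] args) throws Exception {
        // Open the file through the newer API (FileContext)...
        FileContext fc = FileContext.getFileContext(new Configuration());
        FSDataInputStream in = fc.open(new Path("/data/part-00000"));

        // ...then hand the stream to a library written against the older
        // FileSystem-era idioms (stand-in for Cascading, Mahout, ...).
        oldStyleLibraryCall(in);

        // Is the stream position now shared, restored, or undefined?
        // Today the only way to know is to read the code.
        System.out.println("pos after library call: " + in.getPos());
        in.close();
      }

      // Hypothetical library routine that seeks and reads internally.
      static void oldStyleLibraryCall(FSDataInputStream in) throws Exception {
        in.seek(128);
        in.read();
      }
    }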

Sanjay Radia @ Y! has already started specifying some of the DFS APIs to
nail such things down. There are similar good efforts in the Map/Reduce and
Avro spaces, but they seem to have stalled somewhat. We should continue them.

Doing such specs would be a great service to the community and the users of
Hadoop. It would give them
   (a) clear-cut docs on how to use the Hadoop APIs
   (b) a wider choice of Hadoop implementations, by freeing them from vendor
lock-in
Once we have such a specification, the HCK becomes meaningful (since the HCK
itself will be buggy initially).
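
As a strawman, an HCK entry could then be an ordinary JUnit test pinned to a
clause of the written spec. Everything below is illustrative: the spec
clause "FS-3.2" is invented, and the filesystem under test would be whatever
the vendor wires in through the configuration:

    import static org.junit.Assert.fail;

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.junit.Test;

    public class HckSeekTest {

      // Invented clause FS-3.2: a seek past end-of-file must fail.
      @Test
      public void seekPastEofMustThrow() throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/hck/empty");
        fs.create(p).close();                 // zero-byte file

        FSDataInputStream in = fs.open(p);
        try {
          in.seek(1);                         // one byte past EOF
          fail("FS-3.2: seek past EOF must raise IOException");
        } catch (IOException expected) {
          // compliant behaviour
        } finally {
          in.close();
          fs.delete(p, true);
        }
      }
    }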

On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar <mbhandarkar@linkedin.com> wrote:

> I think it's time to separate out functional tests as a "Hadoop
> Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under ASL
> 2.0. Then "certification" would mean "Passes 100% of the HCK testsuite."
> - milind
> --
> Milind Bhandarkar
> mbhandarkar@linkedin.com
> On 5/11/11 2:24 PM, "Eric Baldeschwieler" <eric14@yahoo-inc.com> wrote:
> >This is a really interesting topic!  I completely agree that we need to
> >get ahead of this.
> >
> >I would be really interested in learning of any experience other Apache
> >projects, such as the HTTP Server or Tomcat, have had with these issues.
> >
> >---
> >E14 - typing on glass
> >
> >On May 10, 2011, at 6:31 AM, "Steve Loughran" <stevel@apache.org> wrote:
> >
> >>
> >> Back in Jan 2011, I started a discussion about how to define Apache
> >> Hadoop Compatibility:
> >>
> >>
> >> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%3C4D46B6AD.2020802@apache.org%3E
> >>
> >> I am now reading EMC HD "Enterprise Ready" Apache Hadoop datasheet
> >>
> >>
> >> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
> >>
> >> It claims that their implementations are 100% compatible, even though
> >> the Enterprise edition uses a C filesystem. It also claims that both
> >> their software releases contain "Certified Stacks", without defining
> >> what Certified means, or who does the certification -only that it is an
> >> improvement.
> >>
> >>
> >> I think we should revisit this issue before people with their own
> >> agendas define what compatibility with Apache Hadoop is for us.
> >>
> >>
> >> Licensing
> >> -Use of the Hadoop codebase must follow the Apache License
> >> http://www.apache.org/licenses/LICENSE-2.0
> >> -plug-in components that are dynamically linked to (filesystems and
> >> schedulers) don't appear to be derivative works, on my reading of this.
> >>
> >> Naming
> >>  -this is something for branding@apache; they will have their opinions.
> >> The key one is that the name "Apache Hadoop" must be used, and it's
> >> important to make clear it is a derivative work.
> >>  -I don't think you can claim to have a Distribution/Fork/Version of
> >> Apache Hadoop if you swap out big chunks of it for alternate
> >> filesystems, MR engines, etc. Some description of this is needed:
> >> "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ"
> >>
> >> Compatibility
> >>  -the definition of the Hadoop interfaces and classes is the Apache
> >> Source tree,
> >>  -the definition of semantics of the Hadoop interfaces and classes is
> >> the Apache Source tree, including the test classes.
> >>  -the verification that the actual semantics of an Apache Hadoop
> >> release is compatible with the expected semantics is that current and
> >> future tests pass
> >>  -bug reports can highlight incompatibility with expectations of
> >> community users, and once incorporated into tests form part of the
> >> compatibility testing
> >>  -vendors can claim and even certify their derivative works as
> >> compatible with other versions of their derivative works, but cannot
> >> claim compatibility with Apache Hadoop unless their code passes the
> >> tests and is consistent with the bug reports marked as "by design".
> >> Perhaps we should have tests that verify each of these "by design"
> >> bugreps to make them more formal.
> >>
> >> Certification
> >>  -I have no idea what this means in EMC's case; they just say
> >> "Certified"
> >>  -As we don't do any certification ourselves, it would seem impossible
> >> for us to certify that any derivative work is compatible.
> >>  -It may be best to state that nobody can certify their derivative as
> >> "compatible with Apache Hadoop" unless it passes all current test suites
> >>  -And require that anyone who declares compatibility define what they
> >> mean by this
> >>
> >> This is a good argument for getting more functional tests out there
> >> -whoever has more functional tests needs to get them into a test module
> >> that can be used to test real deployments.
> >>
