hadoop-general mailing list archives

From Milind Bhandarkar <mbhandar...@linkedin.com>
Subject Re: Defining Hadoop Compatibility -revisiting-
Date Thu, 12 May 2011 16:45:41 GMT
HCK and written specifications are not mutually exclusive. However, given
the evolving nature of Hadoop APIs, functional tests need to evolve as
well, and tying them to a "current stable" version is easier than doing
the same for written specifications.
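
For illustration only, here is a rough sketch of the kind of functional test
such a kit might contain; it assumes JUnit 4 and the public Hadoop FileSystem
API, and the class name and test path are made up:

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

// Hypothetical HCK test: an implementation claiming compatibility would
// have to exhibit exactly these create/write/read semantics.
public class HckFileSystemRoundTripTest {

  @Test
  public void createdFileIsReadableWithSameContents() throws Exception {
    Configuration conf = new Configuration();     // picks up fs.defaultFS
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/tmp/hck-roundtrip.txt");  // made-up test path
    byte[] payload = "hello, hck".getBytes(StandardCharsets.UTF_8);

    try (FSDataOutputStream out = fs.create(p, true)) {   // overwrite = true
      out.write(payload);
    }

    assertTrue("file must exist after create + close", fs.exists(p));
    assertEquals("length must equal bytes written",
                 payload.length, fs.getFileStatus(p).getLen());

    byte[] readBack = new byte[payload.length];
    try (FSDataInputStream in = fs.open(p)) {
      in.readFully(readBack);
    }
    assertEquals(new String(payload, StandardCharsets.UTF_8),
                 new String(readBack, StandardCharsets.UTF_8));

    fs.delete(p, false);
  }
}

Tests like this pin down observable semantics (existence, length, contents)
rather than implementation details, which is what lets a kit of them track a
"current stable" release.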

- milind

-- 
Milind Bhandarkar
mbhandarkar@linkedin.com
+1-650-776-3167






On 5/11/11 7:26 PM, "M. C. Srivas" <mcsrivas@gmail.com> wrote:

>While the HCK is a great idea to check quickly if an implementation is
>"compliant", we still need a written specification to define what is meant
>by compliance, something akin to a set of RFCs, or a set of docs like the
>IEEE POSIX specifications.
>
>For example, the POSIX.1c pthreads API has a written document that specifies
>all the function calls, input params, return values, and error codes. It
>clearly indicates what any POSIX-compliant threads package needs to support,
>and which vendor-specific, non-portable extensions one can use at one's own
>risk.
>
>Currently we have two sets of APIs in the DFS and Map/Reduce layers, and the
>specification is extracted only by looking at the code, or (where the code
>is non-trivial) by writing really bizarre test programs to examine corner
>cases. Further, the interaction between a mix of the old and new APIs is not
>specified anywhere. Such specifications are vitally important when
>implementing libraries like Cascading, Mahout, etc. For example, an
>application might open a file using the new API, and pass that stream into a
>library that manipulates the stream using some of the old API ... what is
>then the expectation of the state of the stream when the library call
>returns?
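
As a rough illustration of the kind of interaction described above (a sketch
only: the legacyScan helper is hypothetical, standing in for a third-party
library, while FileContext and FSDataInputStream are real Hadoop classes):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class MixedApiStreams {

  // Stand-in for a third-party library routine written against the older
  // stream conventions: it rewinds and reads without restoring the
  // caller's position. (Hypothetical; not a real Hadoop or library class.)
  static void legacyScan(FSDataInputStream in) throws IOException {
    in.seek(0);
    byte[] header = new byte[16];
    in.readFully(0, header);   // positioned read via the old PositionedReadable call
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();

    // Open the file through the newer FileContext API.
    FileContext fc = FileContext.getFileContext(conf);
    FSDataInputStream in = fc.open(new Path("/data/input.seq"));  // made-up path

    in.seek(128);      // the application positions the stream...
    legacyScan(in);    // ...then hands the same stream to the library

    // Without reading the library's code, the application cannot tell where
    // the stream is now positioned or what else changed; a written spec of
    // the stream contracts is what would pin such expectations down.
    System.out.println("position after library call: " + in.getPos());
    in.close();
  }
}

Today the answer has to be dug out of the implementation; a written spec
would state it outright.
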
>
>Sanjay Radia @ Y! already started specifying some of the DFS APIs to nail
>such things down. There's a similar good effort in the Map/Reduce and Avro
>spaces, but it seems to have stalled somewhat. We should continue it.
>
>Doing such specs would be a great service to the community and the users of
>Hadoop. It provides them:
>   (a) clear-cut docs on how to use the Hadoop APIs
>   (b) a wider choice of Hadoop implementations by freeing them from vendor
>       lock-in.
>
>Once we have such a specification, the HCK becomes meaningful (since the HCK
>itself will be buggy initially).
>
>
>On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar <mbhandarkar@linkedin.com>
>wrote:
>
>> I think it's time to separate out functional tests as a "Hadoop
>> Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under ASL
>> 2.0. Then "certification" would mean "Passes 100% of the HCK testsuite."
>>
>> - milind
>> --
>> Milind Bhandarkar
>> mbhandarkar@linkedin.com
>>
>>
>>
>>
>>
>>
>> On 5/11/11 2:24 PM, "Eric Baldeschwieler" <eric14@yahoo-inc.com> wrote:
>>
>> >This is a really interesting topic!  I completely agree that we need to
>> >get ahead of this.
>> >
>> >I would be really interested in learning of any experience other Apache
>> >projects, such as httpd or Tomcat, have with these issues.
>> >
>> >---
>> >E14 - typing on glass
>> >
>> >On May 10, 2011, at 6:31 AM, "Steve Loughran" <stevel@apache.org> wrote:
>> >
>> >>
>> >> Back in Jan 2011, I started a discussion about how to define Apache
>> >> Hadoop Compatibility:
>> >>
>> >>
>> >> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%3C4D46B6AD.2020802@apache.org%3E
>> >>
>> >> I am now reading EMC HD "Enterprise Ready" Apache Hadoop datasheet
>> >>
>> >>
>> >> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
>> >>
>> >> It claims that their implementations are 100% compatible, even though
>> >> the Enterprise edition uses a C filesystem. It also claims that both
>> >> their software releases contain "Certified Stacks", without defining
>> >> what Certified means, or who does the certification -only that it is an
>> >> improvement.
>> >>
>> >>
>> >> I think we should revisit this issue before people with their own
>> >> agendas define what compatibility with Apache Hadoop is for us.
>> >>
>> >>
>> >> Licensing
>> >> -Use of the Hadoop codebase must follow the Apache License
>> >> http://www.apache.org/licenses/LICENSE-2.0
>> >> -plug-in components that are dynamically linked to (Filesystems and
>> >> schedulers) don't appear to be derivative works on my reading of this.
>> >>
>> >> Naming
>> >>  -this is something for branding@apache; they will have their opinions.
>> >> The key one is that the name "Apache Hadoop" must get used, and it's
>> >> important to make clear it is a derivative work.
>> >>  -I don't think you can claim to have a Distribution/Fork/Version of
>> >> Apache Hadoop if you swap out big chunks of it for alternate
>> >> filesystems, MR engines, etc. Some description of this is needed:
>> >> "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ"
>> >>
>> >> Compatibility
>> >>  -the definition of the Hadoop interfaces and classes is the Apache
>> >> Source tree.
>> >>  -the definition of the semantics of the Hadoop interfaces and classes
>> >> is the Apache Source tree, including the test classes.
>> >>  -the verification that the actual semantics of an Apache Hadoop
>> >> release are compatible with the expected semantics is that current and
>> >> future tests pass.
>> >>  -bug reports can highlight incompatibility with the expectations of
>> >> community users and, once incorporated into tests, form part of the
>> >> compatibility testing.
>> >>  -vendors can claim and even certify their derivative works as
>> >> compatible with other versions of their derivative works, but cannot
>> >> claim compatibility with Apache Hadoop unless their code passes the
>> >> tests and is consistent with the bug reports marked as ("by design").
>> >> Perhaps we should have tests that verify each of these "by design"
>> >> bugreps to make them more formal.
>> >>
>> >> Certification
>> >>  -I have no idea what this means in EMC's case; they just say "Certified"
>> >>  -As we don't do any certification ourselves, it would seem impossible
>> >> for us to certify that any derivative work is compatible.
>> >>  -It may be best to state that nobody can certify their derivative as
>> >> "compatible with Apache Hadoop" unless it passes all current test
>>suites
>> >>  -And require that anyone who declares compatibility define what they
>> >> mean by this
>> >>
>> >> This is a good argument for getting more functional tests out there
>> >> -whoever has more functional tests needs to get them into a test module
>> >> that can be used to test real deployments.
>> >>
>>
>>
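
One way such a test module could be pointed at a live cluster rather than a
local or in-process filesystem (a sketch only: the property name hck.test.fs
is invented for illustration, while fs.defaultFS is the standard Hadoop
configuration key):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Sketch: resolve the filesystem under test from a system property so the
// same functional tests can run against a real deployment. The property
// name "hck.test.fs" is made up for illustration.
public final class TestTarget {

  private TestTarget() {}

  public static FileSystem get() throws Exception {
    Configuration conf = new Configuration();
    String fsUri = System.getProperty("hck.test.fs",
                                      conf.get("fs.defaultFS", "file:///"));
    return FileSystem.get(URI.create(fsUri), conf);
  }
}

A run against a vendor's deployment would then look something like
mvn test -Dhck.test.fs=hdfs://namenode.example.com:8020 (hypothetical command
line, assuming the build passes system properties through to the tests).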

