hadoop-general mailing list archives

From Konstantin Boudnik <...@apache.org>
Subject Re: Defining Hadoop Compatibility -revisiting-
Date Thu, 12 May 2011 22:30:00 GMT
On Thu, May 12, 2011 at 09:45, Milind Bhandarkar
<mbhandarkar@linkedin.com> wrote:
> HCK and written specifications are not mutually exclusive. However, given
> the evolving nature of Hadoop APIs, functional tests need to evolve as

I would actually expand that to 'functional and system tests', because the
latter can validate inter-component interactions that functional tests
cannot cover.
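
To make the distinction concrete, here is a rough sketch of a system-level
check, assuming the MiniDFSCluster harness from the Hadoop test jar (class
and method names are illustrative, from the 0.2x line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.MiniDFSCluster;

    public class SystemLevelSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // An in-process HDFS cluster: client, NameNode and DataNode all
            // exercise each other, which a functional test of one API alone
            // cannot do.
            MiniDFSCluster cluster =
                new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
            try {
                FileSystem fs = cluster.getFileSystem();
                Path p = new Path("/probe");
                fs.create(p).close();
                if (fs.getFileStatus(p).getLen() != 0)
                    throw new AssertionError("empty file has non-zero length");
            } finally {
                cluster.shutdown();
            }
        }
    }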

Cos

> well, and having them tied to a "current stable" version is easier to do
> than it is to tie written specifications to one.
>
> - milind
>
> --
> Milind Bhandarkar
> mbhandarkar@linkedin.com
> +1-650-776-3167
>
>
>
>
>
>
> On 5/11/11 7:26 PM, "M. C. Srivas" <mcsrivas@gmail.com> wrote:
>
>>While the HCK is a great idea to check quickly if an implementation is
>>"compliant", we still need a written specification to define what is meant
>>by compliance, something akin to a set of RFCs or a set of docs like the
>>IEEE POSIX specifications.
>>
>>For example, the POSIX.1c pthreads API has a written document that
>>specifies
>>all the function calls, input params, return values, and error codes. It
>>clearly indicates what any POSIX-compliant threads package needs to
>>support, and which vendor-specific, non-portable extensions one can use
>>at one's own risk.
>>
>>Currently we have two sets of APIs in the DFS and Map/Reduce layers, and the
>>specification is extracted only by looking at the code, or (where the code
>>is non-trivial) by writing really bizarre test programs to examine corner
>>cases. Further, the interaction between a mix of the old and new APIs is
>>not
>>specified anywhere. Such specifications are vitally important when
>>implementing libraries like Cascading, Mahout, etc. For example, an
>>application might open a file using the new API, and pass that stream
>>into a
>>library that manipulates the stream using some of the old API ... what is
>>then the expectation of the state of the stream when the library call
>>returns?
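
A minimal sketch of that corner case, assuming the stream comes from the
newer FileContext API; "LegacyLib" is a made-up stand-in for any old-style
library code that repositions the stream:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileContext;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper written against the old-API conventions:
    // it seeks the stream internally.
    class LegacyLib {
        static void skipHeader(FSDataInputStream in) throws IOException {
            in.seek(16); // jump past a fixed-size header
        }
    }

    public class MixedApiSketch {
        public static void main(String[] args) throws Exception {
            FileContext fc = FileContext.getFileContext();  // new API
            FSDataInputStream in = fc.open(new Path("/data/part-00000"));
            LegacyLib.skipHeader(in); // old-style code touches the stream
            // No written spec says where the stream now stands:
            System.out.println("position after library call: " + in.getPos());
            in.close();
        }
    }

Nothing today pins down what getPos() must return at that point; that is
exactly the kind of statement a written spec would make testable.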
>>
>>Sanjay Radia @ Y! already started specifying some of the DFS APIs to nail
>>such
>>things down. There's a similar good effort in the Map/Reduce and Avro
>>spaces,
>>but it seems to have stalled somewhat. We should continue it.
>>
>>Doing such specs would be a great service to the community and the users
>>of
>>Hadoop. It provides them
>>   (a) clear-cut docs on how to use the Hadoop APIs
>>   (b) wider choice of Hadoop implementations by freeing them from vendor
>>lock-in.
>>
>>Once we have such a specification, the HCK becomes meaningful (since the HCK
>>itself will be buggy initially).
>>
>>
>>On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar <mbhandarkar@linkedin.com> wrote:
>>
>>> I think it's time to separate out functional tests as a "Hadoop
>>> Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under ASL
>>> 2.0. Then "certification" would mean "Passes 100% of the HCK testsuite."
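
For illustration, one shape an HCK functional test could take: plain JUnit
against the public FileSystem contract. The test name and layout here are
invented; only the FileSystem calls are real Hadoop API.

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.junit.Test;

    public class HckRenameTest {
        @Test
        public void renameMovesTheFile() throws Exception {
            // Whatever filesystem the default Configuration points at
            // is the implementation under test.
            FileSystem fs = FileSystem.get(new Configuration());
            Path src = new Path("/hck/src"), dst = new Path("/hck/dst");
            fs.create(src).close();
            assertTrue("rename must return true", fs.rename(src, dst));
            assertFalse("source must be gone", fs.exists(src));
            assertTrue("destination must exist", fs.exists(dst));
        }
    }

"Certified" would then be a mechanical claim: the implementation, wired in
as the default filesystem, passes every such test.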
>>>
>>> - milind
>>> --
>>> Milind Bhandarkar
>>> mbhandarkar@linkedin.com
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 5/11/11 2:24 PM, "Eric Baldeschwieler" <eric14@yahoo-inc.com> wrote:
>>>
>>> >This is a really interesting topic!  I completely agree that we need to
>>> >get ahead of this.
>>> >
>>> >I would be really interested in learning of any experience other Apache
>>> >projects, such as httpd or Tomcat, have with these issues.
>>> >
>>> >---
>>> >E14 - typing on glass
>>> >
>>> >On May 10, 2011, at 6:31 AM, "Steve Loughran" <stevel@apache.org> wrote:
>>> >
>>> >>
>>> >> Back in Jan 2011, I started a discussion about how to define Apache
>>> >> Hadoop Compatibility:
>>> >>
>>> >>
>>> >> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%3C4D46B6AD.2020802@apache.org%3E
>>> >>
>>> >> I am now reading the EMC HD "Enterprise Ready" Apache Hadoop datasheet:
>>> >>
>>> >>
>>> >> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
>>> >>
>>> >> It claims that their implementations are 100% compatible, even though
>>> >> the Enterprise edition uses a C filesystem. It also claims that both
>>> >> their software releases contain "Certified Stacks", without defining
>>> >> what Certified means, or who does the certification -only that it is
>>> >> an improvement.
>>> >>
>>> >>
>>> >> I think we should revisit this issue before people with their own
>>> >> agendas define what compatibility with Apache Hadoop is for us.
>>> >>
>>> >>
>>> >> Licensing
>>> >> -Use of the Hadoop codebase must follow the Apache License
>>> >> http://www.apache.org/licenses/LICENSE-2.0
>>> >> -plug-in components that are dynamically linked to (Filesystems and
>>> >> schedulers) don't appear to be derivative works, on my reading of this.
>>> >>
>>> >> Naming
>>> >>  -this is something for branding@apache; they will have their opinions.
>>> >> The key one is that the name "Apache Hadoop" must get used, and it's
>>> >> important to make clear it is a derivative work.
>>> >>  -I don't think you can claim to have a Distribution/Fork/Version of
>>> >> Apache Hadoop if you swap out big chunks of it for alternate
>>> >> filesystems, MR engines, etc. Some description of this is needed
>>> >> "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ"
>>> >>
>>> >> Compatibility
>>> >>  -the definition of the Hadoop interfaces and classes is the Apache
>>> >> Source tree,
>>> >>  -the definition of semantics of the Hadoop interfaces and classes is
>>> >> the Apache Source tree, including the test classes.
>>> >>  -the verification that the actual semantics of an Apache Hadoop
>>> >> release is compatible with the expected semantics is that current and
>>> >> future tests pass
>>> >>  -bug reports can highlight incompatibility with the expectations of
>>> >> community users, and once incorporated into tests they form part of the
>>> >> compatibility testing
>>> >>  -vendors can claim and even certify their derivative works as
>>> >> compatible with other versions of their derivative works, but cannot
>>> >> claim compatibility with Apache Hadoop unless their code passes the
>>> >> tests and is consistent with the bug reports marked as "by design".
>>> >> Perhaps we should have tests that verify each of these "by design"
>>> >> bug reports to make them more formal.
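
As a sketch of what formalising one of those "by design" reports might look
like (the behaviour shown is a placeholder, not a real JIRA):

    import static org.junit.Assert.assertFalse;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.junit.Test;

    public class ByDesignBehaviourTest {
        // Placeholder expectation: exists() reports false for a missing
        // path rather than throwing, and a derivative must match that.
        @Test
        public void existsReturnsFalseForMissingPath() throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            assertFalse(fs.exists(new Path("/no/such/path")));
        }
    }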
>>> >>
>>> >> Certification
>>> >>  -I have no idea what this means in EMC's case, they just say
>>> >>"Certified"
>>> >>  -As we don't do any certification ourselves, it would seem impossible
>>> >> for us to certify that any derivative work is compatible.
>>> >>  -It may be best to state that nobody can certify their derivative as
>>> >> "compatible with Apache Hadoop" unless it passes all current test
>>> >> suites
>>> >>  -And require that anyone who declares compatibility define what they
>>> >> mean by this
>>> >>
>>> >> This is a good argument for getting more functional tests out there
>>> >> -whoever has more functional tests needs to get them into a test module
>>> >> that can be used to test real deployments.
>>> >>
>>>
>>>
>
>
