hadoop-general mailing list archives

From Milind Bhandarkar <mbhandar...@linkedin.com>
Subject Re: Defining Hadoop Compatibility -revisiting-
Date Fri, 13 May 2011 03:40:46 GMT

Can you give me an example of a "system test" that is not a functional
test? My assumption was that the functionality being tested is specific
to a component, and that inter-component interactions (that's what you
meant, right?) would be taken care of by the public interface and
semantics of a component API.
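
(One reading of the distinction Cos draws, sketched for illustration in JUnit 4
terms against the local filesystem: a functional test exercises a single
component API, while a system test drives a job end to end across components.
The class name, test names, and the job-setup helper below are hypothetical;
only the FileSystem and Job APIs are real.)

    import static org.junit.Assert.assertTrue;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.junit.Test;

    public class FunctionalVsSystemTestSketch {

      // Functional test: exercises one component's public API contract.
      @Test
      public void mkdirsCreatesDirectory() throws Exception {
        FileSystem fs = FileSystem.getLocal(new Configuration());
        Path dir = new Path("/tmp/functional-test-dir");   // hypothetical path
        assertTrue(fs.mkdirs(dir));
        assertTrue(fs.exists(dir));
      }

      // System test: runs a whole MapReduce job and checks the result on the
      // filesystem, i.e. an inter-component interaction rather than one API.
      @Test
      public void jobWritesOutputToFileSystem() throws Exception {
        Job job = buildWordCountJob();                      // hypothetical helper
        assertTrue(job.waitForCompletion(true));
        FileSystem fs = FileSystem.get(job.getConfiguration());
        assertTrue(fs.exists(new Path("/tmp/wordcount-out/part-r-00000")));
      }

      private Job buildWordCountJob() throws Exception {
        // job wiring (input/output paths, mapper, reducer) omitted in this sketch
        throw new UnsupportedOperationException("sketch only");
      }
    }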

- milind

Milind Bhandarkar

On 5/12/11 3:30 PM, "Konstantin Boudnik" <cos@apache.org> wrote:

>On Thu, May 12, 2011 at 09:45, Milind Bhandarkar
><mbhandarkar@linkedin.com> wrote:
>> HCK and written specifications are not mutually exclusive. However, given
>> the evolving nature of Hadoop APIs, functional tests need to evolve as
>I would actually expand it to 'functional and system tests' because the
>latter are capable of validating inter-component interactions not
>coverable by functional tests.
>> well, and having them tied to a "current stable" version is easier to do
>> than it is to tie the written specifications.
>> - milind
>> --
>> Milind Bhandarkar
>> mbhandarkar@linkedin.com
>> +1-650-776-3167
>> On 5/11/11 7:26 PM, "M. C. Srivas" <mcsrivas@gmail.com> wrote:
>>>While the HCK is a great idea to check quickly if an implementation is
>>>"compliant", we still need a written specification to define what is meant
>>>by compliance, something akin to a set of RFC's, or a set of docs like the
>>>IEEE POSIX specifications.
>>>For example, the POSIX.1c pthreads API has a written document that specifies
>>>all the function calls, input params, return values, and error codes. It
>>>clearly indicates what any POSIX-compliant threads package needs to implement,
>>>and what are vendor-specific non-portable extensions that one can use at
>>>one's own risk.
>>>Currently we have 2 sets of APIs in the DFS and Map/Reduce layers, and the
>>>specification is extracted only by looking at the code, or (where the code
>>>is non-trivial) by writing really bizarre test programs to examine corner
>>>cases. Further, the interaction between a mix of the old and new APIs is not
>>>specified anywhere. Such specifications are vitally important when
>>>implementing libraries like Cascading, Mahout, etc. For example, an
>>>application might open a file using the new API, and pass that stream into a
>>>library that manipulates the stream using some of the old API ... what is
>>>then the expectation of the state of the stream when the library call
>>>returns?
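
(For illustration, a minimal sketch of the mix just described, in the HDFS
case: the application opens the file through the newer FileContext API and
hands the stream to a library written against older FileSystem-era idioms. The
path and the library stand-in are hypothetical; FileContext and
FSDataInputStream are real Hadoop classes.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileContext;
    import org.apache.hadoop.fs.Path;

    public class MixedApiStreamSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path p = new Path("/user/example/data.seq");        // hypothetical path

        // application opens the file through the newer FileContext API
        FSDataInputStream in = FileContext.getFileContext(conf).open(p);

        // ...and hands the stream to a library written against the older
        // FileSystem-era idioms (seek/read/getPos on the same stream type)
        legacyLibraryScan(in);

        // what is the stream's state (position, buffering) after the call?
        // this is the kind of question a written spec would pin down
        System.out.println("position after library call: " + in.getPos());
        in.close();
      }

      // stand-in for a third-party library such as Cascading or Mahout
      private static void legacyLibraryScan(FSDataInputStream in) throws Exception {
        in.seek(0);
        byte[] buf = new byte[4096];
        while (in.read(buf) != -1) {
          // consume
        }
      }
    }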
>>>Sanjay Radia @ Y! already started specifying some of the DFS APIs to nail
>>>things down. There's similar good effort in the Map/Reduce and Avro areas,
>>>but it seems to have stalled somewhat. We should continue it.
>>>Doing such specs would be a great service to the community and the users of
>>>Hadoop. It provides them
>>>   (a) clear-cut docs on how to use the Hadoop APIs
>>>   (b) wider choice of Hadoop implementations by freeing them from lock-in
>>>Once we have such a specification, the HCK becomes meaningful (since the HCK
>>>itself will be buggy initially).
>>>On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar
>>><mbhandarkar@linkedin.com> wrote:
>>>> I think it's time to separate out functional tests as a "Hadoop
>>>> Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under APL
>>>> 2.0. Then "certification" would mean "Passes 100% of the HCK tests".
>>>> - milind
>>>> --
>>>> Milind Bhandarkar
>>>> mbhandarkar@linkedin.com
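
(A rough sketch of how such an HCK might be packaged, assuming nothing more
than a JUnit suite that aggregates the existing functional tests; every class
name below is hypothetical.)

    import static org.junit.Assert.assertTrue;

    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.junit.runners.Suite;

    // Hypothetical HCK entry point: "certified" would mean this whole suite
    // passes, unmodified, against the implementation under test.
    @RunWith(Suite.class)
    @Suite.SuiteClasses({
        HadoopCompatibilityKit.FileSystemContractTests.class,
        HadoopCompatibilityKit.MapReduceApiTests.class
    })
    public class HadoopCompatibilityKit {

      // placeholder for the existing DFS functional tests the kit would bundle
      public static class FileSystemContractTests {
        @Test public void placeholder() { assertTrue(true); }
      }

      // placeholder for the old- and new-API MapReduce functional tests
      public static class MapReduceApiTests {
        @Test public void placeholder() { assertTrue(true); }
      }
    }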
>>>> On 5/11/11 2:24 PM, "Eric Baldeschwieler" <eric14@yahoo-inc.com> wrote:
>>>> >This is a really interesting topic!  I completely agree that we need to
>>>> >get ahead of this.
>>>> >
>>>> >I would be really interested in learning of any experience other
>>>> >projects, such as apache or tomcat, have with these issues.
>>>> >
>>>> >---
>>>> >E14 - typing on glass
>>>> >
>>>> >On May 10, 2011, at 6:31 AM, "Steve Loughran" <stevel@apache.org> wrote:
>>>> >
>>>> >>
>>>> >> Back in Jan 2011, I started a discussion about how to define Apache
>>>> >> Hadoop Compatibility:
>>>> >>
>>>> >>
>>>> >>46B6AD.2020802@apache.org%3E
>>>> >>
>>>> >> I am now reading the EMC HD "Enterprise Ready" Apache Hadoop datasheet.
>>>> >>
>>>> >> It claims that their implementations are 100% compatible, even though
>>>> >> the Enterprise edition uses a C filesystem. It also claims that
>>>> >> their software releases contain "Certified Stacks", without saying
>>>> >> what Certified means, or who does the certification -only that it is an
>>>> >> improvement.
>>>> >>
>>>> >>
>>>> >> I think we should revisit this issue before people with their own
>>>> >> agendas define what compatibility with Apache Hadoop is for us.
>>>> >>
>>>> >>
>>>> >> Licensing
>>>> >> -Use of the Hadoop codebase must follow the Apache License
>>>> >> http://www.apache.org/licenses/LICENSE-2.0
>>>> >> -plug-in components that are dynamically linked to (Filesystems,
>>>> >> schedulers) don't appear to be derivative works on my reading of the
>>>> >> license.
>>>> >>
>>>> >> Naming
>>>> >>  -this is something for branding@apache, they will have their own
>>>> >> rules. The key one is that the name "Apache Hadoop" must get used, and
>>>> >> it is important to make clear it is a derivative work.
>>>> >>  -I don't think you can claim to have a Distribution/Fork/Version of
>>>> >> Apache Hadoop if you swap out big chunks of it for alternate
>>>> >> filesystems, MR engines, etc. Some description of this is needed, e.g.
>>>> >> "Supports the Apache Hadoop MapReduce engine on top of Filesystem
>>>> >>
>>>> >> Compatibility
>>>> >>  -the definition of the Hadoop interfaces and classes is the Apache
>>>> >> Source tree,
>>>> >>  -the definition of semantics of the Hadoop interfaces and classes is
>>>> >> the Apache Source tree, including the test classes.
>>>> >>  -the verification that the actual semantics of an Apache Hadoop
>>>> >> release is compatible with the expected semantics is that current and
>>>> >> future tests pass
>>>> >>  -bug reports can highlight incompatibility with expectations of
>>>> >> community users, and once incorporated into tests form part of the
>>>> >> compatibility testing
>>>> >>  -vendors can claim and even certify their derivative works as
>>>> >> compatible with other versions of their derivative works, but cannot
>>>> >> claim compatibility with Apache Hadoop unless their code passes the
>>>> >> tests and is consistent with the bug reports marked as ("by design").
>>>> >> Perhaps we should have tests that verify each of these "by design"
>>>> >> bugreps to make them more formal.
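
(A small sketch of what capturing a "by design" bug report as a test might
look like; the class name is hypothetical and the behaviour asserted, delete
of a missing path returning false rather than throwing, is only an
illustration, not a statement of the actual spec.)

    import static org.junit.Assert.assertFalse;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.junit.Test;

    public class ByDesignBehaviourSketch {

      // A "by design" expectation captured as a test: once a bug report is
      // closed as intended behaviour, a test like this pins that behaviour down.
      @Test
      public void deleteOfMissingPathReturnsFalse() throws Exception {
        FileSystem fs = FileSystem.getLocal(new Configuration());
        Path missing = new Path("/tmp/no-such-path-" + System.nanoTime());
        assertFalse(fs.delete(missing, true));
      }
    }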
>>>> >>
>>>> >> Certification
>>>> >>  -I have no idea what this means in EMC's case, they just say
>>>> >>"Certified"
>>>> >>  -As we don't do any certification ourselves, it would seem presumptuous
>>>> >> for us to certify that any derivative work is compatible.
>>>> >>  -It may be best to state that nobody can certify their derivative work
>>>> >> as "compatible with Apache Hadoop" unless it passes all current tests.
>>>> >>  -And require that anyone who declares compatibility define what they
>>>> >> mean by this
>>>> >>
>>>> >> This is a good argument for getting more functional tests out there
>>>> >> -whoever has more functional tests needs to get them into a test suite
>>>> >> that can be used to test real deployments.
>>>> >>
