hadoop-general mailing list archives

From Konstantin Boudnik <...@apache.org>
Subject Re: Defining Hadoop Compatibility -revisiting-
Date Fri, 13 May 2011 06:24:51 GMT
On Thu, May 12, 2011 at 20:40, Milind Bhandarkar
<mbhandarkar@linkedin.com> wrote:
> Cos,
>
> Can you give me an example of a "system test" that is not a functional
> test? My assumption was that the functionality being tested is specific
> to a component, and that inter-component interactions (that's what you
> meant, right?) would be taken care of by the public interface and
> semantics of a component API.

Milind, kinda... However, to exercise inter-component interactions via
component APIs one needs tests that go beyond the functional or
component realm (i.e. system tests). At some point I was part of a team
working on an integration validation framework for Hadoop (FIT) which
addressed inter-component interaction validation, essentially
guaranteeing compatibility between components. With the components being
Hadoop, Pig, Oozie, etc., the framework exercised the whole application
stack and covered a lot of use cases.
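
Concretely, a stack-level check along these lines might stage input through
the FileSystem API, drive a downstream component (Pig, Oozie, ...), and verify
the result back through HDFS. The sketch below is illustrative only: the paths
and class name are made up and the downstream submission is left as a
placeholder; only the standard FileSystem calls are real API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Stage input in HDFS, drive a downstream component against the same
    // cluster, then verify the result back through HDFS. The downstream step
    // is a placeholder because its API depends on the component.
    public class StackSmokeTest {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // points at the cluster under test
        FileSystem fs = FileSystem.get(conf);

        Path input = new Path("/tmp/stack-test/input.txt");
        Path output = new Path("/tmp/stack-test/output");
        FSDataOutputStream out = fs.create(input, true);
        out.writeBytes("hello stack\n");
        out.close();

        // Placeholder: submit a Pig script / Oozie workflow that reads 'input'
        // and writes 'output' on the same cluster.

        // The inter-component check: output produced by one component must be
        // readable by another through the common FileSystem API.
        System.out.println("output present: " + fs.exists(output));
      }
    }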

Having a framework like this, and a set of test cases, available to the
Hadoop community is a great benefit, because one can quickly make sure
that a Hadoop stack built from a set of components is working
properly. Another use case is to run the same set of tests - versioned
separately from the product itself - against a previous and the next
release, validating their compatibility at the functional level (sort of
what you have mentioned).
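
A minimal sketch of how such a separately-versioned check could be pointed at
clusters running different releases (the property name, class name, and paths
below are invented for illustration; "fs.default.name" is the classic
configuration key):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // The same test binary is aimed at different Hadoop releases by passing
    // the target filesystem URI as a property.
    public class RenameCompatCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name",
                 System.getProperty("target.fs", "hdfs://localhost:8020"));
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path("/tmp/compat/src");
        Path dst = new Path("/tmp/compat/dst");
        fs.delete(dst, true);
        fs.create(src, true).close();

        // The behaviour observed here must match across the releases under test.
        boolean renamed = fs.rename(src, dst);
        System.out.println("rename returned " + renamed
                           + ", dst exists=" + fs.exists(dst));
      }
    }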

This doesn't, by the way, settle whether we choose to work on the HCK or
not; however, the HCK might eventually be built on top of such a framework.

Cos

> - milind
>
> --
> Milind Bhandarkar
> mbhandarkar@linkedin.com
> +1-650-776-3167
>
>
>
>
>
>
> On 5/12/11 3:30 PM, "Konstantin Boudnik" <cos@apache.org> wrote:
>
>>On Thu, May 12, 2011 at 09:45, Milind Bhandarkar
>><mbhandarkar@linkedin.com> wrote:
>>> HCK and written specifications are not mutually exclusive. However, given
>>> the evolving nature of Hadoop APIs, functional tests need to evolve as
>>
>>I would actually expand it to 'functional and system tests' because
>>the latter are capable of validating inter-component interactions not
>>covered by functional tests.
>>
>>Cos
>>
>>> well, and having them tied to a "current stable" version is easier to do
>>> than it is for written specifications.
>>>
>>> - milind
>>>
>>> --
>>> Milind Bhandarkar
>>> mbhandarkar@linkedin.com
>>> +1-650-776-3167
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 5/11/11 7:26 PM, "M. C. Srivas" <mcsrivas@gmail.com> wrote:
>>>
>>>>While the HCK is a great idea to check quickly whether an implementation
>>>>is "compliant", we still need a written specification to define what is
>>>>meant by compliance, something akin to a set of RFCs, or a set of docs
>>>>like the IEEE POSIX specifications.
>>>>
>>>>For example, the POSIX.1c pthreads API has a written document that
>>>>specifies all the function calls, input params, return values, and error
>>>>codes. It clearly indicates what any POSIX-compliant threads package
>>>>needs to support, and which vendor-specific, non-portable extensions one
>>>>can use at one's own risk.
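
As a rough illustration of the kind of contract such a written document would
pin down for a single Hadoop call (the wording below is only an example, not
an agreed Apache Hadoop specification):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;

    // Sketch of a spec-style entry: signature, parameters, return value, and
    // error behaviour all stated up front instead of being inferred from code.
    public abstract class FileSystemSpecSketch {
      /**
       * Renames Path src to Path dst.
       *
       * @param src an existing path
       * @param dst the new path
       * @return true if the rename succeeded; false for the failure cases the
       *         spec chooses to report non-exceptionally (e.g. src missing);
       *         each such case would have to be enumerated, not left to the code
       * @throws IOException on communication or permission failures
       */
      public abstract boolean rename(Path src, Path dst) throws IOException;
    }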
>>>>
>>>>Currently we have two sets of APIs in the DFS and Map/Reduce layers, and
>>>>the specification is extracted only by looking at the code, or (where the
>>>>code is non-trivial) by writing really bizarre test programs to examine
>>>>corner cases. Further, the interaction between a mix of the old and new
>>>>APIs is not specified anywhere. Such specifications are vitally important
>>>>when implementing libraries like Cascading, Mahout, etc. For example, an
>>>>application might open a file using the new API and pass that stream into
>>>>a library that manipulates the stream using some of the old API ... what
>>>>is then the expected state of the stream when the library call returns?
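
The mixed-API situation being described can be sketched roughly like this
(FileContext is the newer filesystem API, FSDataInputStream the stream type it
shares with the older FileSystem API; the library call is a made-up stand-in,
and the paths are illustrative):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileContext;
    import org.apache.hadoop.fs.Path;

    // The application opens a stream with the newer API and hands it to
    // library code written against older idioms; a spec should say what the
    // caller may assume about the stream afterwards.
    public class MixedApiExample {
      public static void main(String[] args) throws IOException {
        FileContext fc = FileContext.getFileContext(new Configuration());
        FSDataInputStream in = fc.open(new Path("/tmp/data.txt"));

        libraryProcess(in);  // library seeks/reads the stream directly

        // Is the position still where the library left it? Is the stream open?
        System.out.println("position after library call: " + in.getPos());
        in.close();
      }

      // Stand-in for a third-party library (Cascading-style code) that
      // manipulates the stream it was handed.
      static void libraryProcess(FSDataInputStream in) throws IOException {
        in.seek(0);
        in.read();
      }
    }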
>>>>
>>>>Sanjay Radia @ Y! already started specifying some of the DFS APIs to nail
>>>>such things down. There's a similar good effort in the Map/Reduce and Avro
>>>>spaces, but it seems to have stalled somewhat. We should continue it.
>>>>
>>>>Doing such specs would be a great service to the community and the users
>>>>of Hadoop. It would provide them
>>>>   (a) clear-cut docs on how to use the Hadoop APIs
>>>>   (b) a wider choice of Hadoop implementations by freeing them from
>>>>       vendor lock-in.
>>>>
>>>>Once we have such a specification, the HCK becomes meaningful (since the
>>>>HCK itself will be buggy initially).
>>>>
>>>>
>>>>On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar
>>>><mbhandarkar@linkedin.com> wrote:
>>>>
>>>>> I think it's time to separate out functional tests as a "Hadoop
>>>>> Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under
>>>>> ASL 2.0. Then "certification" would mean "Passes 100% of the HCK
>>>>> testsuite."
>>>>>
>>>>> - milind
>>>>> --
>>>>> Milind Bhandarkar
>>>>> mbhandarkar@linkedin.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 5/11/11 2:24 PM, "Eric Baldeschwieler" <eric14@yahoo-inc.com> wrote:
>>>>>
>>>>> >This is a really interesting topic!  I completely agree that we need
>>>>> >to get ahead of this.
>>>>> >
>>>>> >I would be really interested in learning of any experience other
>>>>> >apache projects, such as apache or tomcat, have with these issues.
>>>>> >
>>>>> >---
>>>>> >E14 - typing on glass
>>>>> >
>>>>> >On May 10, 2011, at 6:31 AM, "Steve Loughran" <stevel@apache.org>
>>>>> >wrote:
>>>>> >
>>>>> >>
>>>>> >> Back in Jan 2011, I started a discussion about how to define Apache
>>>>> >> Hadoop Compatibility:
>>>>> >>
>>>>> >>
>>>>> >> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%3C4D46B6AD.2020802@apache.org%3E
>>>>> >>
>>>>> >> I am now reading EMC HD "Enterprise Ready" Apache Hadoop datasheet
>>>>> >>
>>>>> >>
>>>>> >> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
>>>>> >>
>>>>> >> It claims that their implementations are 100% compatible, even though
>>>>> >> the Enterprise edition uses a C filesystem. It also claims that both
>>>>> >> their software releases contain "Certified Stacks", without defining
>>>>> >> what Certified means, or who does the certification -only that it is
>>>>> >> an improvement.
>>>>> >>
>>>>> >>
>>>>> >> I think we should revisit this issue before people with their own
>>>>> >> agendas define what compatibility with Apache Hadoop is for us
>>>>> >>
>>>>> >>
>>>>> >> Licensing
>>>>> >> -Use of the Hadoop codebase must follow the Apache License
>>>>> >> http://www.apache.org/licenses/LICENSE-2.0
>>>>> >> -plug-in components that are dynamically linked to (Filesystems and
>>>>> >> schedulers) don't appear to be derivative works on my reading of this,
>>>>> >>
>>>>> >> Naming
>>>>> >>  -this is something for branding@apache, they will have their opinions.
>>>>> >> The key one is that the name "Apache Hadoop" must get used, and it's
>>>>> >> important to make clear it is a derivative work.
>>>>> >>  -I don't think you can claim to have a Distribution/Fork/Version of
>>>>> >> Apache Hadoop if you swap out big chunks of it for alternate
>>>>> >> filesystems, MR engines, etc. Some description of this is needed:
>>>>> >> "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ"
>>>>> >>
>>>>> >> Compatibility
>>>>> >>  -the definition of the Hadoop interfaces and classes is the Apache
>>>>> >> Source tree,
>>>>> >>  -the definition of semantics of the Hadoop interfaces and classes is
>>>>> >> the Apache Source tree, including the test classes.
>>>>> >>  -the verification that the actual semantics of an Apache Hadoop
>>>>> >> release are compatible with the expected semantics is that current and
>>>>> >> future tests pass
>>>>> >>  -bug reports can highlight incompatibility with expectations of
>>>>> >> community users, and once incorporated into tests they form part of
>>>>> >> the compatibility testing
>>>>> >>  -vendors can claim and even certify their derivative works as
>>>>> >> compatible with other versions of their derivative works, but cannot
>>>>> >> claim compatibility with Apache Hadoop unless their code passes the
>>>>> >> tests and is consistent with the bug reports marked as "by design".
>>>>> >> Perhaps we should have tests that verify each of these "by design"
>>>>> >> bug reports to make them more formal.
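
One way a "by design" bug report could be made formal is a small named test
per report, roughly along these lines (the JIRA reference and the asserted
behaviour below are placeholders, not a real resolved issue):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Pattern sketch: a bug report closed as "by design" becomes a named,
    // referenceable check in the compatibility suite.
    public class ByDesignBehaviourTest {
      // e.g. "HADOOP-XXXX: delete() of a missing path returns false, by design"
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        boolean deleted = fs.delete(new Path("/definitely/not/there"), false);
        if (deleted) {
          throw new AssertionError(
              "expected delete() of a missing path to return false");
        }
      }
    }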
>>>>> >>
>>>>> >> Certification
>>>>> >>  -I have no idea what this means in EMC's case, they just say
>>>>> >> "Certified"
>>>>> >>  -As we don't do any certification ourselves, it would seem impossible
>>>>> >> for us to certify that any derivative work is compatible.
>>>>> >>  -It may be best to state that nobody can certify their derivative as
>>>>> >> "compatible with Apache Hadoop" unless it passes all current test
>>>>> >> suites
>>>>> >>  -And require that anyone who declares compatibility define what they
>>>>> >> mean by this
>>>>> >>
>>>>> >> This is a good argument for getting more functional tests out there
>>>>> >> -whoever has more functional tests needs to get them into a test module
>>>>> >> that can be used to test real deployments.
>>>>> >>
>>>>>
>>>>>
>>>
>>>
>
>
