hadoop-general mailing list archives

From Konstantin Boudnik <...@apache.org>
Subject Re: Defining Hadoop Compatibility -revisiting-
Date Fri, 13 May 2011 17:47:36 GMT
On Fri, May 13, 2011 at 00:11, Milind Bhandarkar
<mbhandarkar@linkedin.com> wrote:
> Cos,
>
> I remember the issues about the "inter-component interactions" at that
> point when you were part of the Yahoo Hadoop FIT team (I was on the other
> side of the same floor, remember ? ;-)

Vaguely ;) Of course I remember. But I prefer not to mention any
internal technologies developed for private companies, after getting
lashes for that.

> Things like "Can Pig take full URIs as input, and so work with viewfs",
> "Can a local jobtracker still use HDFS as input and output", "Can Oozie use
> the local file system to keep workflows, while the jars are located on HDFS",
> etc. came up often.
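To make the first of those concrete: code that resolves the filesystem from
the full input URI keeps working under viewfs, while code that silently
assumes the default filesystem does not. A minimal sketch using the public
FileSystem.get(URI, Configuration) call; the "cluster" mount table and the
path are made up for illustration and assume a viewfs mount table is
configured:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FullUriInput {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A full URI, here a hypothetical viewfs mount point.
        URI input = URI.create("viewfs://cluster/projects/logs/part-00000");
        // Resolve the filesystem from the URI instead of assuming fs.defaultFS,
        // so the same code works against hdfs://, viewfs://, file://, etc.
        FileSystem fs = FileSystem.get(input, conf);
        System.out.println("scheme handled: " + fs.getUri().getScheme());
        System.out.println("exists: " + fs.exists(new Path(input)));
      }
    }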
>
> Each of these issues was a component-interaction issue, and was the result
> of making DistributedFileSystem a public class, or of some subtle dependency
> on the semantics of a particular method in an interface, which was not
> explicit in the syntax.
>
> That's an issue with interface-compatibility, and so merely compiling
> against a particular interface is not a solution. One needs a test-suite.

One needs more than a mere test suite, if experience teaches us
anything. FIT and its continuation turned out to be a complex program (not
only in the sense of computer code) with many moving parts, bells and
whistles. One of those parts was a set of specs actually written in plain
English. The downside is that someone needs to keep them up to date,
translate them into test cases or teach others how to do it, etc. That is
exactly why the TCK used a test generator and a somewhat formalized spec
language.

Cos

> (With annotations in Java, one can impose more semantic restrictions on
> the interface, which can be automatically checked against at runtime. But
> that is limited to individual methods, or the full class. Code generation
> using perl or whatever is similar in capability.)
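For what it's worth, here is a minimal sketch of that kind of
annotation-plus-reflection check; the @Idempotent annotation, the
StorageClient interface and the checker below are made up for illustration,
not actual Hadoop classes:

    import java.lang.annotation.*;
    import java.lang.reflect.Method;

    // A marker carrying a semantic promise the method signature alone can't express.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Idempotent {}

    interface StorageClient {
      @Idempotent
      byte[] read(String path);

      void append(String path, byte[] data); // no such promise made here
    }

    public class SemanticCheck {
      // Verify at runtime that a given method carries the promised annotation.
      static void requireIdempotent(Class<?> iface, String name, Class<?>... params)
          throws NoSuchMethodException {
        Method m = iface.getMethod(name, params);
        if (!m.isAnnotationPresent(Idempotent.class)) {
          throw new IllegalStateException(m + " is not declared @Idempotent");
        }
      }

      public static void main(String[] args) throws Exception {
        requireIdempotent(StorageClient.class, "read", String.class);                 // passes
        requireIdempotent(StorageClient.class, "append", String.class, byte[].class); // throws
      }
    }

As noted above, this only attaches semantics to individual methods or to a
whole class; it cannot express cross-component expectations, which is where
a test suite comes in.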
>
> - milind
> --
> Milind Bhandarkar
> mbhandarkar@linkedin.com
> +1-650-776-3167
>
>
>
>
>
>
> On 5/12/11 11:24 PM, "Konstantin Boudnik" <cos@apache.org> wrote:
>
>>On Thu, May 12, 2011 at 20:40, Milind Bhandarkar
>><mbhandarkar@linkedin.com> wrote:
>>> Cos,
>>>
>>> Can you give me an example of a "system test" that is not a functional
>>> test ? My assumption was that the functionality being tested is specific
>>> to a component, and that inter-component interactions (that's what you
>>> meant, right?) would be taken care by the public interface and semantics
>>> of a component API.
>>
>>Milind, kinda... However, to exercise inter-component interactions via
>>component APIs one needs tests which go beyond the functional or
>>component realm (e.g. system tests). At some point I was part of a team
>>working on an integration validation framework for Hadoop (FIT), which
>>addressed inter-component interaction validations, essentially
>>guaranteeing their compatibility. The components being Hadoop, Pig, Oozie,
>>etc., it thus exercised the whole application stack and covered a
>>lot of use cases.
>>
>>Having a framework like this and a set of test cases available to the
>>Hadoop community is a great benefit, because one can quickly make sure
>>that a Hadoop stack built from a set of components is working
>>properly. Another use case is to run the same set of tests - versioned
>>separately from the product itself - against a previous and a next
>>release, validating their compatibility at the functional level (sorta
>>what you have mentioned).
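As a rough illustration of what one such separately-versioned test could look
like - a JUnit case touching only the public FileSystem API, so the same test
jar can be pointed at a previous or a next release simply by changing which
cluster fs.defaultFS resolves to (the path below is a placeholder):

    import static org.junit.Assert.assertArrayEquals;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.junit.Test;

    public class RoundTripCompatTest {
      @Test
      public void writeThenReadBack() throws Exception {
        // Picks up whichever cluster configuration the test runner provides.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/tmp/compat-check/roundtrip.bin"); // placeholder location
        byte[] expected = "compatibility probe".getBytes("UTF-8");

        FSDataOutputStream out = fs.create(p, true);
        out.write(expected);
        out.close();

        byte[] actual = new byte[expected.length];
        FSDataInputStream in = fs.open(p);
        in.readFully(actual);
        in.close();

        assertArrayEquals(expected, actual);
        fs.delete(p, false);
      }
    }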
>>
>>This doesn't, by the way, depend on whether we choose to work on the HCK
>>or not; however, the HCK might eventually be based on top of such a
>>framework.
>>
>>Cos
>>
>>> - milind
>>>
>>> --
>>> Milind Bhandarkar
>>> mbhandarkar@linkedin.com
>>> +1-650-776-3167
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 5/12/11 3:30 PM, "Konstantin Boudnik" <cos@apache.org> wrote:
>>>
>>>>On Thu, May 12, 2011 at 09:45, Milind Bhandarkar
>>>><mbhandarkar@linkedin.com> wrote:
>>>>> HCK and written specifications are not mutually exclusive. However,
>>>>>given
>>>>> the evolving nature of Hadoop APIs, functional tests need to evolve as
>>>>
>>>>I would actually expand that to 'functional and system tests', because
>>>>the latter are capable of validating inter-component interactions not
>>>>coverable by functional tests.
>>>>
>>>>Cos
>>>>
>>>>> well, and having them tied to a "current stable" version is easier to
>>>>>do
>>>>> than it is to tie the written specifications.
>>>>>
>>>>> - milind
>>>>>
>>>>> --
>>>>> Milind Bhandarkar
>>>>> mbhandarkar@linkedin.com
>>>>> +1-650-776-3167
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 5/11/11 7:26 PM, "M. C. Srivas" <mcsrivas@gmail.com> wrote:
>>>>>
>>>>>>While the HCK is a great idea to check quickly if an implementation is
>>>>>>"compliant", we still need a written specification to define what is
>>>>>>meant by compliance, something akin to a set of RFC's, or a set of docs
>>>>>>like the IEEE POSIX specifications.
>>>>>>
>>>>>>For example, the POSIX.1c pthreads API has a written document that
>>>>>>specifies all the function calls, input params, return values, and
>>>>>>error codes. It clearly indicates what any POSIX-compliant threads
>>>>>>package needs to support, and what are vendor-specific non-portable
>>>>>>extensions that one can use at one's own risk.
>>>>>>
>>>>>>Currently we have 2 sets of API in the DFS and Map/Reduce layers, and
>>>>>>the specification is extracted only by looking at the code, or (where
>>>>>>the code is non-trivial) by writing really bizarre test programs to
>>>>>>examine corner cases. Further, the interaction between a mix of the old
>>>>>>and new APIs is not specified anywhere. Such specifications are vitally
>>>>>>important when implementing libraries like Cascading, Mahout, etc. For
>>>>>>example, an application might open a file using the new API, and pass
>>>>>>that stream into a library that manipulates the stream using some of
>>>>>>the old API ... what is then the expectation of the state of the stream
>>>>>>when the library call returns?
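A small sketch of the kind of mixed usage being described; the path and the
scanHeader helper are hypothetical, and the point is only that the stream's
state after the call is not pinned down by either API's documentation:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileContext;
    import org.apache.hadoop.fs.Path;

    public class MixedApiExample {
      // Stands in for a library written against older FileSystem-era conventions:
      // it seeks around in the stream it is handed.
      static long scanHeader(FSDataInputStream in) throws Exception {
        in.seek(0);
        return in.readLong();
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path p = new Path("/tmp/example.seq"); // placeholder path

        // The application opens the file through the newer FileContext API...
        FSDataInputStream in = FileContext.getFileContext(conf).open(p);

        long before = in.getPos();
        scanHeader(in);            // ...then hands the stream to old-style library code.
        long after = in.getPos();

        // Neither API spells out what 'after' must be at this point; that is
        // exactly the kind of semantics a written spec would nail down.
        System.out.println("position before=" + before + ", after=" + after);
        in.close();
      }
    }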
>>>>>>
>>>>>>Sanjay Radia @ Y! already started specifying some of the DFS APIs to
>>>>>>nail such things down. There's similar good effort in the Map/Reduce
>>>>>>and Avro spaces, but it seems to have stalled somewhat. We should
>>>>>>continue it.
>>>>>>
>>>>>>Doing such specs would be a great service to the community and the
>>>>>>users
>>>>>>of
>>>>>>Hadoop. It provides them
>>>>>>   (a) clear-cut docs on how to use the Hadoop APIs
>>>>>>   (b) wider choice of Hadoop implementations by freeing them from
>>>>>>vendor
>>>>>>lock-in.
>>>>>>
>>>>>>Once we have such specification, the HCK becomes meaningful (since
>>>>>>the HCK itself will be buggy initially).
>>>>>>
>>>>>>
>>>>>>On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar
>>>>>><mbhandarkar@linkedin.com
>>>>>>> wrote:
>>>>>>
>>>>>>> I think it's time to separate out functional tests as a "Hadoop
>>>>>>> Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under
>>>>>>> ASL 2.0. Then "certification" would mean "Passes 100% of the HCK
>>>>>>> testsuite."
>>>>>>>
>>>>>>> - milind
>>>>>>> --
>>>>>>> Milind Bhandarkar
>>>>>>> mbhandarkar@linkedin.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 5/11/11 2:24 PM, "Eric Baldeschwieler" <eric14@yahoo-inc.com>
>>>>>>>wrote:
>>>>>>>
>>>>>>> >This is a really interesting topic!  I completely agree that we
>>>>>>> >need to get ahead of this.
>>>>>>> >
>>>>>>> >I would be really interested in learning of any experience other
>>>>>>> >apache projects, such as apache or tomcat have with these issues.
>>>>>>> >
>>>>>>> >---
>>>>>>> >E14 - typing on glass
>>>>>>> >
>>>>>>> >On May 10, 2011, at 6:31 AM, "Steve Loughran" <stevel@apache.org>
>>>>>>>wrote:
>>>>>>> >
>>>>>>> >>
>>>>>>> >> Back in Jan 2011, I started a discussion about how to define
>>>>>>> >> Apache Hadoop Compatibility:
>>>>>>> >>
>>>>>>> >> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%3C4D46B6AD.2020802@apache.org%3E
>>>>>>> >>
>>>>>>> >> I am now reading EMC HD "Enterprise Ready" Apache Hadoop
>>>>>>> >> datasheet
>>>>>>> >>
>>>>>>> >> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
>>>>>>> >>
>>>>>>> >> It claims that their implementations are 100% compatible, even
>>>>>>> >> though the Enterprise edition uses a C filesystem. It also
>>>>>>> >> claims that both their software releases contain "Certified
>>>>>>> >> Stacks", without defining what Certified means, or who does the
>>>>>>> >> certification -only that it is an improvement.
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> I think we should revisit this issue before people with their
>>>>>>> >> own agendas define what compatibility with Apache Hadoop is for us
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> Licensing
>>>>>>> >> -Use of the Hadoop codebase must follow the Apache License
>>>>>>> >> http://www.apache.org/licenses/LICENSE-2.0
>>>>>>> >> -plug in components that are dynamically linked to (Filesystems
>>>>>>> >> and schedulers) don't appear to be derivative works on my
>>>>>>> >> reading of this,
>>>>>>> >>
>>>>>>> >> Naming
>>>>>>> >>  -this is something for branding@apache, they will have their
>>>>>>> >> opinions. The key one is that the name "Apache Hadoop" must get
>>>>>>> >> used, and it's important to make clear it is a derivative work.
>>>>>>> >>  -I don't think you can claim to have a Distribution/Fork/Version
>>>>>>> >> of Apache Hadoop if you swap out big chunks of it for alternate
>>>>>>> >> filesystems, MR engines, etc. Some description of this is needed
>>>>>>> >> "Supports the Apache Hadoop MapReduce engine on top of Filesystem
>>>>>>> >> XYZ"
>>>>>>> >>
>>>>>>> >> Compatibility
>>>>>>> >>  -the definition of the Hadoop interfaces and classes is the
>>>>>>> >> Apache Source tree,
>>>>>>> >>  -the definition of semantics of the Hadoop interfaces and classes
>>>>>>> >> is the Apache Source tree, including the test classes.
>>>>>>> >>  -the verification that the actual semantics of an Apache Hadoop
>>>>>>> >> release is compatible with the expected semantics is that current
>>>>>>> >> and future tests pass
>>>>>>> >>  -bug reports can highlight incompatibility with expectations of
>>>>>>> >> community users, and once incorporated into tests form part of the
>>>>>>> >> compatibility testing
>>>>>>> >>  -vendors can claim and even certify their derivative works as
>>>>>>> >> compatible with other versions of their derivative works, but
>>>>>>> >> cannot claim compatibility with Apache Hadoop unless their code
>>>>>>> >> passes the tests and is consistent with the bug reports marked
>>>>>>> >> as ("by design"). Perhaps we should have tests that verify each
>>>>>>> >> of these "by design" bugreps to make them more formal.
>>>>>>> >>
>>>>>>> >> Certification
>>>>>>> >>  -I have no idea what this means in EMC's case, they just say
>>>>>>> >> "Certified"
>>>>>>> >>  -As we don't do any certification ourselves, it would seem
>>>>>>> >> impossible for us to certify that any derivative work is compatible.
>>>>>>> >>  -It may be best to state that nobody can certify their derivative
>>>>>>> >> as "compatible with Apache Hadoop" unless it passes all current
>>>>>>> >> test suites
>>>>>>> >>  -And require that anyone who declares compatibility define what
>>>>>>> >> they mean by this
>>>>>>> >>
>>>>>>> >> This is a good argument for getting more functional tests out
>>>>>>> >> there -whoever has more functional tests needs to get them into a
>>>>>>> >> test module that can be used to test real deployments.
>>>>>>> >>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>
>>>
>
>
