hadoop-general mailing list archives

From Milind Bhandarkar <mbhandar...@linkedin.com>
Subject Re: Defining Hadoop Compatibility -revisiting-
Date Fri, 13 May 2011 07:11:12 GMT
Cos,

I remember the issues around "inter-component interactions" from the time
you were part of the Yahoo Hadoop FIT team (I was on the other side of the
same floor, remember? ;-)

Things like "Can Pig take full URIs as input, so that it works with
viewfs?", "Can the local jobtracker still use HDFS for input and output?",
"Can Oozie keep workflows on the local file system while the jars are
located on HDFS?" etc. came up often.
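
To make the first of these concrete: the viewfs question largely reduces to
whether a component resolves paths through the FileSystem factory instead
of assuming an hdfs:// scheme. A minimal sketch of that pattern, using the
standard Hadoop client API (the class name is mine, for illustration only):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UriResolutionCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The scheme of the supplied URI (hdfs://, viewfs://, file://, ...)
        // selects the FileSystem implementation; hard-coding
        // DistributedFileSystem would not.
        Path input = new Path(URI.create(args[0]));
        FileSystem fs = input.getFileSystem(conf);
        System.out.println(fs.getUri() + " exists=" + fs.exists(input));
      }
    }

Run it with a fully qualified viewfs:// (or hdfs://, or file://) path and
the same code works unchanged; components that bypass this factory are the
ones that break.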

Each of these was a component-interaction issue, and each was the result of
making DistributedFileSystem a public class, or of some subtle dependency
on the semantics of a particular method in an interface that was not
explicit in the syntax.

That's an issue of interface compatibility, and so merely compiling against
a particular interface is not a solution. One needs a test suite.
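
For example, a contract-style test can pin down a piece of semantics that
the compiler can never check. A minimal JUnit sketch (the test class is
mine; the behaviour asserted is simply that open() on a missing path fails
with FileNotFoundException):

    import static org.junit.Assert.fail;

    import java.io.FileNotFoundException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.junit.Test;

    public class SemanticContractTest {
      @Test
      public void openOnMissingFileThrowsFileNotFound() throws Exception {
        // Any class can satisfy the FileSystem signature and still get this
        // wrong; only a test pins the behaviour down.
        FileSystem fs = FileSystem.getLocal(new Configuration());
        try {
          fs.open(new Path("/definitely/not/there"));
          fail("expected FileNotFoundException");
        } catch (FileNotFoundException expected) {
          // this is the semantic contract callers rely on
        }
      }
    }

Swap in any other FileSystem implementation and the same test tells you
whether it honours that piece of the contract.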

(With annotations in Java, one can impose more semantic restrictions on an
interface, which can be automatically checked at runtime. But this is
limited to individual methods or to the full class. Code generation using
Perl or whatever is similar in capability.)
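
To illustrate the kind of runtime check I mean, here is a rough sketch; the
@Idempotent marker and the BlobStore interface are made up for the example,
not Hadoop code:

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;
    import java.lang.reflect.Method;

    // Hypothetical marker: the annotated method must be safe to retry.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Idempotent {}

    interface BlobStore {
      @Idempotent
      byte[] read(String key) throws Exception;

      byte[] readAndDelete(String key) throws Exception; // not idempotent
    }

    public class RetryGuard {
      // A runtime check can refuse to retry calls that lack the marker, but
      // it only sees the method (or the class), not finer-grained semantics.
      static boolean retriable(Method m) {
        return m.isAnnotationPresent(Idempotent.class);
      }

      public static void main(String[] args) throws Exception {
        System.out.println(
            retriable(BlobStore.class.getMethod("read", String.class)));          // true
        System.out.println(
            retriable(BlobStore.class.getMethod("readAndDelete", String.class))); // false
      }
    }

The granularity is exactly the limitation above: per method or per class,
nothing in between.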

- milind
-- 
Milind Bhandarkar
mbhandarkar@linkedin.com
+1-650-776-3167






On 5/12/11 11:24 PM, "Konstantin Boudnik" <cos@apache.org> wrote:

>On Thu, May 12, 2011 at 20:40, Milind Bhandarkar
><mbhandarkar@linkedin.com> wrote:
>> Cos,
>>
>> Can you give me an example of a "system test" that is not a functional
>> test ? My assumption was that the functionality being tested is specific
>> to a component, and that inter-component interactions (that's what you
>> meant, right?) would be taken care by the public interface and semantics
>> of a component API.
>
>Milind, kinda... However, to exercise inter-component interactions via
>component APIs one needs tests that go beyond the functional or component
>realm (e.g. system tests). At some point I was part of a team working on
>an integration validation framework for Hadoop (FIT), which addressed
>inter-component interaction validations, essentially guaranteeing their
>compatibility. The components being Hadoop, Pig, Oozie, etc., it exercised
>the whole application stack and covered a lot of use cases.
>
>Having a framework like this and a set of test cases available to the
>Hadoop community is a great benefit, because one can quickly make sure
>that a Hadoop stack built from a set of components is working properly.
>Another use case is to run the same set of tests - versioned separately
>from the product itself - against the previous and the next release,
>validating their compatibility at the functional level (sorta what you
>have mentioned).
>
>This doesn't, by the way, decide whether we'd choose to work on the HCK or
>not; however, the HCK might eventually be based on top of such a
>framework.
>
>Cos
>
>> - milind
>>
>> --
>> Milind Bhandarkar
>> mbhandarkar@linkedin.com
>> +1-650-776-3167
>>
>>
>>
>>
>>
>>
>> On 5/12/11 3:30 PM, "Konstantin Boudnik" <cos@apache.org> wrote:
>>
>>>On Thu, May 12, 2011 at 09:45, Milind Bhandarkar
>>><mbhandarkar@linkedin.com> wrote:
>>>> HCK and written specifications are not mutually exclusive. However,
>>>> given the evolving nature of Hadoop APIs, functional tests need to
>>>> evolve as
>>>
>>>I would actually expand it to 'functional and system tests', because the
>>>latter are capable of validating inter-component interactions not
>>>coverable by functional tests.
>>>
>>>Cos
>>>
>>>> well, and having them tied to a "current stable" version is easier to
>>>> do than it is to tie the written specifications.
>>>>
>>>> - milind
>>>>
>>>> --
>>>> Milind Bhandarkar
>>>> mbhandarkar@linkedin.com
>>>> +1-650-776-3167
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 5/11/11 7:26 PM, "M. C. Srivas" <mcsrivas@gmail.com> wrote:
>>>>
>>>>>While the HCK is a great idea to check quickly if an implementation is
>>>>>"compliant", we still need a written specification to define what is
>>>>>meant by compliance, something akin to a set of RFC's, or a set of
>>>>>docs like the IEEE POSIX specifications.
>>>>>
>>>>>For example, the POSIX.1c pthreads API has a written document that
>>>>>specifies all the function calls, input params, return values, and
>>>>>error codes. It clearly indicates what any POSIX-compliant threads
>>>>>package needs to support, and which vendor-specific non-portable
>>>>>extensions one can use at one's own risk.
>>>>>
>>>>>Currently we have 2 sets of APIs in the DFS and Map/Reduce layers, and
>>>>>the specification can be extracted only by looking at the code, or
>>>>>(where the code is non-trivial) by writing really bizarre test
>>>>>programs to examine corner cases. Further, the interaction between a
>>>>>mix of the old and new APIs is not specified anywhere. Such
>>>>>specifications are vitally important when implementing libraries like
>>>>>Cascading, Mahout, etc. For example, an application might open a file
>>>>>using the new API and pass that stream into a library that manipulates
>>>>>the stream using some of the old API ... what is then the expected
>>>>>state of the stream when the library call returns?
>>>>>
>>>>>Sanjay Radia @ Y! already started specifying some of the DFS APIs to
>>>>>nail such things down. There's a similar good effort in the Map/Reduce
>>>>>and Avro spaces, but it seems to have stalled somewhat. We should
>>>>>continue it.
>>>>>
>>>>>Doing such specs would be a great service to the community and the
>>>>>users of Hadoop. It provides them
>>>>>   (a) clear-cut docs on how to use the Hadoop APIs
>>>>>   (b) a wider choice of Hadoop implementations by freeing them from
>>>>>       vendor lock-in.
>>>>>
>>>>>Once we have such a specification, the HCK becomes meaningful (since
>>>>>the HCK itself will be buggy initially).
>>>>>
>>>>>
>>>>>On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar <mbhandarkar@linkedin.com> wrote:
>>>>>
>>>>>> I think it's time to separate out functional tests as a "Hadoop
>>>>>> Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under
>>>>>> ASL 2.0. Then "certification" would mean "Passes 100% of the HCK
>>>>>> testsuite."
>>>>>>
>>>>>> - milind
>>>>>> --
>>>>>> Milind Bhandarkar
>>>>>> mbhandarkar@linkedin.com
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 5/11/11 2:24 PM, "Eric Baldeschwieler" <eric14@yahoo-inc.com> wrote:
>>>>>>
>>>>>> >This is a really interesting topic!  I completely agree that we
>>>>>> >need to get ahead of this.
>>>>>> >
>>>>>> >I would be really interested in learning of any experience other
>>>>>> >apache projects, such as apache or tomcat have with these issues.
>>>>>> >
>>>>>> >---
>>>>>> >E14 - typing on glass
>>>>>> >
>>>>>> >On May 10, 2011, at 6:31 AM, "Steve Loughran" <stevel@apache.org> wrote:
>>>>>> >
>>>>>> >>
>>>>>> >> Back in Jan 2011, I started a discussion about how to define
>>>>>> >> Apache Hadoop Compatibility:
>>>>>> >>
>>>>>> >>
>>>>>> >> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%3C4D46B6AD.2020802@apache.org%3E
>>>>>> >>
>>>>>> >> I am now reading EMC HD "Enterprise Ready" Apache Hadoop datasheet
>>>>>> >>
>>>>>> >>
>>>>>> >> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
>>>>>> >>
>>>>>> >> It claims that their implementations are 100% compatible, even
>>>>>> >> though the Enterprise edition uses a C filesystem. It also claims
>>>>>> >> that both their software releases contain "Certified Stacks",
>>>>>> >> without defining what Certified means, or who does the
>>>>>> >> certification -only that it is an improvement.
>>>>>> >>
>>>>>> >>
>>>>>> >> I think we should revisit this issue before people with their own
>>>>>> >> agendas define what compatibility with Apache Hadoop is for us
>>>>>> >>
>>>>>> >>
>>>>>> >> Licensing
>>>>>> >> -Use of the Hadoop codebase must follow the Apache License
>>>>>> >> http://www.apache.org/licenses/LICENSE-2.0
>>>>>> >> -plug in components that are dynamically linked to (Filesystems
>>>>>> >> and schedulers) don't appear to be derivative works on my reading
>>>>>> >> of this,
>>>>>> >>
>>>>>> >> Naming
>>>>>> >>  -this is something for branding@apache, they will have their
>>>>>> >> opinions. The key one is that the name "Apache Hadoop" must get
>>>>>> >> used, and it's important to make clear it is a derivative work.
>>>>>> >>  -I don't think you can claim to have a Distribution/Fork/Version
>>>>>> >> of Apache Hadoop if you swap out big chunks of it for alternate
>>>>>> >> filesystems, MR engines, etc. Some description of this is needed
>>>>>> >> "Supports the Apache Hadoop MapReduce engine on top of Filesystem
>>>>>> >> XYZ"
>>>>>> >>
>>>>>> >> Compatibility
>>>>>> >>  -the definition of the Hadoop interfaces and classes is the
>>>>>> >> Apache Source tree,
>>>>>> >>  -the definition of semantics of the Hadoop interfaces and
>>>>>> >> classes is the Apache Source tree, including the test classes.
>>>>>> >>  -the verification that the actual semantics of an Apache Hadoop
>>>>>> >> release is compatible with the expected semantics is that current
>>>>>> >> and future tests pass
>>>>>> >>  -bug reports can highlight incompatibility with expectations of
>>>>>> >> community users, and once incorporated into tests form part of
>>>>>> >> the compatibility testing
>>>>>> >>  -vendors can claim and even certify their derivative works as
>>>>>> >> compatible with other versions of their derivative works, but
>>>>>> >> cannot claim compatibility with Apache Hadoop unless their code
>>>>>> >> passes the tests and is consistent with the bug reports marked as
>>>>>> >> ("by design"). Perhaps we should have tests that verify each of
>>>>>> >> these "by design" bugreps to make them more formal.
>>>>>> >>
>>>>>> >> Certification
>>>>>> >>  -I have no idea what this means in EMC's case, they just say
>>>>>> >> "Certified"
>>>>>> >>  -As we don't do any certification ourselves, it would seem
>>>>>> >> impossible for us to certify that any derivative work is
>>>>>> >> compatible.
>>>>>> >>  -It may be best to state that nobody can certify their
>>>>>> >> derivative as "compatible with Apache Hadoop" unless it passes
>>>>>> >> all current test suites
>>>>>> >>  -And require that anyone who declares compatibility define what
>>>>>> >> they mean by this
>>>>>> >>
>>>>>> >> This is a good argument for getting more functional tests out
>>>>>> >> there -whoever has more functional tests needs to get them into
>>>>>> >> a test module that can be used to test real deployments.
>>>>>> >>
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>

