hadoop-general mailing list archives

From Milind Bhandarkar <mbhandar...@linkedin.com>
Subject Re: Defining Hadoop Compatibility -revisiting-
Date Fri, 13 May 2011 03:37:49 GMT
The problem with (only) specs is that they are written in natural
language and are subject to human interpretation. Since humans are bad at
interpreting natural language, this gives rise to things called
standards bodies and lawyers, and that has never been good for anyone in
the past ;-)

Now consider this scenario:

$ bin/hadoop jar hck-0.20.2.jar --config <myconfig/dir>
... Bunch of output ...
Result: Tests run: 1000, Successful: 999, Failed: 1

This is much easier to interpret, even for humans.

The intention of formally defining compatibility is so that programs
written for Apache Hadoop run unmodified on other open-source /
closed-source systems that claim to be "Apache Hadoop Compatible". Unless
it can be verified easily, the compatibility definition has no meaning.
So, standards that are only documented are useless.

By the way, one should also define "Apache Hadoop Source Compatible", and
"Apache Hadoop Binary Compatible", depending on whether one recompiles
src/hck/**.java and rebuilds hck.jar or not.
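
To make that concrete, here is a rough sketch of what a single check inside
such a (so far hypothetical) hck suite could look like. Only the
Configuration/FileSystem/Path calls are real Apache Hadoop public APIs; the
class name, test path, and pass/fail reporting are invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// One hypothetical hck check: create a file, read back its status, delete
// it, and verify the behaviour matches what the Apache code and tests promise.
public class CreateReadDeleteCheck {

  public static void main(String[] args) throws Exception {
    // Cluster settings come from the --config directory passed to bin/hadoop
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path p = new Path("/tmp/hck/create-read-delete");
    byte[] payload = "hck".getBytes("UTF-8");

    FSDataOutputStream out = fs.create(p, true);   // create with overwrite
    out.write(payload);
    out.close();

    // Expected semantics: the file exists, reports the length we wrote,
    // and is gone after delete().
    boolean ok = fs.exists(p)
        && fs.getFileStatus(p).getLen() == payload.length
        && fs.delete(p, false)
        && !fs.exists(p);

    System.out.println(ok ? "PASSED" : "FAILED");
    System.exit(ok ? 0 : 1);
  }
}

Running the prebuilt hck.jar unmodified against a candidate system would be
the binary-compatibility test; recompiling src/hck/**.java against that
system's client jars, rebuilding hck.jar, and running it again would be the
source-compatibility test.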

- milind

-- 
Milind Bhandarkar
mbhandarkar@linkedin.com
+1-650-776-3167






On 5/12/11 3:26 PM, "Konstantin Boudnik" <cos@apache.org> wrote:

>TCK (or JCK initially) was done as a tool to basically compare Java
>Lang specs with a particular implementation, including but not limited
>to an extensive suite of, say, compiler tests.
>
>So I assume that before we can embark on any sort of HCK suite, some formal
>specs would have to be defined. It's rather hard to say that
>implementation X is (or is not) compatible with Apache Hadoop for lack of
>an API- and spec-level definition of what really comprises such an animal.
>
>As was mentioned someplace else in the thread, there's an effort under way
>to document the DFS, MR, and Avro APIs. Seems like a very good
>start for Hadoop specs at large.
>--
>  Take care,
>Konstantin (Cos) Boudnik
>
>On Wed, May 11, 2011 at 16:20, Aaron Kimball <akimball83@gmail.com> wrote:
>> What does it mean to "implement" those interfaces? I'm +1 for a
>>TCK-based
>> definition. In addition to statically implementing a set of interfaces,
>>each
>> interface also implicitly includes a set of acceptable inputs and
>>predicted
>> outputs (or ranges of outputs) for those inputs.
>>
>> - Aaron
>>
>> On Wed, May 11, 2011 at 3:56 PM, Jacob R Rideout
>><apache@jacobrideout.net>wrote:
>>
>>> What about defining compatibility as fully implementing all the
>>> public-stable annotated interfaces for a particular release?
>>>
>>> Jacob Rideout
>>>
>>> On Wed, May 11, 2011 at 4:42 PM, Ian Holsman <hadoop@holsman.net>
>>>wrote:
>>> > For Apache (httpd, I'm assuming you mean), we define compatibility as
>>> adherence to the set of RFCs that define the HTTP protocol.
>>> >
>>> > I'm no expert in this (Roy is though), but we could attempt to do
>>> something similar when it comes to HDFS/Map-Reduce protocols. I'm not
>>>sure
>>> what benefit there would be to going to an RFC, as opposed to
>>>documenting the
>>> API on our site.
>>> >
>>> >
>>> > On May 12, 2011, at 7:24 AM, Eric Baldeschwieler wrote:
>>> >
>>> >> This is a really interesting topic!  I completely agree that we
>>>need to
>>> get ahead of this.
>>> >>
>>> >> I would be really interested in learning of any experience other
>>>apache
>>> projects, such as Apache httpd or Tomcat, have with these issues.
>>> >>
>>> >> ---
>>> >> E14 - typing on glass
>>> >>
>>> >> On May 10, 2011, at 6:31 AM, "Steve Loughran" <stevel@apache.org>
>>> wrote:
>>> >>
>>> >>>
>>> >>> Back in Jan 2011, I started a discussion about how to define Apache
>>> >>> Hadoop Compatibility:
>>> >>>
>>> >>> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%3C4D46B6AD.2020802@apache.org%3E
>>> >>>
>>> >>> I am now reading EMC HD "Enterprise Ready" Apache Hadoop datasheet
>>> >>>
>>> >>>
>>> >>> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
>>> >>>
>>> >>> It claims that their implementations are 100% compatible, even
>>>though
>>> >>> the Enterprise edition uses a C filesystem. It also claims that
>>>both
>>> >>> their software releases contain "Certified Stacks", without
>>>defining
>>> >>> what Certified means, or who does the certification -only that it
>>>is an
>>> >>> improvement.
>>> >>>
>>> >>>
>>> >>> I think we should revisit this issue before people with their own
>>> >>> agendas define what compatibility with Apache Hadoop is for us.
>>> >>>
>>> >>>
>>> >>> Licensing
>>> >>> -Use of the Hadoop codebase must follow the Apache License
>>> >>> http://www.apache.org/licenses/LICENSE-2.0
>>> >>> -plug-in components that are dynamically linked to (Filesystems and
>>> >>> schedulers) don't appear to be derivative works on my reading of
>>>this,
>>> >>>
>>> >>> Naming
>>> >>> -this is something for branding@apache; they will have their
>>>opinions.
>>> >>> The key one is that the name "Apache Hadoop" must get used, and
>>>it's
>>> >>> important to make clear it is a derivative work.
>>> >>> -I don't think you can claim to have a Distribution/Fork/Version of
>>> >>> Apache Hadoop if you swap out big chunks of it for alternate
>>> >>> filesystems, MR engines, etc. Some description of this is needed
>>> >>> "Supports the Apache Hadoop MapReduce engine on top of Filesystem
>>>XYZ"
>>> >>>
>>> >>> Compatibility
>>> >>> -the definition of the Hadoop interfaces and classes is the Apache
>>> >>> Source tree,
>>> >>> -the definition of semantics of the Hadoop interfaces and classes
>>>is
>>> >>> the Apache Source tree, including the test classes.
>>> >>> -the verification that the actual semantics of an Apache Hadoop
>>> >>> release is compatible with the expected semantics is that current
>>>and
>>> >>> future tests pass
>>> >>> -bug reports can highlight incompatibility with expectations of
>>> >>> community users, and once incorporated into tests form part of the
>>> >>> compatibility testing
>>> >>> -vendors can claim and even certify their derivative works as
>>> >>> compatible with other versions of their derivative works, but
>>>cannot
>>> >>> claim compatibility with Apache Hadoop unless their code passes the
>>> >>> tests and is consistent with the bug reports marked as ("by
>>>design").
>>> >>> Perhaps we should have tests that verify each of these "by design"
>>> >>> bugreps to make them more formal.
>>> >>>
>>> >>> Certification
>>> >>> -I have no idea what this means in EMC's case; they just say
>>> "Certified"
>>> >>> -As we don't do any certification ourselves, it would seem
>>>impossible
>>> >>> for us to certify that any derivative work is compatible.
>>> >>> -It may be best to state that nobody can certify their derivative
>>>as
>>> >>> "compatible with Apache Hadoop" unless it passes all current test
>>> suites
>>> >>> -And require that anyone who declares compatibility define what
>>>they
>>> >>> mean by this
>>> >>>
>>> >>> This is a good argument for getting more functional tests out there
>>> >>> -whoever has more functional tests needs to get them into a test
>>>module
>>> >>> that can be used to test real deployments.
>>> >>>
>>> >
>>> >
>>>
>>

