hadoop-general mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: Defining Hadoop Compatibility -revisiting-
Date Thu, 12 May 2011 09:32:46 GMT
On 12/05/2011 03:26, M. C. Srivas wrote:
> While the HCK is a great idea for quickly checking whether an implementation
> is "compliant", we still need a written specification to define what is meant
> by compliance, something akin to a set of RFCs, or a set of docs like the
> IEEE POSIX specifications.
> For example, the POSIX.1c pthreads API has a written document that specifies
> all the function calls, input params, return values, and error codes. It
> clearly indicates what any POSIX-compliant threads package needs to support,
> and which vendor-specific, non-portable extensions one can use at one's
> own risk.

I have been known to be critical of standards bodies in the past.

And I've been in them. It is absolutely essential that the Hadoop stack
doesn't become controlled by a standards body, because then you become
controlled by whoever can afford to send the most people to the standards
events, and by whoever makes behind-the-scenes deals with others to get
votes through.

> Currently we have two sets of APIs in the DFS and Map/Reduce layers, and the
> specification is extracted only by looking at the code, or (where the code
> is non-trivial) by writing really bizarre test programs to examine corner
> cases. Further, the interaction between a mix of the old and new APIs is not
> specified anywhere. Such specifications are vitally important when
> implementing libraries like Cascading, Mahout, etc. For example, an
> application might open a file using the new API, and pass that stream into a
> library that manipulates the stream using some of the old API ... what is
> then the expectation of the state of the stream when the library call
> returns?
> Sanjay Radia @ Y! already started specifying some of the DFS APIs to nail
> such things down. There's been similar good effort in the Map/Reduce and
> Avro spaces, but it seems to have stalled somewhat. We should continue it.
> Doing such specs would be a great service to the community and the users of
> Hadoop. It would provide them:
>     (a) clear-cut docs on how to use the Hadoop APIs
>     (b) a wider choice of Hadoop implementations, by freeing them from
> vendor lock-in.
They won't be Hadoop implementations; they will be "something that is
compatible with the Apache Hadoop API as defined in v0.x of the Hadoop
compatibility kit". Furthermore, there's the issue of any Google patents:
while Google has given Hadoop permission to use them, that may not apply
to other things that implement compatible APIs.

I also think that the Hadoop team needs to be the one who owns the
interfaces and tests, defines the tests as a functional test suite for
testing Hadoop distributions, and reserves the right to make changes to
the interfaces, semantics and tests as suits the team's needs. The input
from others, especially related community projects, is important, but,
to be ruthless, the compatibility issues with things that aren't really
Apache Hadoop are less important. If you choose to reimplement Hadoop,
you take on the cost of staying current.

> Once we have such specification, the HCK becomes meaningful (since the HCK
> itself will be buggy initially).
