hadoop-general mailing list archives

From Eric Baldeschwieler <eri...@yahoo-inc.com>
Subject Re: Defining Hadoop Compatibility -revisiting-
Date Fri, 13 May 2011 05:05:56 GMT
print "+1";
goto label;

I could not agree more with everything you said, Steve!  The Apache Hadoop project should own
the definition of Apache Hadoop.  Hadoop is far from done.  The interfaces need to keep evolving
to get to a place where we can be proud of them.

I support "vendors" building replacement components for Apache Hadoop components.  That will
benefit the community, give folks choices, and challenge us to make Apache Hadoop even better.
I think it is critical that Apache Hadoop remain a living, evolving work, driven by those
willing to contribute their work to it, and that the result of that evolution be the
reference implementation that vendors must match and exceed to play.

I'd love to see more effort to add specifications and compatibility tests to Apache Hadoop.
We'll continue to invest in specs and see what we can do about tests.  I encourage folks
who wish to demonstrate compatibility and use the Apache Hadoop trademark with their products
to help contribute such work to Apache Hadoop.  We should include these things with the code
under SVN, subject to our normal patch peer review.
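To make the idea concrete, here is a rough sketch (hypothetical names, plain Java standing in for the real Hadoop stream interfaces) of how a single spec clause could be turned into a reusable compatibility check that any implementation, vendor or Apache, would have to pass:

```java
// Hypothetical sketch: one spec clause expressed as a reusable contract test.
// "SeekableInput" is a stand-in, NOT a real Hadoop interface; a real kit
// would run every implementation through the same spec-derived checks.
import java.io.IOException;

interface SeekableInput {
    void seek(long pos) throws IOException;
    int read() throws IOException;          // returns -1 at end of stream
    long getPos() throws IOException;
}

// Reference implementation backed by a byte array, used to validate the test itself.
class ByteArraySeekableInput implements SeekableInput {
    private final byte[] data;
    private int pos;
    ByteArraySeekableInput(byte[] data) { this.data = data; }
    public void seek(long p) { pos = (int) p; }
    public int read() { return pos < data.length ? (data[pos++] & 0xFF) : -1; }
    public long getPos() { return pos; }
}

public class SeekContractTest {
    // Spec clause: after seek(p), getPos() == p and read() returns the byte at p.
    static void checkSeekThenRead(SeekableInput in, byte[] expected, long p)
            throws IOException {
        in.seek(p);
        if (in.getPos() != p)
            throw new AssertionError("getPos() != seek target");
        if (in.read() != (expected[(int) p] & 0xFF))
            throw new AssertionError("read() after seek returned wrong byte");
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {10, 20, 30, 40};
        checkSeekThenRead(new ByteArraySeekableInput(data), data, 2);
        System.out.println("contract holds");
    }
}
```

The point is that each written spec clause and its executable check would live together in the tree and evolve under the same patch review as the code.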

On May 12, 2011, at 2:32 AM, Steve Loughran wrote:

> On 12/05/2011 03:26, M. C. Srivas wrote:
>> While the HCK is a great idea for quickly checking whether an implementation is
>> "compliant", we still need a written specification to define what is meant
>> by compliance, something akin to a set of RFCs, or a set of docs like the
>> IEEE POSIX specifications.
>> For example, the POSIX.1c pthreads API has a written document that specifies
>> all the function calls, input params, return values, and error codes. It
>> clearly indicates what any POSIX-compliant threads package needs to support,
>> and which vendor-specific non-portable extensions one can use at
>> one's own risk.
> I have been known to be critical of standards bodies in the past
> http://www.waterfall2006.com/loughran.html
> And I've been in them. It is absolutely essential that the Hadoop stack 
> doesn't become controlled by a standards body, as then you become 
> controlled by whoever can afford to send the most people to the 
> standards events, and to make behind-the-scenes deals with others to get 
> votes through.
>> Currently we have two sets of APIs in the DFS and Map/Reduce layers, and the
>> specification can be extracted only by reading the code, or (where the code
>> is non-trivial) by writing really bizarre test programs to examine corner
>> cases. Further, the interaction between a mix of the old and new APIs is not
>> specified anywhere. Such specifications are vitally important when
>> implementing libraries like Cascading, Mahout, etc. For example, an
>> application might open a file using the new API, and pass that stream into a
>> library that manipulates the stream using some of the old API ... what is
>> then the expectation of the state of the stream when the library call
>> returns?
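
[The underspecified-state question above is easy to reproduce with plain java.io streams, used here only as a stand-in; the real question concerns the Hadoop stream classes and the old vs. new MapReduce APIs:]

```java
// Illustration of the shared-stream state question using plain java.io
// (a stand-in for the Hadoop stream classes; this is not Hadoop API code).
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SharedStreamDemo {
    // A "library" routine that reads a 4-byte header from the caller's stream.
    static void readHeader(InputStream in) throws IOException {
        byte[] header = new byte[4];
        int n = in.read(header);        // consumes bytes from the shared stream
        if (n != 4) throw new IOException("short header");
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[10]);
        readHeader(in);
        // Unless a spec says the library must restore position, the caller
        // now sees the stream four bytes further along:
        System.out.println(in.available());   // prints 6, not 10
    }
}
```

Without a written spec, whether the library may leave the stream advanced, or must restore its position, is exactly the kind of corner case each implementation decides differently.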
>> Sanjay Radia @ Y! already started specifying some of the DFS APIs to nail such
>> things down. There's similar good effort in the Map/Reduce and Avro spaces,
>> but it seems to have stalled somewhat. We should continue it.
>> Doing such specs would be a great service to the community and the users of
>> Hadoop. It provides them
>>    (a) clear-cut docs on how to use the Hadoop APIs
> +1
>>    (b) wider choice of Hadoop implementations by freeing them from vendor
>> lock-in.
> =0
> They won't be Hadoop implementations, they will be "something that is 
> compatible with the Apache Hadoop API as defined in v 0.x of the Hadoop 
> compatibility kit". Furthermore, there's the issue of any Google patents: 
> while Google has given Hadoop permission to use them, that may not apply 
> to other things that implement compatible APIs.
> I also think that the Hadoop team needs to be the ones who own the 
> interfaces and tests, define the tests as a functional test suite for 
> testing Hadoop distributions, and reserve the right to make changes to 
> the interfaces, semantics and tests as suits the team's needs. The input 
> from others, especially related community projects, is important, but, 
> to be ruthless, the compatibility issues with things that aren't really 
> Apache Hadoop are less important. If you choose to reimplement Hadoop, you 
> take on the costs of staying current.
>> Once we have such a specification, the HCK becomes meaningful (since the HCK
>> itself will be buggy initially).
