hadoop-general mailing list archives

From Ian Holsman <had...@holsman.net>
Subject Re: Defining Compatibility
Date Mon, 31 Jan 2011 15:40:41 GMT

On Jan 31, 2011, at 8:18 AM, Steve Loughran wrote:

> what does it mean to be compatible with Hadoop? And how do products that consider themselves
compatible with Hadoop say it?

I would like to define it in terms of APIs and core functionality.

A product (say Hive or Pig) will run against a set of well-defined APIs for a given version.
Regardless of who implements those APIs, they should perform as promised, so switching between
distributions or implementations (say HDFS over GFS) should not give the end user any surprises.
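A minimal sketch of that idea in plain Java (all names here are hypothetical, not real Hadoop interfaces): a tool written only against a fixed interface should observe identical behavior no matter which implementation backs it, even when the internals differ completely.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a well-defined filesystem API.
interface SimpleFs {
    void write(String path, String data);
    String read(String path);
}

// One implementation: a flat map of paths to contents.
class FlatFs implements SimpleFs {
    private final Map<String, String> files = new HashMap<>();
    public void write(String path, String data) { files.put(path, data); }
    public String read(String path) { return files.get(path); }
}

// A second implementation with different internals (it stores contents
// reversed), but identical observable behavior -- a tool coded against
// SimpleFs cannot tell the two apart.
class ReversedFs implements SimpleFs {
    private final Map<String, String> files = new HashMap<>();
    public void write(String path, String data) {
        files.put(path, new StringBuilder(data).reverse().toString());
    }
    public String read(String path) {
        return new StringBuilder(files.get(path)).reverse().toString();
    }
}

// The "tool" (think Hive or Pig): written only against the interface.
class Tool {
    static String roundTrip(SimpleFs fs) {
        fs.write("/user/data.txt", "hello");
        return fs.read("/user/data.txt");
    }
}
```

The point of the sketch: as long as both implementations honor the contract, `Tool.roundTrip` returns the same result on either, which is what "no end-user surprises" would mean in practice.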

That said, HDFS over GFS may expose a superset of APIs that a tool may utilize.
If the tool requires those extra APIs, it is no longer compatible with Apache Hadoop, and should
not be called such.

For example, early versions of HUE required the Thrift API to be present, so they were clearly
not compatible with Apache Hadoop 0.20.

What still perplexes me is what to do when some core functionality (say the append patch)
that presents an identical API to the end user is promoted as compatible...
I classify this change as an 'end-user surprise', *BUT* HDFS over GFS would also have similar
surprises, where the API is the same but implemented very differently.
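A sketch of that kind of surprise (hypothetical names again, in the spirit of the append example): two implementations expose the exact same method signature, but only one actually supports the operation, so the difference is invisible at compile time and only shows up when the tool runs.

```java
// Hypothetical append-capable contract; the signature is identical
// in every implementation.
interface AppendFs {
    String append(String path, String data);
}

// An implementation that honors append: contents accumulate per path.
class AppendingFs implements AppendFs {
    private final java.util.Map<String, String> files = new java.util.HashMap<>();
    public String append(String path, String data) {
        files.merge(path, data, String::concat);
        return files.get(path);
    }
}

// An implementation with the identical API that does not support the
// operation: the compile-time contract matches, but callers get a
// runtime surprise instead of the behavior they expected.
class NoAppendFs implements AppendFs {
    public String append(String path, String data) {
        throw new UnsupportedOperationException("append is not supported");
    }
}
```

Whether `NoAppendFs` should still be allowed to call itself compatible is exactly the question being raised here.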

So I'm still not sure if you would classify it as being 0.20 compatible.

> We have plugin schedulers and the like, and all is well, and the Apache brand people
keep an eye on distributions of the Hadoop code and make sure that Apache Hadoop is cleanly
distinguished from redistributions of binaries by third parties.
> But then you get distributions, and you have to define what is meant in terms of functionality
and compatibility
> Presumably, everyone who issues their own release has either explicitly or implicitly
done a lot more testing than is in the unit test suite -- testing that exists to stress-test
the code on large clusters. Is there stuff there that needs to be added to SVN to help say
a build is of sufficient quality to be released?
> Then there are the questions about
> -things that work with specific versions/releases of Hadoop?
> -replacement filesystems ?
> -replacement of core parts of the system, like the MapReduce Engine?
> IBM have been talking about "Hadoop on GPFS"
> http://www.almaden.ibm.com/storagesystems/projects/hadoop/
> If this is running the MR layer, should it say "Apache Hadoop MR engine on top of IBM
GPFS", or something else? And how do you define or assess compatibility at this point? Is it up to the
vendor to say "works with Apache Hadoop", and is running the Terasort client code sufficient
to say "compatible"?
> Similarly, if the MapReduce engine gets swapped out, what then? We in HP Labs have been
funding some exploratory work at universities in Berlin on an engine that does more operations
than just map and reduce, but it will also handle the existing operations with API compatibility
on the worker nodes. The goal here is research with an OSS deliverable, but while it may support
Hadoop jobs, it's not Hadoop.
> What to call such things?
> -Steve
