hadoop-general mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Defining Hadoop Compatibility -revisiting-
Date Tue, 10 May 2011 10:29:38 GMT

Back in Jan 2011, I started a discussion about how to define Apache 
Hadoop Compatibility:
http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%3C4D46B6AD.2020802@apache.org%3E

I am now reading the EMC HD "Enterprise Ready" Apache Hadoop datasheet:

http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf

It claims that their implementations are 100% compatible, even though 
the Enterprise edition uses a C filesystem. It also claims that both of 
their software releases contain "Certified Stacks", without defining 
what "Certified" means or who does the certification, only that it is 
an improvement.


I think we should revisit this issue before people with their own 
agendas define what compatibility with Apache Hadoop is for us.


Licensing
-Use of the Hadoop codebase must follow the Apache License
http://www.apache.org/licenses/LICENSE-2.0
-Plug-in components that are dynamically linked (filesystems and 
schedulers) don't appear to be derivative works, on my reading of the 
license.

Naming
  -This is something for branding@apache; they will have their opinions. 
The key one is that the name "Apache Hadoop" must get used, and it's 
important to make clear that the product is a derivative work.
  -I don't think you can claim to have a Distribution/Fork/Version of 
Apache Hadoop if you swap out big chunks of it for alternate 
filesystems, MR engines, etc. Some description of this is needed, e.g.
"Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ"

Compatibility
  -the definition of the Hadoop interfaces and classes is the Apache 
Source tree.
  -the definition of the semantics of the Hadoop interfaces and classes 
is the Apache Source tree, including the test classes.
  -the verification that the actual semantics of an Apache Hadoop 
release are compatible with the expected semantics is that current and 
future tests pass.
  -bug reports can highlight incompatibility with the expectations of 
community users, and once incorporated into tests they form part of the 
compatibility testing.
  -vendors can claim and even certify their derivative works as 
compatible with other versions of their derivative works, but cannot 
claim compatibility with Apache Hadoop unless their code passes the 
tests and is consistent with the bug reports marked as "by design". 
Perhaps we should have tests that verify each of these "by design" 
bug reports to make them more formal.
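
To make "the tests define the semantics" concrete, here is a toy sketch 
of a contract suite that any implementation has to pass before it can 
be called compatible. It is not Hadoop code: the interface, both 
implementations and the check names are hypothetical, and Python is 
used only to keep the example short and self-contained.

```python
# Hypothetical sketch: none of these names come from the Hadoop source tree.
from abc import ABC, abstractmethod


class FileSystemContract(ABC):
    """Toy stand-in for a filesystem interface (not Hadoop's FileSystem API)."""

    @abstractmethod
    def create(self, path, data, overwrite=False): ...

    @abstractmethod
    def read(self, path): ...


class InMemoryFileSystem(FileSystemContract):
    """Reference behaviour: create() refuses to replace a file unless asked to."""

    def __init__(self):
        self._files = {}

    def create(self, path, data, overwrite=False):
        if path in self._files and not overwrite:
            raise FileExistsError(path)
        self._files[path] = bytes(data)

    def read(self, path):
        if path not in self._files:
            raise FileNotFoundError(path)
        return self._files[path]


class SilentOverwriteFileSystem(InMemoryFileSystem):
    """A 'vendor' variant that ignores the overwrite flag."""

    def create(self, path, data, overwrite=False):
        self._files[path] = bytes(data)


def check_roundtrip(fs):
    # Data written with create() must come back unchanged from read().
    fs.create("/a", b"payload")
    assert fs.read("/a") == b"payload"


def check_no_silent_overwrite(fs):
    # A "by design" behaviour turned into a test: create() must not clobber.
    fs.create("/a", b"first")
    try:
        fs.create("/a", b"second")
    except FileExistsError:
        pass
    else:
        raise AssertionError("create() replaced an existing file")
    assert fs.read("/a") == b"first"


CONTRACT_CHECKS = [check_roundtrip, check_no_silent_overwrite]


def run_contract(make_fs):
    """Run every check on a fresh instance; return the names of failed checks."""
    failures = []
    for check in CONTRACT_CHECKS:
        try:
            check(make_fs())
        except Exception:
            failures.append(check.__name__)
    return failures


if __name__ == "__main__":
    for impl in (InMemoryFileSystem, SilentOverwriteFileSystem):
        print(impl.__name__, run_contract(impl) or "passes all checks")
```

Under this model a compatibility claim is only as strong as the list of 
checks, which is why each "by design" bug report would ideally become 
one more entry in that list.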

Certification
  -I have no idea what this means in EMC's case; they just say 
"Certified".
  -As we don't do any certification ourselves, it would seem impossible 
for us to certify that any derivative work is compatible.
  -It may be best to state that nobody can certify their derivative as 
"compatible with Apache Hadoop" unless it passes all current test 
suites.
  -And require that anyone who declares compatibility define what they 
mean by this.

This is a good argument for getting more functional tests out there. 
Whoever has more functional tests needs to get them into a test module 
that can be used to test real deployments.

