Subject: Re: Defining Hadoop Compatibility -revisiting-
From: "M. C. Srivas" <mcsrivas@gmail.com>
To: general@hadoop.apache.org
Date: Wed, 11 May 2011 19:26:03 -0700

While the HCK is a great idea for quickly checking whether an implementation is "compliant", we still need a written specification to define what compliance means, something akin to a set of RFCs, or a set of documents like the IEEE POSIX specifications.

For example, the POSIX.1c pthreads API has a written document that specifies all the function calls, input parameters, return values, and error codes. It clearly indicates what any POSIX-compliant threads package needs to support, and which vendor-specific, non-portable extensions one can use at one's own risk.

Currently we have two sets of APIs in the DFS and Map/Reduce layers, and the specification can be extracted only by reading the code, or (where the code is non-trivial) by writing really bizarre test programs to probe the corner cases.

Further, the interaction between a mix of the old and new APIs is not specified anywhere. Such specifications are vitally important when implementing libraries like Cascading, Mahout, etc. For example, an application might open a file using the new API and pass that stream into a library that manipulates it using some of the old API. What, then, is the expected state of the stream when the library call returns?
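To make that concrete, here is a minimal sketch, assuming a release that has the new FileContext API; the path is hypothetical, and the "library" is reduced to a single positioned read:

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileContext;
    import org.apache.hadoop.fs.Path;

    public class StreamStateQuestion {
      public static void main(String[] args) throws Exception {
        // The application opens the file through the new API...
        FileContext fc = FileContext.getFileContext();
        FSDataInputStream in = fc.open(new Path("/data/input.seq")); // hypothetical path

        in.seek(4096); // ...and positions the stream.

        // A library then issues an old-style positioned read
        // (PositionedReadable) against the same stream object.
        byte[] header = new byte[16];
        in.readFully(0, header); // read 16 bytes at offset 0

        // Underspecified: is getPos() still 4096 here, or did the
        // positioned read move the seek pointer? Nothing written down
        // pins this behaviour; a spec would.
        System.out.println("pos after library call = " + in.getPos());

        in.close();
      }
    }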
Sanjay Radia @ Y! has already started specifying some of the DFS APIs to nail such things down. There's a similar good effort in the Map/Reduce and Avro spaces, but it seems to have stalled somewhat. We should continue it.

Doing such specs would be a great service to the community and the users of Hadoop. It provides them with (a) clear-cut docs on how to use the Hadoop APIs, and (b) a wider choice of Hadoop implementations, by freeing them from vendor lock-in.

Once we have such a specification, the HCK becomes meaningful (since the HCK itself will be buggy initially).

On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar wrote:

> I think it's time to separate out the functional tests as a "Hadoop
> Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under ASL
> 2.0. Then "certification" would mean "passes 100% of the HCK test suite."
>
> - milind
> --
> Milind Bhandarkar
> mbhandarkar@linkedin.com
>
> On 5/11/11 2:24 PM, "Eric Baldeschwieler" wrote:
>
> >This is a really interesting topic! I completely agree that we need to
> >get ahead of this.
> >
> >I would be really interested in learning of any experience other Apache
> >projects, such as the HTTP Server or Tomcat, have with these issues.
> >
> >---
> >E14 - typing on glass
> >
> >On May 10, 2011, at 6:31 AM, "Steve Loughran" wrote:
> >
> >> Back in Jan 2011, I started a discussion about how to define Apache
> >> Hadoop compatibility:
> >>
> >> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%3C4D46B6AD.2020802@apache.org%3E
> >>
> >> I am now reading the EMC HD "Enterprise Ready" Apache Hadoop datasheet:
> >>
> >> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
> >>
> >> It claims that their implementations are 100% compatible, even though
> >> the Enterprise edition uses a C filesystem. It also claims that both
> >> their software releases contain "Certified Stacks", without defining
> >> what Certified means or who does the certification, only that it is an
> >> improvement.
> >>
> >> I think we should revisit this issue before people with their own
> >> agendas define for us what compatibility with Apache Hadoop means.
> >>
> >> Licensing
> >> -Use of the Hadoop codebase must follow the Apache License:
> >> http://www.apache.org/licenses/LICENSE-2.0
> >> -Plug-in components that are dynamically linked to (filesystems and
> >> schedulers) don't appear to be derivative works, on my reading of this.
> >>
> >> Naming
> >> -This is something for branding@apache; they will have their opinions.
> >> The key one is that the name "Apache Hadoop" must get used, and it's
> >> important to make clear it is a derivative work.
> >> -I don't think you can claim to have a distribution/fork/version of
> >> Apache Hadoop if you swap out big chunks of it for alternate
> >> filesystems, MR engines, etc.
> >> Some description of this is needed, e.g.
> >> "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ".
> >>
> >> Compatibility
> >> -The definition of the Hadoop interfaces and classes is the Apache
> >> source tree.
> >> -The definition of the semantics of the Hadoop interfaces and classes
> >> is the Apache source tree, including the test classes.
> >> -The verification that the actual semantics of an Apache Hadoop
> >> release match the expected semantics is that current and future tests
> >> pass.
> >> -Bug reports can highlight incompatibility with the expectations of
> >> community users, and once incorporated into tests they form part of
> >> the compatibility testing.
> >> -Vendors can claim, and even certify, that their derivative works are
> >> compatible with other versions of their derivative works, but they
> >> cannot claim compatibility with Apache Hadoop unless their code passes
> >> the tests and is consistent with the bug reports marked as "by
> >> design". Perhaps we should have tests that verify each of these "by
> >> design" bug reports, to make them more formal.
> >>
> >> Certification
> >> -I have no idea what this means in EMC's case; they just say
> >> "Certified".
> >> -As we don't do any certification ourselves, it would seem impossible
> >> for us to certify that any derivative work is compatible.
> >> -It may be best to state that nobody can certify their derivative as
> >> "compatible with Apache Hadoop" unless it passes all the current test
> >> suites.
> >> -And to require that anyone who declares compatibility define what
> >> they mean by it.
> >>
> >> This is a good argument for getting more functional tests out there:
> >> whoever has more functional tests needs to get them into a test module
> >> that can be used to test real deployments.
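For a concrete sense of what an HCK-style functional test could look like, here is a hypothetical JUnit sketch of a contract test that pins down the positioned-read behaviour from the example above. The class name, the fixture path, and the asserted behaviour are all assumptions; deciding what the correct behaviour actually is would be the job of the written spec, not of the test.

    import static org.junit.Assert.assertEquals;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.junit.Test;

    // Hypothetical HCK contract test: any implementation claiming
    // compatibility must preserve the stream position across a
    // positioned read (the asserted behaviour is an assumption here).
    public class TestPositionedReadContract {

      @Test
      public void positionedReadDoesNotMoveSeekPointer() throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path("/hck/contract-data")); // hypothetical fixture
        try {
          in.seek(4096);
          byte[] buf = new byte[16];
          in.readFully(0, buf); // positioned read at offset 0
          assertEquals("positioned read must not move the seek pointer",
                       4096, in.getPos());
        } finally {
          in.close();
        }
      }
    }

A suite of such tests, run against a real deployment rather than an in-process mini cluster, is the kind of test module Steve's last point asks for.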