From: Konstantin Boudnik <cos@boudnik.org>
Date: Thu, 12 May 2011 15:30:00 -0700
Subject: Re: Defining Hadoop Compatibility -revisiting-
To: general@hadoop.apache.org

On Thu, May 12, 2011 at 09:45, Milind Bhandarkar wrote:
> HCK and written specifications are not mutually exclusive. However, given
> the evolving nature of Hadoop APIs, functional tests need to evolve as

I would actually expand it to "functional and system tests", because the
latter are capable of validating inter-component interactions that are not
coverable by functional tests.
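A minimal sketch of the kind of test I mean, assuming the MiniDFSCluster
and MiniMRCluster harnesses that ship in the Hadoop test jars (the class
name and the job to submit are made up, left to the reader):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.MiniDFSCluster;
    import org.apache.hadoop.mapred.MiniMRCluster;

    public class DfsMrInteractionTest {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // bring up a small but real HDFS: 2 datanodes, freshly formatted
        MiniDFSCluster dfs = new MiniDFSCluster(conf, 2, true, null);
        try {
          FileSystem fs = dfs.getFileSystem();
          // and MapReduce on top of that HDFS: 2 tasktrackers
          MiniMRCluster mr = new MiniMRCluster(2, fs.getUri().toString(), 1);
          try {
            // submit a real job via JobClient here and assert on its
            // output; that exercises the DFS <-> MapReduce interaction
            // end to end, which per-component functional tests never touch
          } finally {
            mr.shutdown();
          }
        } finally {
          dfs.shutdown();
        }
      }
    }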
Cos

> well, and having them tied to a "current stable" version is easier to do
> than it is to tie the written specifications.
>
> - milind
>
> --
> Milind Bhandarkar
> mbhandarkar@linkedin.com
> +1-650-776-3167
>
> On 5/11/11 7:26 PM, "M. C. Srivas" wrote:
>
>>While the HCK is a great idea to check quickly whether an implementation
>>is "compliant", we still need a written specification to define what is
>>meant by compliance, something akin to a set of RFCs, or a set of docs
>>like the IEEE POSIX specifications.
>>
>>For example, the POSIX.1c pthreads API has a written document that
>>specifies all the function calls, input params, return values, and error
>>codes. It clearly indicates what any POSIX-compliant threads package
>>needs to support, and which vendor-specific non-portable extensions one
>>can use at one's own risk.
>>
>>Currently we have two sets of APIs in the DFS and Map/Reduce layers, and
>>the specification is extracted only by looking at the code, or (where the
>>code is non-trivial) by writing really bizarre test programs to examine
>>corner cases. Further, the interaction between a mix of the old and new
>>APIs is not specified anywhere.
>>
>>Such specifications are vitally important when implementing libraries
>>like Cascading, Mahout, etc. For example, an application might open a
>>file using the new API, and pass that stream into a library that
>>manipulates the stream using some of the old API ... what is then the
>>expectation of the state of the stream when the library call returns?
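>>
>>(To make that concrete, a hypothetical sketch; SomeLibrary is a made-up
>>stand-in for Cascading/Mahout-style code, FileContext is the newer
>>filesystem API, and FSDataInputStream is the shared stream type:)
>>
>>  import org.apache.hadoop.conf.Configuration;
>>  import org.apache.hadoop.fs.FSDataInputStream;
>>  import org.apache.hadoop.fs.FileContext;
>>  import org.apache.hadoop.fs.Path;
>>
>>  class StreamStateQuestion {
>>    static long example(Configuration conf) throws Exception {
>>      FileContext fc = FileContext.getFileContext(conf);    // new API
>>      FSDataInputStream in = fc.open(new Path("/data/part-00000"));
>>      SomeLibrary.transform(in); // library seeks/reads via the old API
>>      return in.getPos();        // what value is guaranteed here, if any?
>>    }
>>  }
>>
>>  // stub for the third-party code in question
>>  class SomeLibrary {
>>    static void transform(FSDataInputStream in) throws Exception {
>>      in.seek(42);  // old-API style repositioning, say
>>    }
>>  }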
>>
>>Sanjay Radia @ Y! already started specifying some of the DFS APIs to
>>nail such things down. There's a similar good effort in the Map/Reduce
>>and Avro spaces, but it seems to have stalled somewhat. We should
>>continue it.
>>
>>Doing such specs would be a great service to the community and the users
>>of Hadoop. It provides them
>>  (a) clear-cut docs on how to use the Hadoop APIs
>>  (b) a wider choice of Hadoop implementations, by freeing them from
>>      vendor lock-in.
>>
>>Once we have such a specification, the HCK becomes meaningful (since the
>>HCK itself will be buggy initially).
>>
>>On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar wrote:
>>
>>> I think it's time to separate out functional tests as a "Hadoop
>>> Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under
>>> ASL 2.0. Then "certification" would mean "Passes 100% of the HCK
>>> testsuite."
>>>
>>> - milind
>>> --
>>> Milind Bhandarkar
>>> mbhandarkar@linkedin.com
>>>
>>> On 5/11/11 2:24 PM, "Eric Baldeschwieler" wrote:
>>>
>>> >This is a really interesting topic! I completely agree that we need
>>> >to get ahead of this.
>>> >
>>> >I would be really interested in learning of any experience other
>>> >Apache projects, such as httpd or Tomcat, have with these issues.
>>> >
>>> >---
>>> >E14 - typing on glass
>>> >
>>> >On May 10, 2011, at 6:31 AM, "Steve Loughran" wrote:
>>> >
>>> >> Back in Jan 2011, I started a discussion about how to define Apache
>>> >> Hadoop compatibility:
>>> >>
>>> >> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%3C4D46B6AD.2020802@apache.org%3E
>>> >>
>>> >> I am now reading the EMC Greenplum HD "Enterprise Ready" Apache
>>> >> Hadoop datasheet:
>>> >>
>>> >> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
>>> >>
>>> >> It claims that their implementations are 100% compatible, even
>>> >> though the Enterprise edition uses a C filesystem. It also claims
>>> >> that both their software releases contain "Certified Stacks",
>>> >> without defining what Certified means, or who does the
>>> >> certification -only that it is an improvement.
>>> >>
>>> >> I think we should revisit this issue before people with their own
>>> >> agendas define what compatibility with Apache Hadoop is for us.
>>> >>
>>> >> Licensing
>>> >>  -use of the Hadoop codebase must follow the Apache License:
>>> >>   http://www.apache.org/licenses/LICENSE-2.0
>>> >>  -plug-in components that are dynamically linked to (filesystems
>>> >>   and schedulers) don't appear to be derivative works, on my
>>> >>   reading of this.
>>> >>
>>> >> Naming
>>> >>  -this is something for branding@apache; they will have their
>>> >>   opinions. The key one is that the name "Apache Hadoop" must get
>>> >>   used, and it's important to make clear it is a derivative work.
>>> >>  -I don't think you can claim to have a Distribution/Fork/Version
>>> >>   of Apache Hadoop if you swap out big chunks of it for alternate
>>> >>   filesystems, MR engines, etc. Some description of this is needed:
>>> >>   "Supports the Apache Hadoop MapReduce engine on top of Filesystem
>>> >>   XYZ".
>>> >>
>>> >> Compatibility
>>> >>  -the definition of the Hadoop interfaces and classes is the Apache
>>> >>   source tree.
>>> >>  -the definition of the semantics of the Hadoop interfaces and
>>> >>   classes is the Apache source tree, including the test classes.
>>> >>  -the verification that the actual semantics of an Apache Hadoop
>>> >>   release match the expected semantics is that current and future
>>> >>   tests pass.
>>> >>  -bug reports can highlight incompatibility with the expectations
>>> >>   of community users, and once incorporated into tests they form
>>> >>   part of the compatibility testing.
>>> >>  -vendors can claim and even certify their derivative works as
>>> >>   compatible with other versions of their derivative works, but
>>> >>   cannot claim compatibility with Apache Hadoop unless their code
>>> >>   passes the tests and is consistent with the bug reports marked as
>>> >>   "by design". Perhaps we should have tests that verify each of
>>> >>   these "by design" bug reports, to make them more formal.
>>> >>
>>> >> Certification
>>> >>  -I have no idea what this means in EMC's case; they just say
>>> >>   "Certified".
>>> >>  -as we don't do any certification ourselves, it would seem
>>> >>   impossible for us to certify that any derivative work is
>>> >>   compatible.
>>> >>  -it may be best to state that nobody can certify their derivative
>>> >>   as "compatible with Apache Hadoop" unless it passes all current
>>> >>   test suites.
>>> >>  -and require that anyone who declares compatibility define what
>>> >>   they mean by this.
>>> >>
>>> >> This is a good argument for getting more functional tests out there
>>> >> -whoever has more functional tests needs to get them into a test
>>> >> module that can be used to test real deployments.
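Re Milind's HCK idea above: the existing FileSystemContractBaseTest in the
Hadoop test jar already shows the shape such a kit could take; a vendor
wires its implementation into a shared suite and either passes or doesn't.
A rough sketch (the vendorfs:// scheme is made up):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileSystemContractBaseTest;

    // a vendor plugs its filesystem into the stock contract tests;
    // "passes 100% of the HCK testsuite" could mean exactly this,
    // generalized across the DFS, MapReduce, etc. components
    public class TestVendorFSContract extends FileSystemContractBaseTest {
      @Override
      protected void setUp() throws Exception {
        // vendorfs:// is a hypothetical scheme registered by the vendor
        fs = FileSystem.get(URI.create("vendorfs://localhost/"),
                            new Configuration());
        super.setUp();
      }
    }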