Subject: Re: [DISCUSS] project for pre-commit patch testing (was Re: upstream jenkins build broken?)
From: Tsuyoshi Ozawa
To: "common-dev@hadoop.apache.org"
Cc: dev
Date: Wed, 17 Jun 2015 00:51:48 +0900

+1 on the idea. It would be great if tests covering dependency management,
multiple branches, and distributed environments could be done in the
project. One discussion point is how Hadoop will depend on Yetus,
including the development cycles. It's a good time to rethink what can be
done to make Hadoop better.

Thanks,
- Tsuyoshi

On Tue, Jun 16, 2015 at 8:47 AM, Sean Busbey wrote:
> Oof. I had meant to push on this again but life got in the way and now
> the June board meeting is upon us. Sorry everyone. In the event that this
> ends up contentious, hopefully one of the copied communities can give us
> a branch to work in.
>
> I know everyone is busy, so here's the short version of this email: I'd
> like to move some of the code currently in Hadoop (test-patch) into a new
> TLP focused on QA tooling. I'm not sure what the best format for priming
> this conversation is. ORC filled in the incubator project proposal
> template, but I'm not sure how much that confused the issue. So to start,
> I'll just write what I'm hoping we can accomplish in general terms here.
>
> All software development projects that are community based (that is,
> accepting outside contributions) face a common QA problem for vetting
> incoming contributions. Hadoop is fortunate enough to be sufficiently
> popular that the weight of the problem drove tool development (i.e.
> test-patch). That tool is generalizable enough that a bunch of other TLPs
> have adopted their own forks. Unfortunately, in most projects this kind
> of QA work is an enabler rather than a primary concern, so the tooling is
> often worked on ad hoc and few improvements are shared across projects.
> Since the tooling itself is never a primary concern, any progress made is
> rarely reused outside of ASF projects.
>
> Over the last couple months a few of us have been working on generalizing
> the tooling present in the Hadoop code base (because it was the most
> mature out of all those in the various projects) and it's reached a point
> where we think we can start bringing on other downstream users. This
> means we need to start establishing things like a release cadence and to
> grow the new contributors we have to handle more project responsibility.
> Personally, I think that means it's time to move out from under Hadoop to
> drive things as our own community. Eventually, I hope the community can
> help draw in a group of folks traditionally underrepresented in ASF
> projects, namely QA and operations folks.
>
> I think test-patch by itself has enough scope to justify a project.
> Having a solid set of build tools that are customizable to fit the norms
> of different software communities is a bunch of work. Making it work well
> in both the context of automated test systems like Jenkins and for
> individual developers is even more work. We could easily also take over
> maintenance of things like shelldocs, since test-patch is the primary
> consumer of that currently but it's generally useful tooling.
>
> In addition to test-patch, I think the proposed project has some future
> growth potential. Given some adoption of test-patch to prove utility, the
> project could build on the ties it makes to start building tools to help
> projects do their own longer-run testing. Note that I'm talking about the
> tools to build QA processes and not a particular set of tested
> components. Specifically, I think the ChaosMonkey work that's in HBase
> should be generalizable as a fault injection framework (either based on
> that code or something like it); see the sketch below. Doing this for
> arbitrary software is obviously very difficult, and a part of easing that
> will be to make (and then favor) tooling to allow projects to have
> operational glue that looks the same. Namely, the shell work that's been
> done in hadoop-functions.sh would be a great foundational layer that
> could bring good daemon handling practices to a whole slew of software
> projects. In the event that these frameworks and tools get adopted by
> parts of the Hadoop ecosystem, that could make the job of i.e. Bigtop
> substantially easier.
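>
> To make the fault injection idea concrete, here's a purely hypothetical
> sketch (this is not the HBase ChaosMonkey API, and every name in it is
> invented) of the sort of project-agnostic hook I have in mind:
>
>     // Hypothetical: one injectable fault, usable by any project.
>     public interface FaultAction {
>       /** Inject the fault, e.g. kill a daemon or partition the network. */
>       void perform() throws Exception;
>
>       /** Undo the fault so later actions start from a known-good state. */
>       void restore() throws Exception;
>     }
>
> The operational glue is what would make such actions portable: if every
> project starts and stops daemons the same way, a generic "kill a daemon"
> action can work everywhere.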
>
> I've reached out to a few folks who have been involved in the current
> test-patch work or expressed interest in helping out on getting it used
> in other projects. Right now, the proposed PMC would be (alphabetical by
> last name):
>
> * Andrew Bayer (ASF member, incubator pmc, bigtop pmc, flume pmc, jclouds
> pmc, sqoop pmc, all around Jenkins expert)
> * Sean Busbey (ASF member, accumulo pmc, hbase pmc)
> * Nick Dimiduk (hbase pmc, phoenix pmc)
> * Chris Nauroth (ASF member, incubator pmc, hadoop pmc)
> * Andrew Purtell (ASF member, incubator pmc, bigtop pmc, hbase pmc,
> phoenix pmc)
> * Allen Wittenauer (hadoop committer)
>
> That PMC gives us several ASF members and a bunch of folks familiar with
> the ASF. Combined with the code already existing in Apache spaces, I
> think that gives us sufficient justification for a direct board proposal.
>
> The planned project name is "Apache Yetus". It's an archaic genus of sea
> snail, and most of our project will be focused on shell scripts.
>
> N.b.: this does not mean that the Hadoop community would _have_ to rely
> on the new TLP, but I hope that once we have a release that can be
> evaluated there'd be enough benefit to strongly encourage it.
>
> This has mostly been focused on scope and community issues, and I'd love
> to talk through any feedback on that. Additionally, are there any other
> points folks want to make sure are covered before we have a resolution?
>
> On Sat, Jun 6, 2015 at 10:43 PM, Sean Busbey wrote:
>
>> Sorry for the resend. I figured this deserves a [DISCUSS] flag.
>>
>> On Sat, Jun 6, 2015 at 10:39 PM, Sean Busbey wrote:
>>
>>> Hi Folks!
>>>
>>> After working on test-patch with other folks for the last few months, I
>>> think we've reached the point where we can make the fastest progress
>>> towards the goal of a general use pre-commit patch tester by spinning
>>> things into a project focused on just that. I think we have a mature
>>> enough code base and a sufficient fledgling community, so I'm going to
>>> put together a tlp proposal.
>>>
>>> Thanks for the feedback thus far from use within Hadoop. I hope we can
>>> continue to make things more useful.
>>>
>>> -Sean
>>>
>>> On Wed, Mar 11, 2015 at 5:16 PM, Sean Busbey wrote:
>>>
>>>> HBase's dev-support folder is where the scripts and support files
>>>> live. We've only recently started adding anything to the maven builds
>>>> that's specific to jenkins[1]; so far it's diagnostic stuff, but
>>>> that's where I'd add in more if we ran into the same permissions
>>>> problems y'all are having.
>>>>
>>>> There's also our precommit job itself, though it isn't large[2].
>>>> AFAIK, we don't properly back this up anywhere, we just notify each
>>>> other of changes on a particular mail thread[3].
>>>>
>>>> [1]: https://github.com/apache/hbase/blob/master/pom.xml#L1687
>>>> [2]: https://builds.apache.org/job/PreCommit-HBASE-Build/ (they're all
>>>> red because I just finished fixing "mvn site" running out of permgen)
>>>> [3]: http://s.apache.org/NT0
>>>>
>>>> On Wed, Mar 11, 2015 at 4:51 PM, Chris Nauroth wrote:
>>>>
>>>>> Sure, thanks Sean! Do we just look in the dev-support folder in the
>>>>> HBase repo? Is there any additional context we need to be aware of?
>>>>>
>>>>> Chris Nauroth
>>>>> Hortonworks
>>>>> http://hortonworks.com/
>>>>>
>>>>> On 3/11/15, 2:44 PM, "Sean Busbey" wrote:
>>>>>
>>>>> >+dev@hbase
>>>>> >
>>>>> >HBase has recently been cleaning up our precommit jenkins jobs to
>>>>> >make them more robust. From what I can tell our stuff started off
>>>>> >as an earlier version of what Hadoop uses for testing.
>>>>> >
>>>>> >Folks on either side open to an experiment of combining our
>>>>> >precommit check tooling? In principle we should be looking for the
>>>>> >same kinds of things.
>>>>> >
>>>>> >Naturally we'll still need different jenkins jobs to handle
>>>>> >different resource needs and we'd need to figure out where stuff
>>>>> >eventually lives, but that could come later.
>>>>> >
>>>>> >On Wed, Mar 11, 2015 at 4:34 PM, Chris Nauroth
>>>>> ><cnauroth@hortonworks.com> wrote:
>>>>> >
>>>>> >> The only thing I'm aware of is the failOnError option:
>>>>> >>
>>>>> >> http://maven.apache.org/plugins/maven-clean-plugin/examples/ignoring-errors.html
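>>>>> >>
>>>>> >> For anyone who hasn't read that page, the option boils down to a
>>>>> >> single configuration flag on the clean plugin, roughly:
>>>>> >>
>>>>> >>   <plugin>
>>>>> >>     <artifactId>maven-clean-plugin</artifactId>
>>>>> >>     <configuration>
>>>>> >>       <!-- keep going even if a directory can't be deleted -->
>>>>> >>       <failOnError>false</failOnError>
>>>>> >>     </configuration>
>>>>> >>   </plugin>
>>>>> >>
>>>>> >> (It can also be set per invocation with
>>>>> >> -Dmaven.clean.failOnError=false.)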
>>>>> >>
>>>>> >> I prefer that we don't disable this, because ignoring different
>>>>> >> kinds of failures could leave our build directories in an
>>>>> >> indeterminate state. For example, we could end up with an old
>>>>> >> class file on the classpath for test runs that was supposedly
>>>>> >> deleted.
>>>>> >>
>>>>> >> I think it's worth exploring Eddy's suggestion to try simulating
>>>>> >> failure by placing a file where the code expects to see a
>>>>> >> directory. That might even let us enable some of these tests that
>>>>> >> are skipped on Windows, because Windows allows access for the
>>>>> >> owner even after permissions have been stripped.
>>>>> >>
>>>>> >> Chris Nauroth
>>>>> >> Hortonworks
>>>>> >> http://hortonworks.com/
>>>>> >>
>>>>> >> On 3/11/15, 2:10 PM, "Colin McCabe" wrote:
>>>>> >>
>>>>> >> >Is there a maven plugin or setting we can use to simply remove
>>>>> >> >directories that have no executable permissions on them? Clearly
>>>>> >> >we have the permission to do this from a technical point of view
>>>>> >> >(since we created the directories as the jenkins user), it's
>>>>> >> >simply that the code refuses to do it.
>>>>> >> >
>>>>> >> >Otherwise I guess we can just fix those tests...
>>>>> >> >
>>>>> >> >Colin
>>>>> >> >
>>>>> >> >On Tue, Mar 10, 2015 at 2:43 PM, Lei Xu wrote:
>>>>> >> >> Thanks a lot for looking into HDFS-7722, Chris.
>>>>> >> >>
>>>>> >> >> In HDFS-7722:
>>>>> >> >> The TestDataNodeVolumeFailureXXX tests reset data dir
>>>>> >> >> permissions in TearDown().
>>>>> >> >> TestDataNodeHotSwapVolumes resets permissions in a finally
>>>>> >> >> clause.
>>>>> >> >>
>>>>> >> >> Also I ran mvn test several times on my machine and all tests
>>>>> >> >> passed.
>>>>> >> >>
>>>>> >> >> However, note DiskChecker#checkDirAccess():
>>>>> >> >>
>>>>> >> >> private static void checkDirAccess(File dir) throws DiskErrorException {
>>>>> >> >>   if (!dir.isDirectory()) {
>>>>> >> >>     throw new DiskErrorException("Not a directory: " + dir.toString());
>>>>> >> >>   }
>>>>> >> >>
>>>>> >> >>   checkAccessByFileMethods(dir);
>>>>> >> >> }
>>>>> >> >>
>>>>> >> >> One potentially safer alternative is replacing the data dir
>>>>> >> >> with a regular file to simulate disk failures.
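>>>>> >> >>
>>>>> >> >> A rough sketch of what I mean (the helper name is invented;
>>>>> >> >> this is an illustration, not the HDFS-7722 patch):
>>>>> >> >>
>>>>> >> >> import java.io.File;
>>>>> >> >> import java.io.IOException;
>>>>> >> >> import org.apache.hadoop.fs.FileUtil;
>>>>> >> >>
>>>>> >> >> // Swap the directory for a plain file so checkDirAccess() sees
>>>>> >> >> // !dir.isDirectory() and throws DiskErrorException, with no
>>>>> >> >> // permission bits for a dying JUnit process to leave behind.
>>>>> >> >> static void simulateVolumeFailure(File dataDir) throws IOException {
>>>>> >> >>   FileUtil.fullyDelete(dataDir);      // remove the real directory
>>>>> >> >>   if (!dataDir.createNewFile()) {     // leave a regular file in its place
>>>>> >> >>     throw new IOException("Could not create placeholder: " + dataDir);
>>>>> >> >>   }
>>>>> >> >> }
>>>>> >> >>
>>>>> >> >> Cleanup is then an ordinary delete() in tearDown(), and "mvn
>>>>> >> >> clean" never trips over an unreadable directory.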
>>>>> >> >>
>>>>> >> >> On Tue, Mar 10, 2015 at 2:19 PM, Chris Nauroth wrote:
>>>>> >> >>> TestDataNodeHotSwapVolumes, TestDataNodeVolumeFailure,
>>>>> >> >>> TestDataNodeVolumeFailureReporting, and
>>>>> >> >>> TestDataNodeVolumeFailureToleration all remove executable
>>>>> >> >>> permissions from directories like the one Colin mentioned to
>>>>> >> >>> simulate disk failures at data nodes. I reviewed the code for
>>>>> >> >>> all of those, and they all appear to be doing the necessary
>>>>> >> >>> work to restore executable permissions at the end of the
>>>>> >> >>> test. The only recent uncommitted patch I've seen that makes
>>>>> >> >>> changes in these test suites is HDFS-7722. That patch still
>>>>> >> >>> looks fine though. I don't know if there are other
>>>>> >> >>> uncommitted patches that changed these test suites.
>>>>> >> >>>
>>>>> >> >>> I suppose it's also possible that the JUnit process
>>>>> >> >>> unexpectedly died after removing executable permissions but
>>>>> >> >>> before restoring them. That always would have been a weakness
>>>>> >> >>> of these test suites, regardless of any recent changes.
>>>>> >> >>>
>>>>> >> >>> Chris Nauroth
>>>>> >> >>> Hortonworks
>>>>> >> >>> http://hortonworks.com/
>>>>> >> >>>
>>>>> >> >>> On 3/10/15, 1:47 PM, "Aaron T. Myers" wrote:
>>>>> >> >>>
>>>>> >> >>>>Hey Colin,
>>>>> >> >>>>
>>>>> >> >>>>I asked Andrew Bayer, who works with Apache Infra, what's
>>>>> >> >>>>going on with these boxes. He took a look and concluded that
>>>>> >> >>>>some perms are being set in those directories by our unit
>>>>> >> >>>>tests which are precluding those files from getting deleted.
>>>>> >> >>>>He's going to clean up the boxes for us, but we should expect
>>>>> >> >>>>this to keep happening until we can fix the test in question
>>>>> >> >>>>to properly clean up after itself.
>>>>> >> >>>>
>>>>> >> >>>>To help narrow down which commit it was that started this,
>>>>> >> >>>>Andrew sent me this info:
>>>>> >> >>>>
>>>>> >> >>>>"/home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3/
>>>>> >> >>>>has 500 perms, so I'm guessing that's the problem. Been that
>>>>> >> >>>>way since 9:32 UTC on March 5th."
>>>>> >> >>>>
>>>>> >> >>>>--
>>>>> >> >>>>Aaron T. Myers
>>>>> >> >>>>Software Engineer, Cloudera
>>>>> >> >>>>
>>>>> >> >>>>On Tue, Mar 10, 2015 at 1:24 PM, Colin P. McCabe wrote:
>>>>> >> >>>>
>>>>> >> >>>>> Hi all,
>>>>> >> >>>>>
>>>>> >> >>>>> A very quick (and not thorough) survey shows that I can't
>>>>> >> >>>>> find any jenkins jobs that succeeded from the last 24
>>>>> >> >>>>> hours. Most of them seem to be failing with some variant of
>>>>> >> >>>>> this message:
>>>>> >> >>>>>
>>>>> >> >>>>> [ERROR] Failed to execute goal
>>>>> >> >>>>> org.apache.maven.plugins:maven-clean-plugin:2.5:clean
>>>>> >> >>>>> (default-clean) on project hadoop-hdfs: Failed to clean
>>>>> >> >>>>> project: Failed to delete
>>>>> >> >>>>> /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3
>>>>> >> >>>>> -> [Help 1]
>>>>> >> >>>>>
>>>>> >> >>>>> Any ideas how this happened? Bad disk, unit test setting
>>>>> >> >>>>> wrong permissions?
>>>>> >> >>>>>
>>>>> >> >>>>> Colin
>>>>> >> >>
>>>>> >> >> --
>>>>> >> >> Lei (Eddy) Xu
>>>>> >> >> Software Engineer, Cloudera
>>>>> >
>>>>> >--
>>>>> >Sean
>>>>
>>>> --
>>>> Sean
>>>
>>> --
>>> Sean
>>
>> --
>> Sean
>
> --
> Sean