From: Chris Nauroth
To: "hdfs-dev@hadoop.apache.org"
CC: "common-dev@hadoop.apache.org"
Subject: Re: we need a fix: precommit failures correlate to hdfs patches
Date: Mon, 4 May 2015 21:23:32 +0000

If we suspect long run times are a potential root cause, then another thing
we could try is turning on parallel test execution. To do that, we'd add the
-Pparallel-tests argument and possibly tune -DtestsThreadCount=N. (The
default for N is 4.)

https://issues.apache.org/jira/browse/HADOOP-9287

This has given some of us significant speed-ups while running tests in our
dev environments. I haven't tried it in a while though, so we might surface
some test isolation problems, such as if 2 test suites tried to work in the
same directory for data.
We cleaned up a lot of issues like that before committing the
parallel-tests patches, but it's possible new problems have crept in.

--Chris Nauroth


On 5/3/15, 9:02 PM, "Sean Busbey" wrote:

>The patch artifact directories in the mainline hadoop jenkins jobs are
>outside of the workspace. I'm not sure what, if anything, jenkins
>guarantees about files outside of the main workspace.
>
>They all write to ${WORKSPACE}/../patchProcess, which will probably
>collide if multiple runs happen on the same machine. They also all
>blindly move that directory at the end of the run.
>
>On Sun, May 3, 2015 at 3:02 PM, Allen Wittenauer wrote:
>
>>
>> So, as some may have noticed, I slammed the Jenkins servers over
>> the weekend to get some recent patch test runs in JIRA for the bug bash
>> this week. I've had a suspicion for a while now that either the long
>> run times of the hadoop-hdfs module unit tests (typically 2+ hours) or
>> the hdfs tests themselves were related to the patch process directory
>> getting removed out from underneath test-patch.
>>
>> To test the hypothesis, I submitted all of the non-HDFS patches so
>> that they were first in the queue. Let them run for a very long time.
>> Jenkins bounced back and forth between YARN, MR, and HADOOP. No issues
>> encountered. Added HDFS patches into the mix. BOOM. The dreaded "The
>> patch artifact directory has been removed!" started to appear here and
>> there. This seems to provide some evidence that, yes, hdfs unit tests
>> are directly or indirectly related to the failures.
>>
>> IMO, I think we need to take a serious look at:
>>
>> * splitting up the hadoop-hdfs module into multiple modules to
>>   reduce unit test run times
>> * checking to see if the pre commit hooks in hdfs are different
>>   than the rest (I do know that the YARN bits are different and appear
>>   to have some bugs as well)
>> * increasing the timeout for jenkins job runs
>>
>> FWIW, I've also found some minor things here and there with the
>> rewritten test-patch.sh. JIRAs have been filed. One critical, one
>> major and a handful of minor things.
>
>
>--
>Sean
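
To make the ${WORKSPACE}/../patchProcess collision described in the quoted
message concrete, here is a minimal Python sketch. It assumes the usual
Jenkins layout in which job workspaces sit side by side under one parent
directory. The workspace paths, job names, build numbers, and the per-build
naming scheme are all hypothetical and chosen only for illustration; this is
not how test-patch.sh is implemented. BUILD_TAG is a standard Jenkins
environment variable, and keying the directory on it is just one possible
way to keep runs apart.

```python
import posixpath


def shared_artifact_dir(workspace: str) -> str:
    """Mirror ${WORKSPACE}/../patchProcess: a sibling of the workspace."""
    return posixpath.normpath(posixpath.join(workspace, "..", "patchProcess"))


def per_build_artifact_dir(workspace: str, build_tag: str) -> str:
    """Hypothetical alternative: stay inside the workspace, keyed by build."""
    return posixpath.join(workspace, "patchProcess-" + build_tag)


# Hypothetical workspaces for two precommit jobs on the same machine.
hdfs_ws = "/home/jenkins/workspace/PreCommit-HDFS-Build"
yarn_ws = "/home/jenkins/workspace/PreCommit-YARN-Build"

# Both jobs resolve to the same directory, so one run can move it away
# while the other run is still writing to it.
print(shared_artifact_dir(hdfs_ws))
print(shared_artifact_dir(hdfs_ws) == shared_artifact_dir(yarn_ws))

# Keyed by workspace and build tag, the two directories no longer overlap.
print(per_build_artifact_dir(hdfs_ws, "jenkins-PreCommit-HDFS-Build-1"))
print(per_build_artifact_dir(yarn_ws, "jenkins-PreCommit-YARN-Build-1"))
```

Keeping the directory inside the workspace also sidesteps the open question
of what Jenkins guarantees about files outside of it.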