Mailing-List: contact hdfs-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-dev@hadoop.apache.org
Received-SPF: pass (athena.apache.org: message received from 54.191.145.13
 which is an MX secondary for hdfs-dev@hadoop.apache.org)
Subject: Re: we need a fix: precommit failures correlate to hdfs patches
MIME-Version: 1.0
From: Chris Nauroth <cnauroth@hortonworks.com>
To: "hdfs-dev@hadoop.apache.org" <hdfs-dev@hadoop.apache.org>
CC: "common-dev@hadoop.apache.org" <common-dev@hadoop.apache.org>
Thread-Topic: we need a fix: precommit failures correlate to hdfs patches
Thread-Index: AQHQhdwcP81PAKN/hUWoZfj/bBiMrp1rqAiAgACtYIA=
Date: Mon, 4 May 2015 21:23:32 +0000
Message-ID: <D16D2F8E.2147D%cnauroth@hortonworks.com>
References: <FC4C777E-6E02-4523-AB82-1F623DAF41F2@altiscale.com>
 <CAGHyZ6KWy8+Qd9stB1=G1+6Y_UcAuP1JmG2p4t5XNMq7x9Xvsw@mail.gmail.com>
In-Reply-To: 
 <CAGHyZ6KWy8+Qd9stB1=G1+6Y_UcAuP1JmG2p4t5XNMq7x9Xvsw@mail.gmail.com>
Accept-Language: en-US
Content-Language: en-US
Content-Type: text/plain; charset="iso-8859-1"
Content-ID: <B18E770DCA064745B6430C5725E0A871@exch080.serverpod.net>
Content-Transfer-Encoding: quoted-printable

If we suspect long run times are a potential root cause, then another
thing we could try is turning on parallel test execution.  To do that,
we'd add the -Pparallel-tests argument and possibly tune
-DtestsThreadCount=N.  (The default for N is 4.)

https://issues.apache.org/jira/browse/HADOOP-9287
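
As a rough sketch of what that invocation could look like: the profile
and property names are the ones mentioned above, while the helper
function name and the "mvn test" goal are assumptions made only for
illustration, not what the Jenkins job actually runs.

```shell
#!/usr/bin/env bash
# Build the Maven command line for a parallel test run.
# -Pparallel-tests and -DtestsThreadCount are the options named above;
# parallel_test_cmd and the "mvn test" goal are hypothetical.
parallel_test_cmd() {
  local threads="${1:-4}"   # fall back to 4, the default N mentioned above
  echo "mvn test -Pparallel-tests -DtestsThreadCount=${threads}"
}

# Print the command for 8 threads instead of running it, so the sketch
# works on a machine without Maven or a Hadoop checkout.
parallel_test_cmd 8
```

Printing the command rather than executing it keeps the sketch harmless;
the right thread count depends on the cores and memory of the build host.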

This has given some of us significant speed-ups while running tests in our
dev environments.  I haven't tried it in a while though, so we might
surface some test isolation problems, such as if 2 test suites tried to
work in the same directory for data.  We cleaned up a lot of issues like
that before committing the parallel-tests patches, but it's possible new
problems have crept in.
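
To illustrate the kind of isolation involved, here is a minimal sketch
of handing each suite its own scratch directory.  It is written in shell
for brevity and is not how the Hadoop tests themselves are set up; the
function name and suite names are hypothetical.

```shell
#!/usr/bin/env bash
# Give each test suite its own scratch directory so that two suites
# running at the same time cannot overwrite each other's data.
# suite_data_dir and the suite names below are hypothetical.
suite_data_dir() {
  local suite="$1"
  # mktemp -d creates a new, uniquely named directory on every call.
  mktemp -d "${TMPDIR:-/tmp}/${suite}.XXXXXX"
}

dir_a="$(suite_data_dir TestSuiteA)"
dir_b="$(suite_data_dir TestSuiteB)"

# Remove the scratch directories when the script exits.
trap 'rm -rf "${dir_a}" "${dir_b}"' EXIT

echo "suite A data: ${dir_a}"
echo "suite B data: ${dir_b}"
```

Even two runs of the same suite get different directories this way,
which is the property a shared, fixed data path lacks.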

--Chris Nauroth


On 5/3/15, 9:02 PM, "Sean Busbey" <busbey@cloudera.com> wrote:

>The patch artifact directory in the mainline Hadoop Jenkins jobs is
>outside of the workspace. I'm not sure what, if anything, Jenkins
>guarantees about files outside the main workspace.
>
>They all write to ${WORKSPACE}/../patchProcess, which will probably
>collide if multiple runs happen on the same machine. They also all
>blindly move that directory at the end of the run.
>
>On Sun, May 3, 2015 at 3:02 PM, Allen Wittenauer <aw@altiscale.com> wrote:
>
>>
>>         So, as some may have noticed, I slammed the Jenkins servers over
>> the weekend to get some recent patch test runs in JIRA for the bug bash
>> this week.  I've had a suspicion for a while now that either the long
>> run times of the hadoop-hdfs module unit tests (typically 2+ hours) or
>> the hdfs tests themselves were related to the patch process directory
>> getting removed out from underneath test-patch.
>>
>>         To test the hypothesis, I submitted all of the non-HDFS patches
>> so that they were first in the queue.  Let them run for a very long time.
>> Jenkins bounced back and forth between YARN, MR, and HADOOP.  No issues
>> encountered.  Added HDFS patches into the mix. BOOM. The dreaded "The
>> patch artifact directory has been removed!" started to appear here and
>> there.  This seems to provide some evidence that, yes, hdfs unit tests
>> are directly or indirectly related to the failures.
>>
>>         IMO, I think we need to take a serious look at:
>>
>>         * splitting up the hadoop-hdfs module into multiple modules to
>> reduce unit test run times
>>         * checking to see if the pre commit hooks in hdfs are different
>> than the rest (I do know that the YARN bits are different and appear to
>> have some bugs as well)
>>         * increasing the timeout for jenkins job runs
>>
>>         FWIW, I've also found some minor things here and there with the
>> rewritten test-patch.sh.  JIRAs have been filed.  One critical, one
>> major and a handful of minor things.
>
>
>
>
>-- 
>Sean