hadoop-hdfs-dev mailing list archives

From Lei Xu <...@cloudera.com>
Subject Re: upstream jenkins build broken?
Date Fri, 13 Mar 2015 20:48:10 GMT
I filed HDFS-7917 to change the way we simulate disk failures.

But I think we still need the infrastructure folks to help with the jenkins
scripts to clean up the directories left behind today.
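
For reference, a rough sketch of that direction (illustrative names only,
not the actual patch; assumes the test's existing FileUtil and JUnit
imports): replace the data dir with a regular file, so that
DiskChecker#checkDirAccess() fails its isDirectory() check, and cleanup
becomes a plain delete with no permissions to restore.

  private void simulateVolumeFailure(File dataDir) throws IOException {
    FileUtil.fullyDelete(dataDir);        // remove the real directory
    assertTrue(dataDir.createNewFile());  // a plain file now occupies its path
  }

  private void restoreVolume(File dataDir) throws IOException {
    assertTrue(dataDir.delete());         // no chmod needed in teardown
    assertTrue(dataDir.mkdirs());
  }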

On Fri, Mar 13, 2015 at 1:38 PM, Mai Haohui <ricetons@gmail.com> wrote:
> Any updates on this issue? It seems that all HDFS jenkins builds are
> still failing.
>
> Regards,
> Haohui
>
> On Thu, Mar 12, 2015 at 12:53 AM, Vinayakumar B <vinayakumarb@apache.org> wrote:
>> I think the problem started from here.
>>
>> https://builds.apache.org/job/PreCommit-HDFS-Build/9828/testReport/junit/org.apache.hadoop.hdfs.server.datanode/TestDataNodeVolumeFailure/testUnderReplicationAfterVolFailure/
>>
>> As Chris mentioned, TestDataNodeVolumeFailure changes the permissions.
>> But with this patch, ReplicationMonitor hit an NPE and got a terminate
>> signal, due to which MiniDFSCluster.shutdown() threw an exception.
>>
>> But TestDataNodeVolumeFailure#tearDown() restores those permissions only
>> after shutting down the cluster. So in this case, IMO, the permissions
>> were never restored.
>>
>>
>>   @After
>>   public void tearDown() throws Exception {
>>     if(data_fail != null) {
>>       FileUtil.setWritable(data_fail, true);
>>     }
>>     if(failedDir != null) {
>>       FileUtil.setWritable(failedDir, true);
>>     }
>>     if(cluster != null) {
>>       cluster.shutdown();
>>     }
>>     for (int i = 0; i < 3; i++) {
>>       FileUtil.setExecutable(new File(dataDir, "data"+(2*i+1)), true);
>>       FileUtil.setExecutable(new File(dataDir, "data"+(2*i+2)), true);
>>     }
>>   }
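>>
>> (For illustration only, one way to make the restore robust would be to
>> move it into a finally block so that a failing shutdown cannot skip it.
>> A sketch, not a tested patch:)
>>
>>   @After
>>   public void tearDown() throws Exception {
>>     try {
>>       if (cluster != null) {
>>         cluster.shutdown();
>>       }
>>     } finally {
>>       // Restore permissions even if shutdown() throws.
>>       if (data_fail != null) {
>>         FileUtil.setWritable(data_fail, true);
>>       }
>>       if (failedDir != null) {
>>         FileUtil.setWritable(failedDir, true);
>>       }
>>       for (int i = 0; i < 3; i++) {
>>         FileUtil.setExecutable(new File(dataDir, "data"+(2*i+1)), true);
>>         FileUtil.setExecutable(new File(dataDir, "data"+(2*i+2)), true);
>>       }
>>     }
>>   }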
>>
>>
>> Regards,
>> Vinay
>>
>> On Thu, Mar 12, 2015 at 12:35 PM, Vinayakumar B <vinayakumarb@apache.org>
>> wrote:
>>
>>> Looking at the history of these kinds of builds, all of them failed on
>>> node H9.
>>>
>>> I think some uncommitted patch or other created the problem and left it
>>> there.
>>>
>>>
>>> Regards,
>>> Vinay
>>>
>>> On Thu, Mar 12, 2015 at 6:16 AM, Sean Busbey <busbey@cloudera.com> wrote:
>>>
>>>> You could rely on a destructive git clean call instead of maven to do the
>>>> directory removal.
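>>>>
>>>> For example (untested; flags as documented in git-clean), something
>>>> like:
>>>>
>>>>   git clean -fdx hadoop-hdfs-project/hadoop-hdfs/target
>>>>
>>>> where -d removes untracked directories and -x also removes ignored
>>>> files.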
>>>>
>>>> --
>>>> Sean
>>>> On Mar 11, 2015 4:11 PM, "Colin McCabe" <cmccabe@alumni.cmu.edu> wrote:
>>>>
>>>> > Is there a maven plugin or setting we can use to simply remove
>>>> > directories that have no executable permissions on them?  Clearly we
>>>> > have the permission to do this from a technical point of view (since
>>>> > we created the directories as the jenkins user), it's simply that the
>>>> > code refuses to do it.
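>>>> >
>>>> > (One untested possibility: bind maven-antrun-plugin to the pre-clean
>>>> > phase and chmod everything under target/ back to u+rwx before the
>>>> > clean plugin runs. A sketch only; I haven't verified that Ant's chmod
>>>> > can traverse directories that are missing the execute bit:)
>>>> >
>>>> >   <plugin>
>>>> >     <groupId>org.apache.maven.plugins</groupId>
>>>> >     <artifactId>maven-antrun-plugin</artifactId>
>>>> >     <executions>
>>>> >       <execution>
>>>> >         <id>restore-perms</id>
>>>> >         <phase>pre-clean</phase>
>>>> >         <goals><goal>run</goal></goals>
>>>> >         <configuration>
>>>> >           <target>
>>>> >             <chmod perm="u+rwx" type="both">
>>>> >               <fileset dir="${project.build.directory}"/>
>>>> >             </chmod>
>>>> >           </target>
>>>> >         </configuration>
>>>> >       </execution>
>>>> >     </executions>
>>>> >   </plugin>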
>>>> >
>>>> > Otherwise I guess we can just fix those tests...
>>>> >
>>>> > Colin
>>>> >
>>>> > On Tue, Mar 10, 2015 at 2:43 PM, Lei Xu <lei@cloudera.com> wrote:
>>>> > > Thanks a lot for looking into HDFS-7722, Chris.
>>>> > >
>>>> > > In HDFS-7722:
>>>> > > TestDataNodeVolumeFailureXXX tests reset data dir permissions in
>>>> > > tearDown().
>>>> > > TestDataNodeHotSwapVolumes resets permissions in a finally clause.
>>>> > >
>>>> > > Also I ran mvn test several times on my machine and all tests passed.
>>>> > >
>>>> > > However, since DiskChecker#checkDirAccess() starts with this check:
>>>> > >
>>>> > > private static void checkDirAccess(File dir) throws DiskErrorException {
>>>> > >   if (!dir.isDirectory()) {
>>>> > >     throw new DiskErrorException("Not a directory: "
>>>> > >                                  + dir.toString());
>>>> > >   }
>>>> > >
>>>> > >   checkAccessByFileMethods(dir);
>>>> > > }
>>>> > >
>>>> > > one potentially safer alternative is replacing the data dir with a
>>>> > > regular file to simulate disk failures.
>>>> > >
>>>> > > On Tue, Mar 10, 2015 at 2:19 PM, Chris Nauroth <cnauroth@hortonworks.com> wrote:
>>>> > >> TestDataNodeHotSwapVolumes, TestDataNodeVolumeFailure,
>>>> > >> TestDataNodeVolumeFailureReporting, and
>>>> > >> TestDataNodeVolumeFailureToleration all remove executable permissions
>>>> > >> from directories like the one Colin mentioned to simulate disk failures
>>>> > >> at data nodes.  I reviewed the code for all of those, and they all
>>>> > >> appear to be doing the necessary work to restore executable permissions
>>>> > >> at the end of the test.  The only recent uncommitted patch I've seen
>>>> > >> that makes changes in these test suites is HDFS-7722.  That patch still
>>>> > >> looks fine though.  I don't know if there are other uncommitted patches
>>>> > >> that changed these test suites.
>>>> > >>
>>>> > >> I suppose it's also possible that the JUnit process unexpectedly died
>>>> > >> after removing executable permissions but before restoring them.  That
>>>> > >> always would have been a weakness of these test suites, regardless of
>>>> > >> any recent changes.
>>>> > >>
>>>> > >> Chris Nauroth
>>>> > >> Hortonworks
>>>> > >> http://hortonworks.com/
>>>> > >>
>>>> > >> On 3/10/15, 1:47 PM, "Aaron T. Myers" <atm@cloudera.com> wrote:
>>>> > >>
>>>> > >>>Hey Colin,
>>>> > >>>
>>>> > >>>I asked Andrew Bayer, who works with Apache Infra, what's going on
>>>> > >>>with these boxes. He took a look and concluded that some perms are
>>>> > >>>being set in those directories by our unit tests which are precluding
>>>> > >>>those files from getting deleted. He's going to clean up the boxes
>>>> > >>>for us, but we should expect this to keep happening until we can fix
>>>> > >>>the test in question to properly clean up after itself.
>>>> > >>>
>>>> > >>>To help narrow down which commit it was that started this, Andrew
>>>> > >>>sent me this info:
>>>> > >>>
>>>> > >>>"/home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3/
>>>> > >>>has 500 perms, so I'm guessing that's the problem. Been that way
>>>> > >>>since 9:32 UTC on March 5th."
>>>> > >>>
>>>> > >>>--
>>>> > >>>Aaron T. Myers
>>>> > >>>Software Engineer, Cloudera
>>>> > >>>
>>>> > >>>On Tue, Mar 10, 2015 at 1:24 PM, Colin P. McCabe <cmccabe@apache.org> wrote:
>>>> > >>>
>>>> > >>>> Hi all,
>>>> > >>>>
>>>> > >>>> A very quick (and not thorough) survey shows that I can't find any
>>>> > >>>> jenkins jobs that succeeded from the last 24 hours.  Most of them
>>>> > >>>> seem to be failing with some variant of this message:
>>>> > >>>>
>>>> > >>>> [ERROR] Failed to execute goal
>>>> > >>>> org.apache.maven.plugins:maven-clean-plugin:2.5:clean (default-clean)
>>>> > >>>> on project hadoop-hdfs: Failed to clean project: Failed to delete
>>>> > >>>> /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3
>>>> > >>>> -> [Help 1]
>>>> > >>>>
>>>> > >>>> Any ideas how this happened?  Bad disk, unit test setting wrong
>>>> > >>>> permissions?
>>>> > >>>>
>>>> > >>>> Colin
>>>> > >>>>
>>>> > >>
>>>> > >
>>>> > >
>>>> > >
>>>> > > --
>>>> > > Lei (Eddy) Xu
>>>> > > Software Engineer, Cloudera
>>>> >
>>>>
>>>
>>>



-- 
Lei (Eddy) Xu
Software Engineer, Cloudera
