hadoop-hdfs-dev mailing list archives

From Sean Busbey <bus...@cloudera.com>
Subject Re: upstream jenkins build broken?
Date Mon, 16 Mar 2015 20:52:00 GMT
I'm on it. HADOOP-11721

On Mon, Mar 16, 2015 at 3:44 PM, Haohui Mai <wheat9@apache.org> wrote:

> +1 for git clean.
>
> Colin, can you please get it in ASAP? Currently due to the jenkins
> issues, we cannot close the 2.7 blockers.
>
> Thanks,
> Haohui
>
>
>
> On Mon, Mar 16, 2015 at 11:54 AM, Colin P. McCabe <cmccabe@apache.org>
> wrote:
> > If all it takes is someone creating a test that makes a directory
> > without -x, this is going to happen over and over.
> >
> > Let's just fix the problem at the root by running "git clean -fqdx" in
> > our jenkins scripts.  If there are no objections I will add this in and
> > un-break the builds.
> >
> > best,
> > Colin
> >
> > On Fri, Mar 13, 2015 at 1:48 PM, Lei Xu <lei@cloudera.com> wrote:
> >> I filed HDFS-7917 to change the way to simulate disk failures.
> >>
> >> But I think we still need infrastructure folks to help with jenkins
> >> scripts to clean the dirs left today.
> >>
> >> On Fri, Mar 13, 2015 at 1:38 PM, Mai Haohui <ricetons@gmail.com> wrote:
> >>> Any updates on this issue? It seems that all HDFS jenkins builds are
> >>> still failing.
> >>>
> >>> Regards,
> >>> Haohui
> >>>
> >>> On Thu, Mar 12, 2015 at 12:53 AM, Vinayakumar B <vinayakumarb@apache.org> wrote:
> >>>> I think the problem started from here.
> >>>>
> >>>> https://builds.apache.org/job/PreCommit-HDFS-Build/9828/testReport/junit/org.apache.hadoop.hdfs.server.datanode/TestDataNodeVolumeFailure/testUnderReplicationAfterVolFailure/
> >>>>
> >>>> As Chris mentioned, TestDataNodeVolumeFailure changes the permissions.
> >>>> But with this patch, ReplicationMonitor hit an NPE and got a terminate
> >>>> signal, which caused MiniDFSCluster.shutdown() to throw an exception.
> >>>>
> >>>> But TestDataNodeVolumeFailure#tearDown() restores those permissions only
> >>>> after shutting down the cluster. So in this case, IMO, the permissions
> >>>> were never restored.
> >>>>
> >>>>
> >>>>   @After
> >>>>   public void tearDown() throws Exception {
> >>>>     if(data_fail != null) {
> >>>>       FileUtil.setWritable(data_fail, true);
> >>>>     }
> >>>>     if(failedDir != null) {
> >>>>       FileUtil.setWritable(failedDir, true);
> >>>>     }
> >>>>     if(cluster != null) {
> >>>>       cluster.shutdown();
> >>>>     }
> >>>>     for (int i = 0; i < 3; i++) {
> >>>>       FileUtil.setExecutable(new File(dataDir, "data"+(2*i+1)), true);
> >>>>       FileUtil.setExecutable(new File(dataDir, "data"+(2*i+2)), true);
> >>>>     }
> >>>>   }
> >>>>
> >>>>
> >>>> Regards,
> >>>> Vinay
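
A minimal sketch of the ordering issue described above, assuming the cleanup is arranged so that permission restoration runs even when shutdown throws. `PermissionRestoringTearDown` and `shutdownAndRestore` are hypothetical names used only for illustration; they are not part of Hadoop or of any patch in this thread, and the `AutoCloseable` stands in for `cluster.shutdown()`.

```java
import java.io.File;
import java.util.List;

/** Hypothetical helper (not Hadoop code): restores permissions even if shutdown fails. */
final class PermissionRestoringTearDown {
  private PermissionRestoringTearDown() {}

  /**
   * Runs the shutdown action, then restores owner write and execute permission
   * on every directory, whether or not the shutdown action threw.
   */
  static void shutdownAndRestore(AutoCloseable shutdown, List<File> dirs) throws Exception {
    try {
      if (shutdown != null) {
        shutdown.close(); // stand-in for cluster.shutdown()
      }
    } finally {
      for (File dir : dirs) {
        // Return values are ignored in this sketch; a real test might log failures.
        dir.setWritable(true);
        dir.setExecutable(true);
      }
    }
  }
}
```

Applied to the tearDown() quoted above, the equivalent change would presumably be to put the setExecutable loop in a finally block around cluster.shutdown(). This still does not help if the JVM dies before tearDown() runs.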
> >>>>
> >>>> On Thu, Mar 12, 2015 at 12:35 PM, Vinayakumar B <vinayakumarb@apache.org> wrote:
> >>>>
> >>>>> Looking at the history of these builds, all of the failures are on
> >>>>> node H9.
> >>>>>
> >>>>> I think some uncommitted patch created the problem and left it there.
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>> Vinay
> >>>>>
> >>>>> On Thu, Mar 12, 2015 at 6:16 AM, Sean Busbey <busbey@cloudera.com> wrote:
> >>>>>
> >>>>>> You could rely on a destructive git clean call instead of maven to do the
> >>>>>> directory removal.
> >>>>>>
> >>>>>> --
> >>>>>> Sean
> >>>>>> On Mar 11, 2015 4:11 PM, "Colin McCabe" <cmccabe@alumni.cmu.edu> wrote:
> >>>>>>
> >>>>>> > Is there a maven plugin or setting we can use to simply remove
> >>>>>> > directories that have no executable permissions on them?  Clearly we
> >>>>>> > have the permission to do this from a technical point of view (since
> >>>>>> > we created the directories as the jenkins user), it's simply that the
> >>>>>> > code refuses to do it.
> >>>>>> >
> >>>>>> > Otherwise I guess we can just fix those tests...
> >>>>>> >
> >>>>>> > Colin
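
One way to picture the cleanup being asked for, as a rough sketch rather than an existing maven plugin or setting: restore the owner's permissions on each directory before descending into it, then delete bottom-up. `ForceDelete` and `deleteRecursively` are hypothetical names, and the sketch assumes the process owns the directories, as the jenkins user does here.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;

/** Hypothetical cleanup helper (not a maven or Hadoop API). */
final class ForceDelete {
  private ForceDelete() {}

  /** Deletes path and everything below it, first making each directory usable by its owner. */
  static void deleteRecursively(Path path) throws IOException {
    // NOFOLLOW_LINKS: a symlink to a directory is removed as a link, not traversed.
    if (Files.isDirectory(path, LinkOption.NOFOLLOW_LINKS)) {
      File dir = path.toFile();
      // Restore owner read, write, and execute so the directory can be listed and emptied.
      dir.setReadable(true);
      dir.setWritable(true);
      dir.setExecutable(true);
      try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
        for (Path child : children) {
          deleteRecursively(child);
        }
      }
    }
    Files.deleteIfExists(path);
  }
}
```

The permissions are fixed before the directory is opened because a directory without read or execute permission cannot be listed, so a visitor that only reacts after opening it would never get the chance.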
> >>>>>> >
> >>>>>> > On Tue, Mar 10, 2015 at 2:43 PM, Lei Xu <lei@cloudera.com> wrote:
> >>>>>> > > Thanks a lot for looking into HDFS-7722, Chris.
> >>>>>> > >
> >>>>>> > > In HDFS-7722:
> >>>>>> > > TestDataNodeVolumeFailureXXX tests reset data dir permissions in
> >>>>>> > > TearDown().
> >>>>>> > > TestDataNodeHotSwapVolumes reset permissions in a finally clause.
> >>>>>> > >
> >>>>>> > > Also I ran mvn test several times on my machine and all tests passed.
> >>>>>> > >
> >>>>>> > > However, since in DiskChecker#checkDirAccess():
> >>>>>> > >
> >>>>>> > > private static void checkDirAccess(File dir) throws DiskErrorException {
> >>>>>> > >   if (!dir.isDirectory()) {
> >>>>>> > >     throw new DiskErrorException("Not a directory: "
> >>>>>> > >                                  + dir.toString());
> >>>>>> > >   }
> >>>>>> > >
> >>>>>> > >   checkAccessByFileMethods(dir);
> >>>>>> > > }
> >>>>>> > >
> >>>>>> > > One potentially safer alternative is replacing data dir with a regular
> >>>>>> > > file to simulate disk failures.
> >>>>>> > >
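
A minimal sketch of that alternative, under the assumption that the check fails as soon as `isDirectory()` returns false, as in the `checkDirAccess()` excerpt above. `SimulatedDiskFailure`, `inject`, `restore`, and the `.backup` suffix are hypothetical names chosen for illustration, not Hadoop test utilities.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical test helper (not Hadoop code): simulates a failed volume without chmod. */
final class SimulatedDiskFailure {
  private SimulatedDiskFailure() {}

  /** Moves dir aside and puts an empty regular file in its place; returns the moved directory. */
  static File inject(File dir) throws IOException {
    Path backup = dir.toPath().resolveSibling(dir.getName() + ".backup");
    Files.move(dir.toPath(), backup);
    Files.createFile(dir.toPath()); // dir.isDirectory() is now false
    return backup.toFile();
  }

  /** Removes the placeholder file and moves the original directory back. */
  static void restore(File dir, File backup) throws IOException {
    Files.deleteIfExists(dir.toPath());
    Files.move(backup.toPath(), dir.toPath());
  }
}
```

Because no permission bits change, a test process that dies between inject and restore leaves behind only ordinary files and directories, which a normal clean should be able to delete.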
> >>>>>> > > On Tue, Mar 10, 2015 at 2:19 PM, Chris Nauroth <cnauroth@hortonworks.com> wrote:
> >>>>>> > >> TestDataNodeHotSwapVolumes, TestDataNodeVolumeFailure,
> >>>>>> > >> TestDataNodeVolumeFailureReporting, and
> >>>>>> > >> TestDataNodeVolumeFailureToleration all remove executable permissions
> >>>>>> > >> from directories like the one Colin mentioned to simulate disk failures
> >>>>>> > >> at data nodes.  I reviewed the code for all of those, and they all appear
> >>>>>> > >> to be doing the necessary work to restore executable permissions at the
> >>>>>> > >> end of the test.  The only recent uncommitted patch I've seen that makes
> >>>>>> > >> changes in these test suites is HDFS-7722.  That patch still looks fine
> >>>>>> > >> though.  I don't know if there are other uncommitted patches that changed
> >>>>>> > >> these test suites.
> >>>>>> > >>
> >>>>>> > >> I suppose it's also possible that the JUnit process unexpectedly died
> >>>>>> > >> after removing executable permissions but before restoring them.  That
> >>>>>> > >> always would have been a weakness of these test suites, regardless of
> >>>>>> > >> any recent changes.
> >>>>>> > >>
> >>>>>> > >> Chris Nauroth
> >>>>>> > >> Hortonworks
> >>>>>> > >> http://hortonworks.com/
> >>>>>> > >>
> >>>>>> > >>
> >>>>>> > >>
> >>>>>> > >>
> >>>>>> > >>
> >>>>>> > >>
> >>>>>> > >> On 3/10/15, 1:47 PM, "Aaron T. Myers" <atm@cloudera.com> wrote:
> >>>>>> > >>
> >>>>>> > >>>Hey Colin,
> >>>>>> > >>>
> >>>>>> > >>>I asked Andrew Bayer, who works with Apache Infra, what's going on with
> >>>>>> > >>>these boxes. He took a look and concluded that some perms are being set in
> >>>>>> > >>>those directories by our unit tests which are precluding those files from
> >>>>>> > >>>getting deleted. He's going to clean up the boxes for us, but we should
> >>>>>> > >>>expect this to keep happening until we can fix the test in question to
> >>>>>> > >>>properly clean up after itself.
> >>>>>> > >>>
> >>>>>> > >>>To help narrow down which commit it was that started this, Andrew sent me
> >>>>>> > >>>this info:
> >>>>>> > >>>
> >>>>>> > >>>"/home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3/
> >>>>>> > >>>has 500 perms, so I'm guessing that's the problem. Been that way since 9:32
> >>>>>> > >>>UTC on March 5th."
> >>>>>> > >>>
> >>>>>> > >>>--
> >>>>>> > >>>Aaron T. Myers
> >>>>>> > >>>Software Engineer, Cloudera
> >>>>>> > >>>
> >>>>>> > >>>On Tue, Mar 10, 2015 at 1:24 PM, Colin P. McCabe <cmccabe@apache.org> wrote:
> >>>>>> > >>>
> >>>>>> > >>>> Hi all,
> >>>>>> > >>>>
> >>>>>> > >>>> A very quick (and not thorough) survey shows that I can't find any
> >>>>>> > >>>> jenkins jobs that succeeded from the last 24 hours.  Most of them seem
> >>>>>> > >>>> to be failing with some variant of this message:
> >>>>>> > >>>>
> >>>>>> > >>>> [ERROR] Failed to execute goal
> >>>>>> > >>>> org.apache.maven.plugins:maven-clean-plugin:2.5:clean (default-clean)
> >>>>>> > >>>> on project hadoop-hdfs: Failed to clean project: Failed to delete
> >>>>>> > >>>> /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3
> >>>>>> > >>>> -> [Help 1]
> >>>>>> > >>>>
> >>>>>> > >>>> Any ideas how this happened?  Bad disk, unit test setting wrong
> >>>>>> > >>>> permissions?
> >>>>>> > >>>>
> >>>>>> > >>>> Colin
> >>>>>> > >>>>
> >>>>>> > >>
> >>>>>> > >
> >>>>>> > >
> >>>>>> > >
> >>>>>> > > --
> >>>>>> > > Lei (Eddy) Xu
> >>>>>> > > Software Engineer, Cloudera
> >>>>>> >
> >>>>>>
> >>>>>
> >>>>>
> >>
> >>
> >>
> >> --
> >> Lei (Eddy) Xu
> >> Software Engineer, Cloudera
>



-- 
Sean
