hadoop-hdfs-dev mailing list archives

From Vinayakumar B <vinayakum...@apache.org>
Subject Re: upstream jenkins build broken?
Date Wed, 18 Mar 2015 02:22:08 GMT
This problem seems to be gone, at least for now.
I have made a temporary (as of now) commit to restore the execute permissions
for the hadoop-hdfs/target/test/data directory.

The problem was seen most often on the H9 node, but multiple builds have now
executed successfully on that node.

Regards,
Vinay

On Tue, Mar 17, 2015 at 9:53 PM, Vinayakumar B <vinayakumarb@apache.org>
wrote:

> Yes. Just create a directory with some contents in it within the target
> directory, and set its permission to 600.
> Then you can run either 'mvn clean' or 'git clean'.
>
> -Vinay
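For anyone wanting a local reproduction, Vinay's steps above can be sketched as the following shell session. The paths are illustrative, not the actual Hadoop target layout, and the behavior shown assumes a non-root user (root can delete regardless of permission bits):

```shell
# Reproduce the stuck-directory problem described above.
# Paths are illustrative; any directory under target/ behaves the same way.
workdir=$(mktemp -d)
mkdir -p "$workdir/target/test/data"
echo hello > "$workdir/target/test/data/somefile"
chmod 600 "$workdir/target/test/data"   # rw for owner, no execute bit

# Without +x on the directory its entries cannot be unlinked, so
# 'mvn clean' fails here and 'git clean -xdf' silently skips it.
result="removed"
rm -rf "$workdir/target" 2>/dev/null || result="not removed"
echo "target was $result"

# Restore the execute bit (as the temporary commit did) and clean up.
chmod -R u+rwx "$workdir" 2>/dev/null || true
rm -rf "$workdir"
```

The key point is that unlinking an entry requires write and execute permission on the containing directory, which mode 600 denies.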
>
> On Tue, Mar 17, 2015 at 9:13 PM, Sean Busbey <busbey@cloudera.com> wrote:
>
>> Is the simulation just removing the executable bit on the directory? I'd
>> like to get something I can reproduce locally.
>>
>> On Tue, Mar 17, 2015 at 2:29 AM, Vinayakumar B <vinayakumarb@apache.org>
>> wrote:
>>
>> > I have simulated the problem in my env and verified that both
>> > 'git clean -xdf' and 'mvn clean' will not remove the directory.
>> > mvn fails, whereas git simply ignores the problem (without even
>> > displaying a warning).
>> >
>> >
>> >
>> > Regards,
>> > Vinay
>> >
>> > On Tue, Mar 17, 2015 at 2:32 AM, Sean Busbey <busbey@cloudera.com>
>> wrote:
>> >
>> > > Can someone point me to an example build that is broken?
>> > >
>> > > On Mon, Mar 16, 2015 at 3:52 PM, Sean Busbey <busbey@cloudera.com>
>> > wrote:
>> > >
>> > > > I'm on it. HADOOP-11721
>> > > >
>> > > > On Mon, Mar 16, 2015 at 3:44 PM, Haohui Mai <wheat9@apache.org>
>> wrote:
>> > > >
>> > > >> +1 for git clean.
>> > > >>
>> > > >> Colin, can you please get it in ASAP? Currently due to the jenkins
>> > > >> issues, we cannot close the 2.7 blockers.
>> > > >>
>> > > >> Thanks,
>> > > >> Haohui
>> > > >>
>> > > >>
>> > > >>
>> > > >> On Mon, Mar 16, 2015 at 11:54 AM, Colin P. McCabe <cmccabe@apache.org>
>> > > >> wrote:
>> > > >> > If all it takes is someone creating a test that makes a directory
>> > > >> > without -x, this is going to happen over and over.
>> > > >> >
>> > > >> > Let's just fix the problem at the root by running "git clean
>> > > >> > -fqdx" in our jenkins scripts.  If there are no objections, I
>> > > >> > will add this in and un-break the builds.
>> > > >> >
>> > > >> > best,
>> > > >> > Colin
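A sketch of what such a Jenkins cleanup step might look like, demonstrated in a throwaway git repo rather than the real workspace. Note the extra chmod pass, which is an assumption on my part: as established earlier in the thread, git clean by itself silently skips directories that lack the execute bit, so traversal permission has to be restored first.

```shell
# Hypothetical sketch of the proposed Jenkins cleanup step, demonstrated
# in a throwaway git repo (a real script would run in the job workspace).
repo=$(mktemp -d)
cd "$repo"
git init -q

mkdir -p target/test/data
touch target/test/data/somefile
chmod 600 target/test/data      # the stuck state described in the thread

# Restore traversal permission first: as noted above, git clean alone
# silently skips directories that lack the execute bit.
find . -type d ! -perm -u+x -exec chmod u+rwx {} \; 2>/dev/null
git clean -fqdx                 # -f force, -q quiet, -d dirs, -x ignored too

if [ -e target ]; then status="survived"; else status="cleaned"; fi
echo "target $status"
cd / && rm -rf "$repo"
```

Using `-exec ... \;` (rather than `+`) matters here: find evaluates the expression on each directory before descending into it, so the chmod lands before find tries to read the directory's contents.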
>> > > >> >
>> > > >> > On Fri, Mar 13, 2015 at 1:48 PM, Lei Xu <lei@cloudera.com>
>> wrote:
>> > > >> >> I filed HDFS-7917 to change the way to simulate disk failures.
>> > > >> >>
>> > > >> >> But I think we still need infrastructure folks to help with
>> > > >> >> jenkins scripts to clean the dirs left today.
>> > > >> >>
>> > > >> >> On Fri, Mar 13, 2015 at 1:38 PM, Mai Haohui <ricetons@gmail.com>
>> > > >> wrote:
>> > > >> >>> Any updates on this issue? It seems that all HDFS jenkins
>> > > >> >>> builds are still failing.
>> > > >> >>>
>> > > >> >>> Regards,
>> > > >> >>> Haohui
>> > > >> >>>
>> > > >> >>> On Thu, Mar 12, 2015 at 12:53 AM, Vinayakumar B <
>> > > >> vinayakumarb@apache.org> wrote:
>> > > >> >>>> I think the problem started from here.
>> > > >> >>>>
>> > > >> >>>>
>> > > >> >>>> https://builds.apache.org/job/PreCommit-HDFS-Build/9828/testReport/junit/org.apache.hadoop.hdfs.server.datanode/TestDataNodeVolumeFailure/testUnderReplicationAfterVolFailure/
>> > > >> >>>>
>> > > >> >>>> As Chris mentioned, TestDataNodeVolumeFailure is changing the
>> > > >> >>>> permission.
>> > > >> >>>> But in this patch, ReplicationMonitor got an NPE and received a
>> > > >> >>>> terminate signal, due to which MiniDFSCluster.shutdown() threw
>> > > >> >>>> an Exception.
>> > > >> >>>>
>> > > >> >>>> But TestDataNodeVolumeFailure#tearDown() restores those
>> > > >> >>>> permissions only after shutting down the cluster. So in this
>> > > >> >>>> case, IMO, the permissions were never restored.
>> > > >> >>>>
>> > > >> >>>>
>> > > >> >>>>   @After
>> > > >> >>>>   public void tearDown() throws Exception {
>> > > >> >>>>     if(data_fail != null) {
>> > > >> >>>>       FileUtil.setWritable(data_fail, true);
>> > > >> >>>>     }
>> > > >> >>>>     if(failedDir != null) {
>> > > >> >>>>       FileUtil.setWritable(failedDir, true);
>> > > >> >>>>     }
>> > > >> >>>>     if(cluster != null) {
>> > > >> >>>>       cluster.shutdown();
>> > > >> >>>>     }
>> > > >> >>>>     for (int i = 0; i < 3; i++) {
>> > > >> >>>>       FileUtil.setExecutable(new File(dataDir, "data"+(2*i+1)), true);
>> > > >> >>>>       FileUtil.setExecutable(new File(dataDir, "data"+(2*i+2)), true);
>> > > >> >>>>     }
>> > > >> >>>>   }
>> > > >> >>>>
>> > > >> >>>>
>> > > >> >>>> Regards,
>> > > >> >>>> Vinay
>> > > >> >>>>
>> > > >> >>>> On Thu, Mar 12, 2015 at 12:35 PM, Vinayakumar B <vinayakumarb@apache.org>
>> > > >> >>>> wrote:
>> > > >> >>>>
>> > > >> >>>>> When I look at the history of these kinds of builds, all of
>> > > >> >>>>> them failed on node H9.
>> > > >> >>>>>
>> > > >> >>>>> I think some uncommitted patch would have created the
>> > > >> >>>>> problem and left it there.
>> > > >> >>>>>
>> > > >> >>>>>
>> > > >> >>>>> Regards,
>> > > >> >>>>> Vinay
>> > > >> >>>>>
>> > > >> >>>>> On Thu, Mar 12, 2015 at 6:16 AM, Sean Busbey <busbey@cloudera.com>
>> > > >> wrote:
>> > > >> >>>>>
>> > > >> >>>>>> You could rely on a destructive git clean call instead of
>> > > >> >>>>>> maven to do the directory removal.
>> > > >> >>>>>>
>> > > >> >>>>>> --
>> > > >> >>>>>> Sean
>> > > >> >>>>>> On Mar 11, 2015 4:11 PM, "Colin McCabe" <cmccabe@alumni.cmu.edu>
>> > > >> wrote:
>> > > >> >>>>>>
>> > > >> >>>>>> > Is there a maven plugin or setting we can use to simply
>> > > >> >>>>>> > remove directories that have no executable permissions on
>> > > >> >>>>>> > them?  Clearly we have the permission to do this from a
>> > > >> >>>>>> > technical point of view (since we created the directories
>> > > >> >>>>>> > as the jenkins user); it's simply that the code refuses
>> > > >> >>>>>> > to do it.
>> > > >> >>>>>> >
>> > > >> >>>>>> > Otherwise I guess we can just fix those tests...
>> > > >> >>>>>> >
>> > > >> >>>>>> > Colin
>> > > >> >>>>>> >
>> > > >> >>>>>> > On Tue, Mar 10, 2015 at 2:43 PM, Lei Xu <lei@cloudera.com>
>> > > >> wrote:
>> > > >> >>>>>> > > Thanks a lot for looking into HDFS-7722, Chris.
>> > > >> >>>>>> > >
>> > > >> >>>>>> > > In HDFS-7722:
>> > > >> >>>>>> > > TestDataNodeVolumeFailureXXX tests reset data dir
>> > > >> >>>>>> > > permissions in TearDown().
>> > > >> >>>>>> > > TestDataNodeHotSwapVolumes resets permissions in a
>> > > >> >>>>>> > > finally clause.
>> > > >> >>>>>> > >
>> > > >> >>>>>> > > Also, I ran mvn test several times on my machine and
>> > > >> >>>>>> > > all tests passed.
>> > > >> >>>>>> > >
>> > > >> >>>>>> > > However, since in DiskChecker#checkDirAccess():
>> > > >> >>>>>> > >
>> > > >> >>>>>> > > private static void checkDirAccess(File dir) throws DiskErrorException {
>> > > >> >>>>>> > >   if (!dir.isDirectory()) {
>> > > >> >>>>>> > >     throw new DiskErrorException("Not a directory: " + dir.toString());
>> > > >> >>>>>> > >   }
>> > > >> >>>>>> > >
>> > > >> >>>>>> > >   checkAccessByFileMethods(dir);
>> > > >> >>>>>> > > }
>> > > >> >>>>>> > >
>> > > >> >>>>>> > > One potentially safer alternative is replacing the data
>> > > >> >>>>>> > > dir with a regular file to simulate disk failures.
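The alternative Lei describes can be sketched as follows. The path names are illustrative (not the actual MiniDFSCluster layout), and HDFS-7917 contains the real change; this only illustrates why a regular file is a safer failure stand-in than a permission-stripped directory:

```shell
# Sketch of the HDFS-7917 idea: simulate a failed volume by replacing the
# data directory with a regular file instead of removing its execute bit.
# Paths are illustrative, not the actual MiniDFSCluster layout.
base=$(mktemp -d)
vol="$base/data1"
mkdir -p "$vol"

rmdir "$vol"    # "fail" the volume:
touch "$vol"    # it is now a plain file, so the isDirectory() check in
                # DiskChecker#checkDirAccess throws "Not a directory"

isfile=no
[ -f "$vol" ] && isfile=yes
echo "volume is a regular file: $isfile"

# Unlike the chmod approach, cleanup needs no permission-restore step:
# 'mvn clean' (or plain rm) can delete a regular file without issue.
rm -rf "$base"
```

The design advantage is that a crashed or killed test process leaves nothing behind that `mvn clean` or `git clean` cannot delete.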
>> > > >> >>>>>> > >
>> > > >> >>>>>> > > On Tue, Mar 10, 2015 at 2:19 PM, Chris Nauroth <cnauroth@hortonworks.com>
>> > > >> >>>>>> > wrote:
>> > > >> >>>>>> > >> TestDataNodeHotSwapVolumes, TestDataNodeVolumeFailure,
>> > > >> >>>>>> > >> TestDataNodeVolumeFailureReporting, and
>> > > >> >>>>>> > >> TestDataNodeVolumeFailureToleration all remove executable
>> > > >> >>>>>> > >> permissions from directories like the one Colin mentioned
>> > > >> >>>>>> > >> to simulate disk failures at data nodes.  I reviewed the
>> > > >> >>>>>> > >> code for all of those, and they all appear to be doing
>> > > >> >>>>>> > >> the necessary work to restore executable permissions at
>> > > >> >>>>>> > >> the end of the test.  The only recent uncommitted patch
>> > > >> >>>>>> > >> I've seen that makes changes in these test suites is
>> > > >> >>>>>> > >> HDFS-7722.  That patch still looks fine though.  I don't
>> > > >> >>>>>> > >> know if there are other uncommitted patches that changed
>> > > >> >>>>>> > >> these test suites.
>> > > >> >>>>>> > >>
>> > > >> >>>>>> > >> I suppose it's also possible that the JUnit process
>> > > >> >>>>>> > >> unexpectedly died after removing executable permissions
>> > > >> >>>>>> > >> but before restoring them.  That always would have been
>> > > >> >>>>>> > >> a weakness of these test suites, regardless of any
>> > > >> >>>>>> > >> recent changes.
>> > > >> >>>>>> > >>
>> > > >> >>>>>> > >> Chris Nauroth
>> > > >> >>>>>> > >> Hortonworks
>> > > >> >>>>>> > >> http://hortonworks.com/
>> > > >> >>>>>> > >>
>> > > >> >>>>>> > >>
>> > > >> >>>>>> > >>
>> > > >> >>>>>> > >>
>> > > >> >>>>>> > >>
>> > > >> >>>>>> > >>
>> > > >> >>>>>> > >> On 3/10/15, 1:47 PM, "Aaron T. Myers" <atm@cloudera.com>
>> > > >> wrote:
>> > > >> >>>>>> > >>
>> > > >> >>>>>> > >>>Hey Colin,
>> > > >> >>>>>> > >>>
>> > > >> >>>>>> > >>>I asked Andrew Bayer, who works with Apache Infra, what's
>> > > >> >>>>>> > >>>going on with these boxes. He took a look and concluded
>> > > >> >>>>>> > >>>that some perms are being set in those directories by our
>> > > >> >>>>>> > >>>unit tests which are precluding those files from getting
>> > > >> >>>>>> > >>>deleted. He's going to clean up the boxes for us, but we
>> > > >> >>>>>> > >>>should expect this to keep happening until we can fix the
>> > > >> >>>>>> > >>>test in question to properly clean up after itself.
>> > > >> >>>>>> > >>>
>> > > >> >>>>>> > >>>To help narrow down which commit it was that started
>> > > >> >>>>>> > >>>this, Andrew sent me this info:
>> > > >> >>>>>> > >>>
>> > > >> >>>>>> > >>>"/home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3/
>> > > >> >>>>>> > >>>has 500 perms, so I'm guessing that's the problem. Been
>> > > >> >>>>>> > >>>that way since 9:32 UTC on March 5th."
>> > > >> >>>>>> > >>>
>> > > >> >>>>>> > >>>--
>> > > >> >>>>>> > >>>Aaron T. Myers
>> > > >> >>>>>> > >>>Software Engineer, Cloudera
>> > > >> >>>>>> > >>>
>> > > >> >>>>>> > >>>On Tue, Mar 10, 2015 at 1:24 PM, Colin P. McCabe <cmccabe@apache.org>
>> > > >> >>>>>> > >>>wrote:
>> > > >> >>>>>> > >>>
>> > > >> >>>>>> > >>>> Hi all,
>> > > >> >>>>>> > >>>>
>> > > >> >>>>>> > >>>> A very quick (and not thorough) survey shows that I
>> > > >> >>>>>> > >>>> can't find any jenkins jobs that succeeded in the last
>> > > >> >>>>>> > >>>> 24 hours.  Most of them seem to be failing with some
>> > > >> >>>>>> > >>>> variant of this message:
>> > > >> >>>>>> > >>>>
>> > > >> >>>>>> > >>>> [ERROR] Failed to execute goal
>> > > >> >>>>>> > >>>> org.apache.maven.plugins:maven-clean-plugin:2.5:clean
>> > > >> >>>>>> > >>>> (default-clean) on project hadoop-hdfs: Failed to
>> > > >> >>>>>> > >>>> clean project: Failed to delete
>> > > >> >>>>>> > >>>> /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3
>> > > >> >>>>>> > >>>> -> [Help 1]
>> > > >> >>>>>> > >>>>
>> > > >> >>>>>> > >>>> Any ideas how this happened?  Bad disk, unit test
>> > > >> >>>>>> > >>>> setting wrong permissions?
>> > > >> >>>>>> > >>>>
>> > > >> >>>>>> > >>>> Colin
>> > > >> >>>>>> > >>>>
>> > > >> >>>>>> > >>
>> > > >> >>>>>> > >
>> > > >> >>>>>> > >
>> > > >> >>>>>> > >
>> > > >> >>>>>> > > --
>> > > >> >>>>>> > > Lei (Eddy) Xu
>> > > >> >>>>>> > > Software Engineer, Cloudera
>> > > >> >>>>>> >
>> > > >> >>>>>>
>> > > >> >>>>>
>> > > >> >>>>>
>> > > >> >>
>> > > >> >>
>> > > >> >>
>> > > >> >> --
>> > > >> >> Lei (Eddy) Xu
>> > > >> >> Software Engineer, Cloudera
>> > > >>
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Sean
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Sean
>> > >
>> >
>>
>>
>>
>> --
>> Sean
>>
>
>
