mxnet-dev mailing list archives

From Pedro Larroy <pedro.larroy.li...@gmail.com>
Subject Re: CUDNN algorithm selection failure
Date Wed, 03 Oct 2018 16:32:33 GMT
It seems it is not the only test:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12726/5/pipeline

test_slice_batchnorm_reshape_batchnorm is also failing and hasn't been
touched for a while. It doesn't look like a problem with the test itself
to me (not a flaky test). It looks to me like we should find and address
the root cause instead of disabling the test in this case.

Pedro.
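The "one selected by default as a last resort" idea raised at the bottom of
this thread could be sketched roughly as follows. This is a hypothetical
illustration of the fallback strategy only (all names are made up; it is not
MXNet's actual cuDNN selection code): probe candidate algorithms in order of
preference, and if none can be used (e.g. not enough GPU workspace memory, as
speculated below), fall back to a known-safe default that needs no workspace.

```python
# Hypothetical sketch of last-resort algorithm selection (NOT MXNet's
# actual implementation). Candidates are probed in order of preference;
# if every probe fails, a zero-workspace default is returned instead of
# raising an error.

class AlgoSelectionError(Exception):
    """Raised by a probe when an algorithm cannot be used."""

def select_algo(candidates, probe, default):
    """Return the first candidate that probe() accepts, else the default."""
    for algo in candidates:
        try:
            probe(algo)          # may raise, e.g. workspace allocation fails
            return algo
        except AlgoSelectionError:
            continue             # try the next candidate
    return default               # last resort: always-available algorithm

if __name__ == "__main__":
    # Simulate a CI box with little free GPU memory: every candidate that
    # needs workspace fails, so selection lands on the default.
    workspace_needed = {"winograd": 512, "fft": 256, "implicit_gemm": 0}
    free_mem = 128

    def probe(algo):
        if workspace_needed[algo] > free_mem:
            raise AlgoSelectionError(algo)

    print(select_algo(["winograd", "fft"], probe, "implicit_gemm"))
    # prints "implicit_gemm"
```

Whether silently degrading to a slower algorithm is preferable to failing
loudly on CI is exactly the trade-off the question below is asking about.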

On Tue, Oct 2, 2018 at 2:39 AM Marco de Abreu
<marco.g.abreu@googlemail.com.invalid> wrote:

> I have created an issue at
> https://github.com/apache/incubator-mxnet/issues/12715 and a PR to disable
> the test at https://github.com/apache/incubator-mxnet/pull/12716.
>
> This test is pretty new and was submitted alongside a number of other
> problematic (and disabled) tests:
> https://github.com/apache/incubator-mxnet/issues/11164 It could be that
> the test is simply not stable enough. The PR that introduced the test is
> https://github.com/apache/incubator-mxnet/pull/10921 - it was merged two
> days ago.
>
> Best regards,
> Marco
>
> On Tue, Oct 2, 2018 at 8:43 AM Pedro Larroy <pedro.larroy.lists@gmail.com>
> wrote:
>
> > Thanks for checking, Lin. If it happens again we will have to dig
> > deeper. We have just one executor on GPU, so I wonder what the root
> > cause of this could be.
> >
> > On Mon, Oct 1, 2018 at 10:57 PM Lin Yuan <apeforest@gmail.com> wrote:
> >
> > > I could not reproduce the error on an EC2 g3x8 instance, which makes
> > > it hard to debug. I also suspect it was due to a resource usage limit
> > > on the CI instance.
> > >
> > > On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > wrote:
> > >
> > > > It doesn't look like flakiness to me at first sight. I think it
> > > > might be related to resource usage / allocation, or a leak in the
> > > > worst case.
> > > >
> > > > It could be that there was not enough free GPU memory at the time
> > > > of test execution. But I'm just speculating, hence my original
> > > > question.
> > > >
> > > > Pedro.
> > > >
> > > > On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan <apeforest@gmail.com> wrote:
> > > >
> > > > > Hi Pedro,
> > > > >
> > > > > I also got this failure in my PR
> > > > >
> > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline
> > > > >
> > > > > I was not able to identify the root cause from the changelist.
> > > > > Are you suggesting there is some flakiness in the master branch
> > > > > too?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Lin
> > > > >
> > > > > On Mon, Oct 1, 2018 at 4:55 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi
> > > > > >
> > > > > > I saw this failure on CI:
> > > > > >
> > > > > >
> > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline
> > > > > >
> > > > > > Have you seen other cases where we fail to select the best
> > > > > > CUDNN algorithm? Under which circumstances could this happen,
> > > > > > and do you think it is a good idea to have one selected by
> > > > > > default as a last resort?
> > > > > >
> > > > > >
> > > > > > Pedro.
> > > > > >
> > > > >
> > > >
> > >
> >
>
