mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@googlemail.com.INVALID>
Subject Re: CUDNN algorithm selection failure
Date Tue, 02 Oct 2018 09:38:40 GMT
I have created an issue at
https://github.com/apache/incubator-mxnet/issues/12715 and a PR to disable
the test at https://github.com/apache/incubator-mxnet/pull/12716.

This test is fairly new and was submitted along with a number of other
problematic (and disabled) tests (see
https://github.com/apache/incubator-mxnet/issues/11164). It is possible
that the test is simply not stable enough. The PR that introduced the test
is https://github.com/apache/incubator-mxnet/pull/10921 - it was merged two
days ago.

Best regards,
Marco

On Tue, Oct 2, 2018 at 8:43 AM Pedro Larroy <pedro.larroy.lists@gmail.com>
wrote:

> Thanks for checking Lin. If it happens again we will have to dig deeper. We
> have just one executor in GPU so I wonder what could be the root cause of
> this.
>
> On Mon, Oct 1, 2018 at 10:57 PM Lin Yuan <apeforest@gmail.com> wrote:
>
> > I could not reproduce the error on an EC2 g3x8 instance making it hard to
> > debug. I also suspect it was due to a resource usage limit on the CI
> > instance.
> >
> > On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> > wrote:
> >
> > > It doesn't look like flakiness to me at first sight. I think it might be
> > > related to resource usage / allocation / leak in the worst case.
> > >
> > > Could be that there was not enough GPU memory at the time of test
> > > execution. But I'm just speculating, hence my original question.
> > >
> > > Pedro.
> > >
> > > On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan <apeforest@gmail.com> wrote:
> > >
> > > > Hi Pedro,
> > > >
> > > > I also got this failure in my PR
> > > >
> > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline
> > > >
> > > > I was not able to identify the root cause from the changelist. Are you
> > > > suggesting there is some flakiness in the master branch too?
> > > >
> > > > Thanks,
> > > >
> > > > Lin
> > > >
> > > > On Mon, Oct 1, 2018 at 4:55 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi
> > > > >
> > > > > I saw this failure on CI:
> > > > >
> > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline
> > > > >
> > > > > Have you seen other cases where we fail to select the best CUDNN
> > > > > algorithm? In which circumstances could this happen, and do you
> > > > > think it is a good idea to have one selected by default as a last
> > > > > resort?
> > > > >
> > > > >
> > > > > Pedro.
