mxnet-dev mailing list archives

From Naveen Swamy <mnnav...@gmail.com>
Subject Re: CUDNN algorithm selection failure
Date Thu, 04 Oct 2018 17:54:06 GMT
Looking at the error raised, you can see that the workspace size (GPU memory
size) of 1 GB isn't sufficient. I am wondering if this is due to tests running
in parallel on CI; if so, is it possible to reduce the parallelism?
Error:
"mxnet.base.MXNetError: [05:40:12]
src/operator/nn/./cudnn/cudnn_convolution-inl.h:870: Failed to find any
forward convolution algorithm.  with workspace size of 1073741824 bytes,
please consider reducing batch/model size or increasing the workspace size"
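
For context, 1073741824 bytes is exactly the default 1024 MB workspace limit on
MXNet convolution operators, so the two suggestions in the error map onto
operator arguments. A minimal sketch of both knobs (the shapes, batch size and
the 2048 MB value below are illustrative only, not taken from the failing test):

import mxnet as mx

data = mx.sym.Variable("data")

# Knob 1: raise the per-operator workspace above the 1024 MB default
# (the argument is in MB; 2048 is just an example value).
conv = mx.sym.Convolution(data=data, kernel=(3, 3), num_filter=64,
                          workspace=2048)

# Knob 2: keep the default workspace but bind with a smaller batch, so cuDNN
# needs less scratch memory to find a forward algorithm.
exe = conv.simple_bind(mx.gpu(0), data=(8, 3, 224, 224))
exe.forward(is_train=False, data=mx.nd.ones((8, 3, 224, 224), ctx=mx.gpu(0)))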

I ran a similar test (test_slice_batchnorm) 5K times and couldn't reproduce the
issue. I will look into it further to see if there are other alternatives.
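
Roughly what my repetition run looked like (a sketch only: the import path and
seed handling below are assumptions rather than the exact harness I used):

import random
import mxnet as mx
# Assumed location of the test; adjust the import to wherever it actually lives.
from test_gluon_gpu import test_slice_batchnorm

for i in range(5000):
    seed = random.randint(0, 2**31 - 1)
    mx.random.seed(seed)  # vary the seed so each iteration sees different data
    try:
        test_slice_batchnorm()
    except mx.base.MXNetError as err:
        print("Failed on iteration %d (seed %d): %s" % (i, seed, err))
        raise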


On Thu, Oct 4, 2018 at 10:48 AM Piyush Ghai <ghai.piyush@gmail.com> wrote:

> Another build where test_slice_batchnorm_reshape_batchnorm fails:
>
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12721/7/pipeline
>
> —
> Piyush
>
> > On Oct 3, 2018, at 9:32 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> >
> > Seems it's not the only test:
> >
> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12726/5/pipeline
> >
> > test_slice_batchnorm_reshape_batchnorm is also failing and hasn't been
> > touched for a while. It doesn't look like a problem with the test to me
> > (not a flaky test). It looks to me like we should find and address the root
> > cause instead of disabling the test in this case.
> >
> > Pedro.
> >
> > On Tue, Oct 2, 2018 at 2:39 AM Marco de Abreu
> > <marco.g.abreu@googlemail.com.invalid> wrote:
> >
> >> I have created an issue at
> >> https://github.com/apache/incubator-mxnet/issues/12715 and a PR to disable
> >> the test at https://github.com/apache/incubator-mxnet/pull/12716.
> >>
> >> This test is pretty new and was submitted with a number of other
> >> problematic (and disabled) tests:
> >> https://github.com/apache/incubator-mxnet/issues/11164 It could be possible
> >> that the test is simply not stable enough. The PR that introduced that test
> >> is https://github.com/apache/incubator-mxnet/pull/10921 - it was merged two
> >> days ago.
> >>
> >> Best regards,
> >> Marco
> >>
> >> On Tue, Oct 2, 2018 at 8:43 AM Pedro Larroy <pedro.larroy.lists@gmail.com>
> >> wrote:
> >>
> >>> Thanks for checking, Lin. If it happens again we will have to dig deeper.
> >>> We have just one executor on GPU, so I wonder what could be the root cause
> >>> of this.
> >>>
> >>> On Mon, Oct 1, 2018 at 10:57 PM Lin Yuan <apeforest@gmail.com> wrote:
> >>>
> >>>> I could not reproduce the error on an EC2 g3x8 instance, making it hard to
> >>>> debug. I also suspect it was due to a resource usage limit on the CI
> >>>> instance.
> >>>>
> >>>> On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> It doesn't look like flakiness to me at first sight. I think it might be
> >>>>> related to resource usage / allocation / leak in the worst case.
> >>>>>
> >>>>> Could be that there was not enough GPU memory at the time of test
> >>>>> execution. But I'm just speculating, hence my original question.
> >>>>>
> >>>>> Pedro.
> >>>>>
> >>>>> On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan <apeforest@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Pedro,
> >>>>>>
> >>>>>> I also got this failure in my PR
> >>>>>>
> >>>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline
> >>>>>>
> >>>>>> I was not able to identify the root cause of it from the changelist. Are
> >>>>>> you suggesting there is some flakiness in the master branch too?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Lin
> >>>>>>
> >>>>>> On Mon, Oct 1, 2018 at 4:55 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi
> >>>>>>>
> >>>>>>> I saw this failure on CI:
> >>>>>>>
> >>>>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline
> >>>>>>>
> >>>>>>> Have you seen other cases where we fail to select the best CUDNN
> >>>>>>> algorithm? In which circumstances could this happen, and do you think
> >>>>>>> it is a good idea to have one selected by default as a last resort?
> >>>>>>>
> >>>>>>> Pedro.
> >>>>>>>
