mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@googlemail.com.INVALID>
Subject Re: CUDNN algorithm selection failure
Date Thu, 04 Oct 2018 18:58:37 GMT
For GPU, we don't run any tests in parallel.

-Marco

Naveen Swamy <mnnaveen@gmail.com> schrieb am Do., 4. Okt. 2018, 19:54:

> Looking at the error raised, you can see that the workspace size (GPU memory
> size) of 1GB isn't sufficient. I am wondering if it is due to tests running
> in parallel on CI; if this is true (tests running in parallel), is it
> possible to reduce the parallelism?
> Error:
> "mxnet.base.MXNetError: [05:40:12]
> src/operator/nn/./cudnn/cudnn_convolution-inl.h:870: Failed to find any
> forward convolution algorithm.  with workspace size of 1073741824 bytes,
> please consider reducing batch/model size or increasing the workspace size"
>
> I ran a similar test (test_slice_batchnorm) 5K times and I couldn't
> reproduce the issue. I will look into it further to see if there are other
> alternatives.
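The 1073741824 bytes in the error is exactly 1024 MB, which matches the default value of the Convolution operator's per-layer "workspace" attribute (the cap on cuDNN scratch memory, in MB). Below is a minimal sketch of how one might check whether that per-layer cap itself is the culprit, assuming a free GPU is available; the shapes and sizes are made up for illustration only:

    import mxnet as mx

    # The failing limit (1073741824 bytes == 1024 MB) equals Convolution's
    # default `workspace`; raising it lets cuDNN consider algorithms that need
    # more scratch memory. Shapes here are arbitrary illustration values.
    data = mx.sym.Variable("data")
    conv = mx.sym.Convolution(data=data, num_filter=64, kernel=(3, 3),
                              pad=(1, 1), workspace=2048)  # in MB, default 1024
    exe = conv.simple_bind(ctx=mx.gpu(0), data=(32, 3, 224, 224))
    exe.forward(is_train=True,
                data=mx.nd.random.uniform(shape=(32, 3, 224, 224), ctx=mx.gpu(0)))
    mx.nd.waitall()

If the failure persists even with a larger workspace, overall GPU memory pressure on the CI host is a more likely explanation than the per-layer cap.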
>
>
> On Thu, Oct 4, 2018 at 10:48 AM Piyush Ghai <ghai.piyush@gmail.com> wrote:
>
> > Another build where test_slice_batchnorm_reshape_batchnorm fails:
> >
> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12721/7/pipeline
> >
> > —
> > Piyush
> >
> > > On Oct 3, 2018, at 9:32 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > >
> > > Seems it is not the only test:
> > >
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12726/5/pipeline
> > >
> > > test_slice_batchnorm_reshape_batchnorm is also failing and hasn't been
> > > touched for a while. It doesn't look like a problem with the test to me
> > > (not a flaky test). It looks to me like we should find and address the
> > > root cause instead of disabling the test in this case.
> > >
> > > Pedro.
> > >
> > > On Tue, Oct 2, 2018 at 2:39 AM Marco de Abreu
> > > <marco.g.abreu@googlemail.com.invalid> wrote:
> > >
> > >> I have created an issue at
> > >> https://github.com/apache/incubator-mxnet/issues/12715 and a PR to
> > >> disable the test at https://github.com/apache/incubator-mxnet/pull/12716.
> > >>
> > >> This test is pretty new and was submitted with a number of other
> > >> problematic (and disabled) tests:
> > >> https://github.com/apache/incubator-mxnet/issues/11164
> > >> It could be that the test is simply not stable enough. The PR that
> > >> introduced that test is https://github.com/apache/incubator-mxnet/pull/10921
> > >> - it was merged two days ago.
> > >>
> > >> Best regards,
> > >> Marco
> > >>
> > >> On Tue, Oct 2, 2018 at 8:43 AM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > >>
> > >>> Thanks for checking, Lin. If it happens again we will have to dig
> > >>> deeper. We have just one executor on GPU, so I wonder what could be
> > >>> the root cause of this.
> > >>>
> > >>> On Mon, Oct 1, 2018 at 10:57 PM Lin Yuan <apeforest@gmail.com> wrote:
> > >>>
> > >>>> I could not reproduce the error on an EC2 g3x8 instance, making it
> > >>>> hard to debug. I also suspect it was due to a resource usage limit
> > >>>> on the CI instance.
> > >>>>
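For reproduction runs like the 5K attempt mentioned above, a brute-force loop over the test function is often enough to expose leaks or fragmentation that only appear after many iterations. This is only a sketch; the import path is a guess (the test lives somewhere under tests/python) and may need adjusting:

    import mxnet as mx
    # Hypothetical import path; adjust to wherever the test is actually defined.
    from test_gluon_gpu import test_slice_batchnorm_reshape_batchnorm

    for i in range(5000):
        test_slice_batchnorm_reshape_batchnorm()
        mx.nd.waitall()  # surface async CUDA/cuDNN errors at this iteration
        if i % 500 == 0:
            print("iteration", i, "ok")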
> > >>>> On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > >>>>
> > >>>>> It doesn't look like flakiness to me at first sight. I think it might
> > >>>>> be related to resource usage / allocation / a leak in the worst case.
> > >>>>>
> > >>>>> Could be that there was not enough GPU memory at the time of test
> > >>>>> execution. But I'm just speculating, hence my original question.
> > >>>>>
> > >>>>> Pedro.
> > >>>>>
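One cheap way to test the memory-pressure theory would be to log free GPU memory right before the suspect tests run. A sketch, assuming nvidia-smi is on the PATH; the nose-style setup hook is only an example of where the call could go:

    import subprocess

    def log_gpu_memory():
        # Query free/total memory per GPU via nvidia-smi (values are in MiB).
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.free,memory.total",
             "--format=csv,noheader,nounits"])
        for idx, line in enumerate(out.decode().strip().splitlines()):
            free_mb, total_mb = (int(v) for v in line.split(","))
            print("GPU %d: %d MiB free of %d MiB" % (idx, free_mb, total_mb))

    def setup():
        # Module-level nose setup hook; placing this in the failing test module
        # would record how much memory is actually available at test time.
        log_gpu_memory()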
> > >>>>> On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan <apeforest@gmail.com> wrote:
> > >>>>>
> > >>>>>> Hi Pedro,
> > >>>>>>
> > >>>>>> I also got this failure in my PR
> > >>>>>>
> > >>>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline
> > >>>>>>
> > >>>>>> I was not able to identify the root cause of it from the changelist.
> > >>>>>> Are you suggesting there is some flakiness in the master branch too?
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>>
> > >>>>>> Lin
> > >>>>>>
> > >>>>>> On Mon, Oct 1, 2018 at 4:55 PM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> Hi
> > >>>>>>>
> > >>>>>>> I saw this failure on CI:
> > >>>>>>>
> > >>>>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline
> > >>>>>>>
> > >>>>>>> Have you seen other cases where we fail to select the best CUDNN
> > >>>>>>> algorithm? In which circumstances could this happen, and do you
> > >>>>>>> think it is a good idea to have one selected by default as a last
> > >>>>>>> resort?
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Pedro.
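On the "selected by default as a last resort" question: the exception comes from the algorithm search in cudnn_convolution-inl.h, and MXNet already exposes MXNET_CUDNN_AUTOTUNE_DEFAULT, which is documented to disable the measured algorithm search when set to 0 in favour of cuDNN's heuristic choice. Re-running the failing test with it disabled would at least tell us whether the search itself is the problem. This is a probe, not a proposed fix, and the test path below is a guess:

    import os
    import subprocess

    # With MXNET_CUDNN_AUTOTUNE_DEFAULT=0 the backend skips the measured
    # algorithm search and relies on cuDNN's heuristics. If the test passes
    # this way, the failure is specific to the workspace-bounded search.
    # The test path is a guess and may need adjusting.
    env = dict(os.environ, MXNET_CUDNN_AUTOTUNE_DEFAULT="0")
    subprocess.run(
        ["nosetests", "--verbose",
         "tests/python/gpu/test_gluon_gpu.py:test_slice_batchnorm_reshape_batchnorm"],
        env=env, check=False)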
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
> >
>
