mxnet-dev mailing list archives

From Piyush Ghai <ghai.piy...@gmail.com>
Subject Re: CUDNN algorithm selection failure
Date Thu, 04 Oct 2018 17:21:27 GMT
Another build where test_slice_batchnorm_reshape_batchnorm fails:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12721/7/pipeline

—
Piyush 

> On Oct 3, 2018, at 9:32 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> 
> It seems this is not the only test:
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12726/5/pipeline
> 
> test_slice_batchnorm_reshape_batchnorm is also failing and hasn't been
> touched for a while. It doesn't look like a problem with the test to me
> (not a flaky test). It looks to me like we should find and address the root
> cause instead of disabling the test in this case.
> 
> Pedro.
> 
> On Tue, Oct 2, 2018 at 2:39 AM Marco de Abreu
> <marco.g.abreu@googlemail.com.invalid> wrote:
> 
>> I have created an issue at
>> https://github.com/apache/incubator-mxnet/issues/12715 and a PR to disable
>> the test at https://github.com/apache/incubator-mxnet/pull/12716.
>> 
>> This test is pretty new and was submitted with a number of other
>> problematic (and disabled) tests:
>> https://github.com/apache/incubator-mxnet/issues/11164 It could be possible
>> that the test is simply not stable enough. The PR that introduced that test
>> is https://github.com/apache/incubator-mxnet/pull/10921 - it was merged two
>> days ago.
>> 
>> Best regards,
>> Marco
>> 
>> On Tue, Oct 2, 2018 at 8:43 AM Pedro Larroy <pedro.larroy.lists@gmail.com>
>> wrote:
>> 
>>> Thanks for checking, Lin. If it happens again we will have to dig deeper. We
>>> have just one executor in GPU so I wonder what could be the root cause of
>>> this.
>>> 
>>> On Mon, Oct 1, 2018 at 10:57 PM Lin Yuan <apeforest@gmail.com> wrote:
>>> 
>>>> I could not reproduce the error on an EC2 g3x8 instance, making it hard to
>>>> debug. I also suspect it was due to a resource usage limit on the CI
>>>> instance.
>>>> 
>>>> On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
>>>> wrote:
>>>> 
>>>>> It doesn't look like flakiness to me at first sight. I think it might be
>>>>> related to resource usage / allocation / leak in the worst case.
>>>>> 
>>>>> It could be that there was not enough GPU memory at the time of test
>>>>> execution. But I'm just speculating, hence my original question.
>>>>> 
>>>>> Pedro.
>>>>> 
>>>>> On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan <apeforest@gmail.com> wrote:
>>>>> 
>>>>>> Hi Pedro,
>>>>>> 
>>>>>> I also got this failure in my PR
>>>>>> 
>>>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline
>>>>>> 
>>>>>> I was not able to identify the root cause of it from the changelist. Are you
>>>>>> suggesting there is some flakiness in the master branch too?
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Lin
>>>>>> 
>>>>>> On Mon, Oct 1, 2018 at 4:55 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi
>>>>>>> 
>>>>>>> I saw this failure on CI:
>>>>>>> 
>>>>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline
>>>>>>> 
>>>>>>> Have you seen other cases where we fail to select the best CUDNN algorithm?
>>>>>>> In which circumstances could this happen, and do you think it is a good idea
>>>>>>> to have one selected by default as a last resort?
>>>>>>> 
>>>>>>> 
>>>>>>> Pedro.

