mxnet-dev mailing list archives

From "Shaikh, Eftiquar" <eftiq...@amazon.com>
Subject Re: segmentation fault in master using mkdlnn
Date Thu, 03 May 2018 13:54:42 GMT
If the issue is platform-neutral, I can try reproducing it on Windows. A fault in native code
should produce a dump that can be analyzed.
I am currently building mxnet from source and can spend some time on this.
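
On the point quoted below about CI running without signal handlers: one low-cost way to get at least a Python-level traceback when native code crashes is the standard-library faulthandler module. The following is a minimal sketch, not what the MXNet CI does; the helper name run_with_faulthandler and the crash snippet are illustrative assumptions, and the null-pointer read only stands in for a real fault in native code.

```python
import subprocess
import sys


def run_with_faulthandler(code: str) -> subprocess.CompletedProcess:
    """Run `code` in a child interpreter started with -X faulthandler.

    With faulthandler enabled, a fatal signal such as SIGSEGV makes the
    interpreter print the Python traceback to stderr before it dies.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )


# Hypothetical stand-in for a crash in native code: read from address 0.
CRASH_SNIPPET = "import ctypes; ctypes.string_at(0)"

if __name__ == "__main__":
    result = run_with_faulthandler(CRASH_SNIPPET)
    print("exit code:", result.returncode)
    print(result.stderr)
```

Setting the PYTHONFAULTHANDLER environment variable has the same effect without changing the command line. Note that this only shows which Python frame was active; the C++ stack still has to come from a core dump or a native debugger.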

Sent from my iPhone

> On May 3, 2018, at 6:51 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> 
> It's very difficult to reproduce, non-deterministic. We were also running
> without signal handlers in CI so there are no stack traces unfortunately.
> 
> Care to elaborate why valgrind doesn't work with Python?
> 
> 
> 
>> On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1936@gmail.com> wrote:
>> 
>> Can we build it in CI? The segfault doesn't happen infrequently.
>> 
>> On May 2, 2018, at 11:34 PM, "Chris Olivier" <cjolivier01@gmail.com> wrote:
>> 
>>> you can try Intel Inspector, which is like an enhanced version of valgrind
>>> with a GUI and whatnot.
>>> 
>>>> On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1936@gmail.com> wrote:
>>>> 
>>>> valgrind doesn't work with Python. Also, valgrind doesn't support some
>>>> CPU instructions used by MXNet (I think some instructions related to
>>>> the random number generator).
>>>> 
>>>> 
>>>> On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavinthaker@gmail.com>
>>>> wrote:
>>>>> Have you tried running with valgrind to get some clues on the
>>>>> root cause?
>>>>> 
>>>>> Bhavin Thaker.
>>>>> 
>>>>> On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1936@gmail.com> wrote:
>>>>> 
>>>>>> It might also be possible that this isn't an MKLDNN bug.
>>>>>> I just saw a similar memory error in a build without MKLDNN.
>>>>>> 
>>>>>> 
>>>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
>>>>>> 
>>>>>> Best,
>>>>>> Da
>>>>>> 
>>>>>> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzzhen@amazon.com> wrote:
>>>>>>> There might be a race condition that causes the memory error.
>>>>>>> It might be caused by this PR:
>>>>>>> https://github.com/apache/incubator-mxnet/pull/10706/files
>>>>>>> This PR removes MKLDNN memory from NDArray.
>>>>>>> However, I don't know why this causes a memory error. If someone is using
>>>>>>> the memory, it should still hold the memory with a shared pointer.
>>>>>>> But I do see memory errors increase after this PR was merged.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Da
>>>>>>> 
>>>>>>> On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
>>>>>>> 
>>>>>>>    I couldn't reproduce locally with:
>>>>>>> 
>>>>>>>    ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && \
>>>>>>>    ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>>>>>>> 
>>>>>>> 
>>>>>>>    On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi
>>>>>>>> 
>>>>>>>> Seems master is not running anymore; there's a segmentation fault
>>>>>>>> using MKLDNN-CPU
>>>>>>>> 
>>>>>>>> 
>>>>>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I see my PRs failing with a similar error.
>>>>>>>> 
>>>>>>>> Pedro
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>> 
>> 
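
Since the thread describes the crash as non-deterministic and hard to reproduce, one simple approach is to rerun the failing test command in a loop until a run is killed by a signal. The sketch below assumes a POSIX system, where subprocess reports a signal-terminated child as a negative return code; the function name run_until_signal and the commands passed to it are hypothetical, not part of the MXNet CI scripts.

```python
import signal
import subprocess
import sys


def run_until_signal(cmd, max_runs=50):
    """Run `cmd` up to `max_runs` times.

    Returns (run_number, signal_name) for the first run that was killed
    by a signal, or None if no run died that way.
    """
    for run_number in range(1, max_runs + 1):
        completed = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX, a negative return code -N means "killed by signal N".
        if completed.returncode < 0:
            return run_number, signal.Signals(-completed.returncode).name
    return None


if __name__ == "__main__":
    # Placeholder command; substitute the actual unit-test invocation.
    print(run_until_signal([sys.executable, "-c", "pass"], max_runs=3))
```

Ordinary test failures (positive exit codes) are deliberately ignored here, so the loop only stops on crashes such as SIGSEGV or SIGABRT.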