mxnet-dev mailing list archives

From Pedro Larroy <pedro.larroy.li...@gmail.com>
Subject Re: segmentation fault in master using mkdlnn
Date Thu, 03 May 2018 13:57:25 GMT
@Chris It seems Intel Inspector requires purchasing a license, right? Maybe some of us
already own a license and can run the test that fails intermittently:
 test_module.py:test_forward_reshape
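
For anyone without an Inspector license, one cheap way to chase an intermittent failure like this is to rerun the test in a loop until it crashes. The following is only a minimal sketch, not something taken from the MXNet CI scripts: the helper names `run_until_failure` and `describe` are hypothetical, and the test command in the trailing comment (the nose runner and the `tests/python/unittest/` path) is an assumption about the local checkout. It relies on the POSIX convention that `subprocess` reports a child killed by a signal as a negative return code.

```python
import signal
import subprocess
import sys


def run_until_failure(cmd, max_runs=100):
    """Run cmd up to max_runs times.

    Returns (run_number, returncode) for the first non-zero exit,
    or None if every run succeeded.
    """
    for run_number in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return run_number, result.returncode
    return None


def describe(returncode):
    """Turn a subprocess return code into a short human-readable string.

    On POSIX a negative return code means the child was killed by that
    signal, so a segfault typically shows up as -11 (SIGSEGV).
    """
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


# Hypothetical usage (command and path are assumptions, adjust locally).
# "-X faulthandler" is a standard Python 3 option that dumps the Python
# traceback when the interpreter receives a fatal signal:
#
#   cmd = [sys.executable, "-X", "faulthandler", "-m", "nose",
#          "tests/python/unittest/test_module.py:test_forward_reshape"]
#   failure = run_until_failure(cmd, max_runs=500)
#   if failure is not None:
#       print("run %d %s" % (failure[0], describe(failure[1])))
```

Note that faulthandler only prints the Python-level traceback; a native stack trace would still need a core dump or a debugger attached to the failing run.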

On Thu, May 3, 2018 at 3:49 PM, Pedro Larroy <pedro.larroy.lists@gmail.com>
wrote:

> It's very difficult to reproduce, non-deterministic. We were also running
> without signal handlers in CI so there are no stack traces unfortunately.
>
> Care to elaborate why valgrind doesn't work with Python?
>
>
>
> On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1936@gmail.com> wrote:
>
>> Can we build it in CI? The segfault doesn't happen infrequently.
>>
>> On May 2, 2018 at 11:34 PM, "Chris Olivier" <cjolivier01@gmail.com> wrote:
>>
>> > you can try Intel Inspector, which is like an enhanced version of valgrind
>> > with a GUI and whatnot.
>> >
>> > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1936@gmail.com> wrote:
>> >
>> > > valgrind doesn't work with Python. Also, valgrind doesn't support some
>> > > CPU instructions used by MXNet (I think some instructions related to
>> > > random number generation).
>> > >
>> > >
>> > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavinthaker@gmail.com> wrote:
>> > > > Have you tried running with valgrind to get some clues on the root cause?
>> > > >
>> > > > Bhavin Thaker.
>> > > >
>> > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1936@gmail.com> wrote:
>> > > >
>> > > >> It might also be possible that this isn't an MKLDNN bug.
>> > > >> I just saw a similar memory error in a build without MKLDNN.
>> > > >>
>> > > >>
>> > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
>> > > >>
>> > > >> Best,
>> > > >> Da
>> > > >>
>> > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzzhen@amazon.com> wrote:
>> > > >> > There might be a race condition that causes the memory error.
>> > > >> > It might be caused by this PR:
>> > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
>> > > >> > This PR removes MKLDNN memory from NDArray.
>> > > >> > However, I don't know why this causes a memory error. If someone is using
>> > > >> > the memory, it should still hold the memory through a shared pointer.
>> > > >> > But I do see the memory errors increase after this PR was merged.
>> > > >> >
>> > > >> > Best,
>> > > >> > Da
>> > > >> >
>> > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
>> > > >> >
>> > > >> >     I couldn't reproduce locally with:
>> > > >> >
>> > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
>> > > >> >     build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
>> > > >> >     /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>> > > >> >
>> > > >> >
>> > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
>> > > >> >
>> > > >> >     > Hi
>> > > >> >     >
>> > > >> >     > Seems master is not running anymore; there's a segmentation fault using
>> > > >> >     > MKLDNN-CPU
>> > > >> >     >
>> > > >> >     >
>> > > >> >     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
>> > > >> >     >
>> > > >> >     >
>> > > >> >     > I see my PRs failing with a similar error.
>> > > >> >     >
>> > > >> >     > Pedro
>> > > >> >     >
>> > > >> >
>> > > >> >
>> > > >>
>> > >
>> >
>>
>
>
