mxnet-dev mailing list archives

From Bhavin Thaker <bhavintha...@gmail.com>
Subject Re: segmentation fault in master using mkdlnn
Date Thu, 03 May 2018 14:02:12 GMT
Hi Pedro, All,

1) I would suggest that we run “ulimit -c unlimited” in every CI Slave
machine at startup to enable core-dump and get stack trace.

2) Valgrind on Python generates so much noise that extracting signal from
it is painful, but it is still worth trying and looking at the messages
towards the end, when the crash happens.  Even a one-line Python script
produces Valgrind noise, which shows that Python itself is not
Valgrind-clean.

3) If there are C++ APIs to trigger the same functionality as the current
problematic use-case, then one could write a small program to reproduce the
crash and then use Valgrind to get to the culprit portion of the code
quickly.
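
For the Python unit tests, points 1 and the missing stack traces could
also be approached from inside the test process. The following is only a
minimal sketch, assuming Python 3 on Linux; the function name and the
trace_path argument are hypothetical and not part of the MXNet CI scripts.
It raises the soft core-dump limit up to the hard limit (the in-process
counterpart of "ulimit -c unlimited") and turns on the standard
faulthandler module so that a segfault writes the Python traceback to a
log file.

```python
import faulthandler
import resource


def enable_crash_diagnostics(trace_path):
    """Allow core dumps and log Python stack traces on fatal signals.

    trace_path is a hypothetical log location chosen by the caller.
    Returns the open log file; keep it open for the life of the process.
    """
    # In-process counterpart of "ulimit -c unlimited": raise the soft
    # core-size limit as far as the hard limit permits.
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

    # faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS
    # and SIGILL that write the Python traceback of every thread.
    trace_file = open(trace_path, "w")
    faulthandler.enable(file=trace_file, all_threads=True)
    return trace_file
```

Calling this once at test start-up would be enough. Whether a core file
is actually written still depends on the hard limit and on the kernel's
core_pattern setting of the CI slave, so the machine-level ulimit change
remains useful.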

Bhavin Thaker.

On Thu, May 3, 2018 at 6:49 AM Pedro Larroy <pedro.larroy.lists@gmail.com>
wrote:

> It's very difficult to reproduce, non-deterministic. We were also running
> without signal handlers in CI so there are no stack traces unfortunately.
>
> Care to elaborate why valgrind doesn't work with Python?
>
>
>
> On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1936@gmail.com> wrote:
>
> > Can we build it in CI? The segfault doesn't happen infrequently.
> >
> > On May 2, 2018 at 11:34 PM, "Chris Olivier" <cjolivier01@gmail.com> wrote:
> >
> > > > you can try Intel Inspector, which is like an enhanced version of
> > > > valgrind with a GUI and whatnot.
> > >
> > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1936@gmail.com> wrote:
> > >
> > > > > valgrind doesn't work with Python. Also, valgrind doesn't support
> > > > > some CPU instructions used by MXNet (I think some instructions
> > > > > related to the random generator).
> > > >
> > > >
> > > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavinthaker@gmail.com>
> > > > > wrote:
> > > > > > Have you tried running with valgrind to get some clues on the
> > > > > > root-cause?
> > > > >
> > > > > Bhavin Thaker.
> > > > >
> > > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1936@gmail.com> wrote:
> > > > >
> > > > >> It might also be possible that this isn't an MKLDNN bug.
> > > > >> I just saw a similar memory error without MKLDNN build.
> > > > >>
> > > > >>
> > > > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
> > > > >>
> > > > >> Best,
> > > > >> Da
> > > > >>
> > > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzzhen@amazon.com> wrote:
> > > > >> > There might be a race condition that causes the memory error.
> > > > >> > It might be caused by this PR:
> > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
> > > > >> > This PR removes MKLDNN memory from NDArray.
> > > > > >> > However, I don't know why this causes a memory error. If someone
> > > > > >> > is using the memory, it should still hold the memory with a
> > > > > >> > shared pointer.
> > > > > >> > But I do see the memory errors increase after this PR was merged.
> > > > >> >
> > > > >> > Best,
> > > > >> > Da
> > > > >> >
> > > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
> > > > >> >
> > > > >> >     I couldn't reproduce locally with:
> > > > >> >
> > > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
> > > > >> >     build_ubuntu_cpu_mkldnn && ci/build.py --platform
ubuntu_cpu
> > > > >> >     /work/runtime_functions.sh unittest_ubuntu_python2_cpu
> > > > >> >
> > > > >> >
> > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <
> > > > >> pedro.larroy.lists@gmail.com>
> > > > >> >     wrote:
> > > > >> >
> > > > >> >     > Hi
> > > > >> >     >
> > > > >> >     > Seems master is not running  anymore, there's a
> segmentation
> > > > fault
> > > > >> using
> > > > >> >     > MKDLNN-CPU
> > > > >> >     >
> > > > >> >     >
> > > > > >> >     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
> > > > >> >     >
> > > > >> >     >
> > > > >> >     > I see my PRs failing with a similar error.
> > > > >> >     >
> > > > >> >     > Pedro
> > > > >> >     >
> > > > >> >
> > > > >> >
> > > > >>
> > > >
> > >
> >
>
