mxnet-dev mailing list archives

From Pedro Larroy <pedro.larroy.li...@gmail.com>
Subject Re: segmentation fault in master using mkdlnn
Date Thu, 03 May 2018 14:40:08 GMT
Hi Bhavin

Good suggestion

I tried 1), but I couldn't get a core dump inside the container, even with
ulimit -c unlimited.
I found out that /proc/sys/kernel/core_pattern on Ubuntu defaults to a pipe
to /usr/share/apport/apport, which doesn't exist inside the container.

Changing it outside the container with
echo 'core.%h.%e.%t' > /proc/sys/kernel/core_pattern
solves this mystery, so now I have a core dump, which I added to the ticket.
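
A minimal sketch of the check described above, in Python: it reads the core
size limit and /proc/sys/kernel/core_pattern, then reports whether the pattern
pipes to a helper that is not present on the filesystem where the script runs.
The function names and messages are illustrative assumptions, not part of the
MXNet CI scripts.

```python
import os
import resource

CORE_PATTERN_PATH = "/proc/sys/kernel/core_pattern"


def parse_core_pattern(pattern):
    """Return (is_pipe, helper_path) for a core_pattern string."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        # A piped pattern names a user-space helper followed by its arguments.
        parts = pattern[1:].split()
        return True, (parts[0] if parts else None)
    return False, None


def diagnose_core_dumps(pattern_path=CORE_PATTERN_PATH):
    """Collect likely reasons why a crash would not leave a core file."""
    problems = []
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    if soft == 0:
        problems.append("RLIMIT_CORE is 0; run 'ulimit -c unlimited' first")
    try:
        with open(pattern_path) as f:
            pattern = f.read()
    except OSError as exc:
        problems.append("cannot read %s: %s" % (pattern_path, exc))
        return problems
    is_pipe, helper = parse_core_pattern(pattern)
    if is_pipe and (helper is None or not os.path.exists(helper)):
        problems.append("core_pattern pipes to %r, which is missing here" % helper)
    return problems


if __name__ == "__main__":
    for line in diagnose_core_dumps() or ["no obvious problem found"]:
        print(line)
```

Run inside the container before the tests start; an empty result only means
that neither of these two causes applies.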

Trying to get to the bottom of the issue :-)
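
On the point quoted below that CI ran without signal handlers: for Python 3
test runs, the standard-library faulthandler module can at least write the
Python-level traceback when a fatal signal such as SIGSEGV arrives. A minimal
sketch follows; the helper name and the log path are assumptions, and it shows
Python frames only, so the core dump is still needed for the native stack.

```python
import faulthandler


def enable_crash_traces(log_path):
    """Write Python tracebacks of all threads to log_path on fatal signals."""
    # faulthandler keeps using this file until it is disabled, so the file
    # is returned and the caller must keep it open for the process lifetime.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling enable_crash_traces("crash.log") once at test start-up would be
enough; the file name here is only an example.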



On Thu, May 3, 2018 at 4:02 PM, Bhavin Thaker <bhavinthaker@gmail.com>
wrote:

> Hi Pedro, All,
>
> 1) I would suggest that we run "ulimit -c unlimited" on every CI slave
> machine at startup to enable core dumps and get stack traces.
>
> 2) Valgrind on Python generates so much noise that extracting signal from
> it is painful, but it is still worth trying out and looking at the messages
> towards the end, when the crash happens. Valgrind on even a one-line Python
> script generates noise, which demonstrates that Python itself is not
> Valgrind-clean.
>
> 3) If there are C++ APIs to trigger the same functionality as the current
> problematic use-case, then one could write a small program to reproduce the
> crash and then use Valgrind to get to the culprit portion of the code
> quickly.
>
> Bhavin Thaker.
>
> On Thu, May 3, 2018 at 6:49 AM Pedro Larroy <pedro.larroy.lists@gmail.com>
> wrote:
>
> > It's very difficult to reproduce, non-deterministic. We were also running
> > without signal handlers in CI so there are no stack traces unfortunately.
> >
> > Care to elaborate why valgrind doesn't work with Python?
> >
> >
> >
> > On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1936@gmail.com> wrote:
> >
> > > Can we build it in CI? The segfault doesn't happen infrequently.
> > >
> > > On May 2, 2018, at 11:34 PM, "Chris Olivier" <cjolivier01@gmail.com> wrote:
> > >
> > > > You can try Intel Inspector, which is like an enhanced version of
> > > > valgrind with a GUI and whatnot.
> > > >
> > > >
> > > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1936@gmail.com> wrote:
> > > >
> > > > > Valgrind doesn't work with Python. Also, valgrind doesn't support
> > > > > some CPU instructions used by MXNet (I think some instructions
> > > > > related to the random number generator).
> > > > >
> > > > >
> > > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavinthaker@gmail.com>
> > > > > wrote:
> > > > > > Have you tried running with valgrind to get some clues on the
> > > > > > root cause?
> > > > > >
> > > > > > Bhavin Thaker.
> > > > > >
> > > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1936@gmail.com> wrote:
> > > > > >
> > > > > >> It might also be possible that this isn't an MKLDNN bug.
> > > > > >> I just saw a similar memory error without MKLDNN build.
> > > > > >>
> > > > > >>
> > > > > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
> > > > > >>
> > > > > >> Best,
> > > > > >> Da
> > > > > >>
> > > > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzzhen@amazon.com> wrote:
> > > > > > >> > There might be a race condition that causes the memory error.
> > > > > > >> > It might be caused by this PR:
> > > > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
> > > > > > >> > This PR removes MKLDNN memory from NDArray.
> > > > > > >> > However, I don't know why this would cause a memory error. If
> > > > > > >> > someone is using the memory, it should still hold the memory
> > > > > > >> > through a shared pointer.
> > > > > > >> > But I do see the memory errors increase after this PR was merged.
> > > > > >> >
> > > > > >> > Best,
> > > > > >> > Da
> > > > > >> >
> > > > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
> > > > > > >> >
> > > > > > >> >     I couldn't reproduce locally with:
> > > > > > >> >
> > > > > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
> > > > > > >> >     build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
> > > > > > >> >     /work/runtime_functions.sh unittest_ubuntu_python2_cpu
> > > > > >> >
> > > > > >> >
> > > > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > > > > >> >     wrote:
> > > > > > >> >
> > > > > > >> >     > Hi
> > > > > > >> >     >
> > > > > > >> >     > Seems master is not running anymore; there's a segmentation
> > > > > > >> >     > fault using MKLDNN-CPU
> > > > > > >> >     >
> > > > > > >> >     >
> > > > > > >> >     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
> > > > > >> >     >
> > > > > >> >     >
> > > > > >> >     > I see my PRs failing with a similar error.
> > > > > >> >     >
> > > > > >> >     > Pedro
> > > > > >> >     >
> > > > > >> >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
>
