mxnet-dev mailing list archives

From "Zheng, Da" <dzz...@amazon.com>
Subject Re: segmentation fault in master using mkdlnn
Date Thu, 03 May 2018 14:51:18 GMT
Thanks a lot for locating the error.
Could you tell me how you reproduced the error?

On 5/3/18, 7:45 AM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:

    Looks like a problem in mkl's same_shape
    
    The reference mkldnn::memory::desc &desc looks invalid (gdb shows it at address 0x10).
    
    (More stack frames follow...)
    (gdb) p desc
    $1 = (const mkldnn::memory::desc &) @0x10: <error reading variable>
    (gdb) p dtype
    $2 = 0
    (gdb) p shape
    $3 = (const mxnet::TShape &) @0x7f3905a58b50: {<nnvm::Tuple<long>> =
    {static kStackCache = <optimized out>, ndim_ = 2, num_heap_allocated_ = 0,
        data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ = 0x0}, <No data
    fields>}
    (gdb)
    
    
    On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dzzhen@amazon.com> wrote:
    
    > There are a few problems with valgrind, which makes it not an ideal tool
    > for mxnet with python interface.
    >
    > First, valgrind generates a huge number of irrelevant messages, most of
    > them from Python itself.
    >
    > Second, valgrind can't emulate all CPU instructions. I remember that when
    > I ran valgrind with mxnet, valgrind exited with a strange error. I later
    > found that it was caused by an unsupported CPU instruction.
    >
    > Third, valgrind doesn't support multithreading well. As far as I know,
    > valgrind runs everything in a single thread even if the program uses
    > multi-threading. An error like this, which is likely caused by a race
    > condition, can't be caught by valgrind.
    >
    > I used to use Address Sanitizer for memory errors. This tool is much
    > faster and can work with multi-threads. However, it doesn't work with
    > Python for some reason.
    >
    > One thing we potentially can do is to use a memory checker for C++ unit
    > tests. Not sure it'll cover all the memory errors we want.
    >
    > Best,
    > Da
    >
    > On 5/3/18, 6:50 AM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
    >
    >     It's very difficult to reproduce, non-deterministic. We were also running
    >     without signal handlers in CI so there are no stack traces unfortunately.
    >
    >     Care to elaborate why valgrind doesn't work with Python?
    >
    >
    >
    >     On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1936@gmail.com> wrote:
    >
    >     > Can we build it in CI? The segfault doesn't happen infrequently.
    >     >
    >     > On May 2, 2018 at 11:34 PM, "Chris Olivier" <cjolivier01@gmail.com> wrote:
    >     >
    >     > > You can try Intel Inspector, which is like an enhanced version of
    >     > > valgrind with a GUI and whatnot.
    >     > >
    >     > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1936@gmail.com> wrote:
    >     > >
    >     > > > valgrind doesn't work with Python. Also, valgrind doesn't support some
    >     > > > CPU instructions used by MXNet (I think some instructions related to
    >     > > > random generator).
    >     > > >
    >     > > >
    >     > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavinthaker@gmail.com> wrote:
    >     > > > > Have you tried running with valgrind to get some clues on the
    >     > > > > root-cause?
    >     > > > >
    >     > > > > Bhavin Thaker.
    >     > > > >
    >     > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1936@gmail.com> wrote:
    >     > > > >
    >     > > > >> It might also be possible that this isn't an MKLDNN bug.
    >     > > > >> I just saw a similar memory error without MKLDNN build.
    >     > > > >>
    >     > > > >>
    >     > > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
    >     > > > >>
    >     > > > >> Best,
    >     > > > >> Da
    >     > > > >>
    >     > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzzhen@amazon.com> wrote:
    >     > > > >> > There might be a race condition that causes the memory error.
    >     > > > >> > It might be caused by this PR:
    >     > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
    >     > > > >> > This PR removes MKLDNN memory from NDArray.
    >     > > > >> > However, I don't know why this causes a memory error. If someone is
    >     > > > >> > using the memory, it should still hold the memory with a shared pointer.
    >     > > > >> > But I do see the memory errors increase after this PR was merged.
    >     > > > >> >
    >     > > > >> > Best,
    >     > > > >> > Da
    >     > > > >> >
    >     > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
    >     > > > >> >
    >     > > > >> >     I couldn't reproduce locally with:
    >     > > > >> >
    >     > > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
    >     > > > >> >     build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
    >     > > > >> >     /work/runtime_functions.sh unittest_ubuntu_python2_cpu
    >     > > > >> >
    >     > > > >> >
    >     > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
    >     > > > >> >
    >     > > > >> >     > Hi
    >     > > > >> >     >
    >     > > > >> >     > Seems master is not running anymore, there's a segmentation fault
    >     > > > >> >     > using MKLDNN-CPU
    >     > > > >> >     >
    >     > > > >> >     >
    >     > > > >> >     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
    >     > > > >> >     >
    >     > > > >> >     >
    >     > > > >> >     > I see my PRs failing with a similar error.
    >     > > > >> >     >
    >     > > > >> >     > Pedro
    >     > > > >> >     >
    >     > > > >> >
    >     > > > >> >
    >     > > > >>
    >     > > >
    >     > >
    >     >
    >
    >
    >
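The expectation stated earlier in the thread, that anyone still using the memory "should still hold the memory with shared pointer", can be illustrated with a minimal C++ sketch. This is only an illustration of the general ownership pattern, not MXNet or MKLDNN code: FakeMem, FakeArray, Share, Borrow and Release are hypothetical names invented here, and whether anything like Borrow exists in the code touched by the PR is an assumption, not something the thread confirms.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for an MKLDNN memory object (NOT the real mkldnn::memory).
struct FakeMem {
    std::vector<long> dims{20, 1};
};

// Hypothetical stand-in for an array that owns its memory through a shared_ptr.
struct FakeArray {
    std::shared_ptr<FakeMem> mem = std::make_shared<FakeMem>();

    // Safe: the caller becomes a co-owner, so the memory outlives Release().
    std::shared_ptr<FakeMem> Share() const { return mem; }

    // Risky: the caller only borrows; the pointer dangles after Release().
    const FakeMem* Borrow() const { return mem.get(); }

    // Models the array dropping its memory (for example, from another thread).
    void Release() { mem.reset(); }
};

// Returns true if the memory is still alive after the array releases it
// while a co-owning shared_ptr copy exists.
bool SurvivesWithSharedCopy() {
    FakeArray arr;
    std::shared_ptr<FakeMem> user = arr.Share();
    std::weak_ptr<FakeMem> probe = user;  // observes lifetime without owning
    arr.Release();
    return !probe.expired() && user->dims.size() == 2;
}

// Returns true if the memory is still alive after the array releases it
// while only a borrowed raw pointer exists. The raw pointer is never
// dereferenced, because doing so would be undefined behavior.
bool SurvivesWithBorrowedPointer() {
    FakeArray arr;
    const FakeMem* borrowed = arr.Borrow();
    std::weak_ptr<FakeMem> probe = arr.mem;
    arr.Release();
    (void)borrowed;  // dangling from here on
    return !probe.expired();
}
```

Calling SurvivesWithSharedCopy() returns true and SurvivesWithBorrowedPointer() returns false. If some code path kept only a raw pointer or reference to memory that the array later released, an invalid reference like the one in the gdb output would be one possible symptom, but that is a guess rather than a diagnosis.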
    
