mxnet-dev mailing list archives

From Pedro Larroy <pedro.larroy.li...@gmail.com>
Subject Re: segmentation fault in master using mkdlnn
Date Thu, 03 May 2018 16:58:23 GMT
I tried to compile with MKLDNN using CMake + CLion and ran into some
difficulties, even though I have mkldnn in the 3rdparty folder and
MKL installed in /usr/local.

What exactly are the steps to compile with MKLDNN using CMake? I have seen
this documented only for Make.

Pedro.

On Thu, May 3, 2018 at 4:59 PM, Pedro Larroy <pedro.larroy.lists@gmail.com>
wrote:

> Hi Da
>
> Reproduction instructions:
>
> On the host:
>
> Adjust core pattern:
>
> $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
>
>
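For reference, a minimal Python sketch of how that core pattern expands, so it is clear which file name to look for under /tmp. It handles only the three specifiers used above (%h host name, %e executable name, %t dump time in seconds since the epoch, as described in core(5)); the function name and the sample values are made up for illustration, and the kernel may truncate the executable name.

```python
def expand_core_pattern(pattern, hostname, executable, timestamp):
    """Expand the %h, %e and %t specifiers of a core_pattern template."""
    values = {
        "h": hostname,
        "e": executable,
        "t": str(int(timestamp)),
        "%": "%",  # "%%" is a literal percent sign
    }
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%":
            if i + 1 < len(pattern):
                # Specifiers not covered by this sketch are dropped.
                out.append(values.get(pattern[i + 1], ""))
                i += 2
            else:
                # A single trailing "%" is dropped.
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Hypothetical host name, executable and timestamp:
print(expand_core_pattern("/tmp/core.%h.%e.%t", "buildhost", "python2.7", 1525366703))
# prints /tmp/core.buildhost.python2.7.1525366703
```
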
> Use the following patch:
>
> ===============
>
> diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
> --- a/3rdparty/mkldnn
> +++ b/3rdparty/mkldnn
> @@ -1 +1 @@
> -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
> +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
> diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
> index 027e287..62649c9 100755
> --- a/ci/docker/runtime_functions.sh
> +++ b/ci/docker/runtime_functions.sh
> @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
>      # https://github.com/apache/incubator-mxnet/issues/10026
>      #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
>      export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> -    nosetests-2.7 --verbose tests/python/unittest
> -    nosetests-2.7 --verbose tests/python/train
> -    nosetests-2.7 --verbose tests/python/quantization
> +    export MXNET_TEST_SEED=11
> +    export MXNET_MODULE_SEED=812478194
> +    pwd
> +    export MXNET_TEST_COUNT=10000
> +    ulimit -c unlimited
> +    ulimit -c
> +    while nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape; do echo round; done
> +    #nosetests-2.7 --verbose tests/python/train
> +    #nosetests-2.7 --verbose tests/python/quantization
>  }
>
>  unittest_ubuntu_python3_cpu() {
>
>
>
> ==============
>
> Make sure the repo is clean, then build and execute the test:
>
> $ ci/docker/runtime_functions.sh clean_repo
>
> $ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && \
>     ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>
>
> Once it crashes it will stop.
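
The `while ...; do echo round; done` loop in the patch relies on the exit status of the test command: it repeats while the command succeeds and stops at the first failure. A rough Python equivalent, in case a round limit or a round count is wanted; `run_until_failure` and `max_rounds` are names invented for this sketch and are not part of the MXNet CI scripts.

```python
import subprocess
import sys


def run_until_failure(cmd, max_rounds=None):
    """Run cmd repeatedly and return the number of rounds that succeeded
    before the first non-zero exit status (or before max_rounds was hit)."""
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # On POSIX, death by signal (e.g. SIGSEGV) gives a negative code.
            break
        rounds += 1
        print("round", rounds, file=sys.stderr)
    return rounds
```

It could be called with the same test as in the patch, for example `run_until_failure(["nosetests-2.7", "--verbose", "tests/python/unittest/test_module.py:test_forward_reshape"])`.
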
>
> Then go in the container:
>
>
> $ ci/build.py -p ubuntu_cpu --into-container --print-docker-run
>
> A core file should be there.
>
> You might need to install gdb as root: run the previous command
> without the uid so that you can use apt-get.
>
>
>
>
> Good luck.
>
>
>
>
>
>
>
> On Thu, May 3, 2018 at 4:51 PM, Zheng, Da <dzzhen@amazon.com> wrote:
>
>> Thanks a lot for locating the error.
>> Could you tell me how you reproduced the error?
>>
>> On 5/3/18, 7:45 AM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
>>
>>     Looks like a problem in mkl's same_shape.
>>
>>     The mkldnn::memory::desc &desc reference looks invalid (gdb shows it
>>     at address 0x10):
>>
>>     (More stack frames follow...)
>>     (gdb) p desc
>>     $1 = (const mkldnn::memory::desc &) @0x10: <error reading variable>
>>     (gdb) p dtype
>>     $2 = 0
>>     (gdb) p shape
>>     $3 = (const mxnet::TShape &) @0x7f3905a58b50: {<nnvm::Tuple<long>> =
>>     {static kStackCache = <optimized out>, ndim_ = 2, num_heap_allocated_ = 0,
>>         data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ = 0x0}, <No data
>>     fields>}
>>     (gdb)
>>
>>
>>     On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dzzhen@amazon.com> wrote:
>>
>>     > There are a few problems with valgrind that make it a less than
>>     > ideal tool for mxnet with the Python interface.
>>     >
>>     > First, valgrind generates a huge number of irrelevant messages, most
>>     > of them from Python itself.
>>     >
>>     > Second, valgrind can't emulate all CPU instructions. I remember that
>>     > when I ran valgrind with mxnet, valgrind exited with a strange error.
>>     > I later found that it was caused by an unsupported CPU instruction.
>>     >
>>     > Third, valgrind doesn't support multithreading well. As far as I
>>     > know, valgrind runs everything in a single thread even if the program
>>     > uses multi-threading. An error like this, which is likely caused by a
>>     > race condition, can't be caught by valgrind.
>>     >
>>     > I used to use Address Sanitizer for memory errors. This tool is much
>>     > faster and can work with multiple threads. However, it doesn't work
>>     > with Python for some reason.
>>     >
>>     > One thing we can potentially do is to use a memory checker for the
>>     > C++ unit tests. Not sure it'll cover all the memory errors we want.
>>     >
>>     > Best,
>>     > Da
>>     >
>>     > On 5/3/18, 6:50 AM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
>>     >
>>     >     It's very difficult to reproduce because it is non-deterministic. We
>>     >     were also running without signal handlers in CI, so unfortunately
>>     >     there are no stack traces.
>>     >
>>     >     Care to elaborate why valgrind doesn't work with Python?
>>     >
>>     >
>>     >
>>     >     On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1936@gmail.com> wrote:
>>     >
>>     >     > Can we build it in CI? The segfault doesn't happen infrequently.
>>     >     >
>>     >     > On May 2, 2018, 11:34 PM, "Chris Olivier" <cjolivier01@gmail.com> wrote:
>>     >     >
>>     >     > > You can try Intel Inspector, which is like an enhanced version of
>>     >     > > valgrind with a GUI and whatnot.
>>     >     > >
>>     >     > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1936@gmail.com> wrote:
>>     >     > >
>>     >     > > > valgrind doesn't work with Python. Also, valgrind doesn't support
>>     >     > > > some CPU instructions used by MXNet (I think some instructions
>>     >     > > > related to the random generator).
>>     >     > > >
>>     >     > > >
>>     >     > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavinthaker@gmail.com> wrote:
>>     >     > > > > Have you tried running with valgrind to get some clues on the
>>     >     > > > > root cause?
>>     >     > > > >
>>     >     > > > > Bhavin Thaker.
>>     >     > > > >
>>     >     > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1936@gmail.com> wrote:
>>     >     > > > >
>>     >     > > > >> It might also be possible that this isn't an MKLDNN bug.
>>     >     > > > >> I just saw a similar memory error in a build without MKLDNN:
>>     >     > > > >>
>>     >     > > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
>>     >     > > > >>
>>     >     > > > >> Best,
>>     >     > > > >> Da
>>     >     > > > >>
>>     >     > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzzhen@amazon.com> wrote:
>>     >     > > > >> > There might be a race condition that causes the memory error.
>>     >     > > > >> > It might be caused by this PR:
>>     >     > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
>>     >     > > > >> > This PR removes MKLDNN memory from NDArray.
>>     >     > > > >> > However, I don't know why this would cause a memory error. If
>>     >     > > > >> > someone is using the memory, they should still hold it through
>>     >     > > > >> > a shared pointer.
>>     >     > > > >> > But I do see memory errors increase after this PR was merged.
>>     >     > > > >> >
>>     >     > > > >> > Best,
>>     >     > > > >> > Da
>>     >     > > > >> >
>>     >     > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
>>     >     > > > >> >
>>     >     > > > >> >     I couldn't reproduce locally with:
>>     >     > > > >> >
>>     >     > > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && \
>>     >     > > > >> >         ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>>     >     > > > >> >
>>     >     > > > >> >
>>     >     > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
>>     >     > > > >> >
>>     >     > > > >> >     > Hi
>>     >     > > > >> >     >
>>     >     > > > >> >     > It seems master is not running anymore; there's a segmentation
>>     >     > > > >> >     > fault using MKLDNN-CPU:
>>     >     > > > >> >     >
>>     >     > > > >> >     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
>>     >     > > > >> >     >
>>     >     > > > >> >     >
>>     >     > > > >> >     > I see my PRs failing with a similar error.
>>     >     > > > >> >     >
>>     >     > > > >> >     > Pedro
>>     >     > > > >> >     >
>>     >     > > > >> >
>>     >     > > > >> >
>>     >     > > > >>
>>     >     > > >
>>     >     > >
>>     >     >
>>     >
>>     >
>>     >
>>
>>
>>
>
