mxnet-dev mailing list archives

From "Zheng, Da" <dzz...@amazon.com>
Subject Re: segmentation fault in master using mkdlnn
Date Thu, 03 May 2018 17:17:13 GMT
Hello Pedro,

I tried your instructions, but it seems I can't build the Docker image on EC2 instances.
Where did you reproduce the error?

Thanks,
Da

+ echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
+ gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg: directory `/root/.gnupg' created
gpg: new configuration file `/root/.gnupg/gpg.conf' created
gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active during this run
gpg: keyring `/root/.gnupg/secring.gpg' created
gpg: keyring `/root/.gnupg/pubring.gpg' created
gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
gpg: keyserver timed out
gpg: keyserver receive failed: keyserver error
The command '/bin/sh -c /work/ubuntu_r.sh' returned a non-zero code: 2
Traceback (most recent call last):
  File "ci/build.py", line 263, in <module>
    sys.exit(main())
  File "ci/build.py", line 197, in main
    build_docker(platform, docker_binary)
  File "ci/build.py", line 73, in build_docker
    check_call(cmd)
  File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['docker', 'build', '-f', 'docker/Dockerfile.build.ubuntu_cpu',
'--build-arg', 'USER_ID=1000', '-t', 'mxnet/build.ubuntu_cpu', 'docker']' returned non-zero
exit status 2
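
The log above fails at the gpg --recv-key step with a keyserver timeout, which may be a transient network problem rather than a fault in the build itself. A minimal sketch of one possible workaround, assuming it is acceptable to simply retry the failing command. The helper name check_call_with_retry and its parameters are hypothetical and are not part of ci/build.py.

```python
import subprocess
import time


def check_call_with_retry(cmd, attempts=3, delay=5.0,
                          runner=subprocess.check_call, sleep=time.sleep):
    """Hypothetical helper: run cmd, retrying when it exits non-zero.

    Returns the 1-based number of the attempt that succeeded and re-raises
    the last CalledProcessError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            runner(cmd)
            return attempt
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise
            # Exponential backoff: delay, 2 * delay, 4 * delay, ...
            sleep(delay * 2 ** (attempt - 1))


# Example call (not executed here), using the command from the traceback:
# check_call_with_retry(['docker', 'build', '-f',
#                        'docker/Dockerfile.build.ubuntu_cpu',
#                        '--build-arg', 'USER_ID=1000',
#                        '-t', 'mxnet/build.ubuntu_cpu', 'docker'])
```

Retrying only hides a flaky keyserver. If the timeout is persistent, the retries will all fail and the original error is raised unchanged.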


On 5/3/18, 8:01 AM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:

    Hi Da
    
    Reproduction instructions:
    
    On the host:
    
    Adjust core pattern:
    
    $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
    
    
    Use the following patch:
    
    ===============
    
    diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
    --- a/3rdparty/mkldnn
    +++ b/3rdparty/mkldnn
    @@ -1 +1 @@
    -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
    +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
    diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
    index 027e287..62649c9 100755
    --- a/ci/docker/runtime_functions.sh
    +++ b/ci/docker/runtime_functions.sh
    @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
         # https://github.com/apache/incubator-mxnet/issues/10026
         #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
         export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
    -    nosetests-2.7 --verbose tests/python/unittest
    -    nosetests-2.7 --verbose tests/python/train
    -    nosetests-2.7 --verbose tests/python/quantization
    +    export MXNET_TEST_SEED=11
    +    export MXNET_MODULE_SEED=812478194
    +    pwd
    +    export MXNET_TEST_COUNT=10000
    +    ulimit -c unlimited
    +    ulimit -c
    +    while nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape; do echo round; done
    +    #nosetests-2.7 --verbose tests/python/train
    +    #nosetests-2.7 --verbose tests/python/quantization
     }
    
     unittest_ubuntu_python3_cpu() {
    
    
    
    ==============
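
The while loop in the patch re-runs a single test until it exits with a non-zero status. The same idea can be sketched in Python, which also makes it easy to tell a crash from an ordinary test failure. This is only an illustration under stated assumptions: run_until_failure, describe and max_rounds are hypothetical names, not part of the MXNet CI scripts.

```python
import signal
import subprocess


def run_until_failure(cmd, max_rounds=1000, runner=subprocess.call):
    """Hypothetical helper: re-run cmd until it exits non-zero.

    Returns (round_number, returncode) for the first failing round, or
    None if all max_rounds rounds passed.
    """
    for round_number in range(1, max_rounds + 1):
        returncode = runner(cmd)
        if returncode != 0:
            return round_number, returncode
        print("round", round_number)
    return None


def describe(returncode):
    """On POSIX, subprocess reports death by signal N as return code -N."""
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status {}".format(returncode)


# Example call (not executed here), using the test from the patch:
# result = run_until_failure([
#     "nosetests-2.7", "--verbose",
#     "tests/python/unittest/test_module.py:test_forward_reshape"])
# if result is not None:
#     print("round", result[0], describe(result[1]))
```

A segmentation fault shows up as a negative return code (killed by SIGSEGV), whereas a failing assertion in the test shows up as a positive exit status.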
    
    Make sure the repo is clean, then build and execute the test:
    
    $ ci/docker/runtime_functions.sh clean_repo
    
    $ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && \
      ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
    
    
    Once it crashes it will stop.
    
    Then go in the container:
    
    
    $ ci/build.py -p ubuntu_cpu --into-container --print-docker-run
    
    A core should be there.
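
With the core pattern set to /tmp/core.%h.%e.%t as above, the kernel fills in the hostname (%h), the executable name (%e) and the dump time in seconds since the epoch (%t). A small sketch for locating such files, assuming the pattern is a plain path template. The function names core_glob and find_cores are hypothetical.

```python
import glob
import re


def core_glob(core_pattern):
    """Hypothetical helper: turn a kernel core_pattern into a glob pattern.

    Every %-specifier becomes '*', '%%' becomes a literal '%', and the
    literal parts are escaped so they are matched verbatim.
    """
    if core_pattern.startswith("|"):
        # A leading '|' means the kernel pipes the dump to a program
        # instead of writing a file, so there is nothing to glob for.
        raise ValueError("core_pattern pipes dumps to a program")
    pieces = []
    for part in re.split(r"(%.)", core_pattern):
        if part == "%%":
            pieces.append("%")
        elif len(part) == 2 and part.startswith("%"):
            pieces.append("*")
        else:
            pieces.append(glob.escape(part))
    return "".join(pieces)


def find_cores(core_pattern):
    """Return the sorted list of existing files matching core_pattern."""
    return sorted(glob.glob(core_glob(core_pattern)))


# Example call (not executed here):
# print(find_cores("/tmp/core.%h.%e.%t"))
```

Whether the files are visible at that path from inside the container depends on how the container is started, so the path may need adjusting.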
    
    You might need to install gdb as root: run the previous command
    without the uid so that you can use apt-get.
    
    
    
    
    Good luck.
    
    
    
    
    
    
    
    On Thu, May 3, 2018 at 4:51 PM, Zheng, Da <dzzhen@amazon.com> wrote:
    
    > Thanks a lot for locating the error.
    > Could you tell me how you reproduced the error?
    >
    > On 5/3/18, 7:45 AM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
    >
    >     Looks like a problem in mkl's same_shape
    >
    >     The reference mkldnn::memory::desc &desc looks invalid (gdb shows it bound to address 0x10).
    >
    >     (More stack frames follow...)
    >     (gdb) p desc
    >     $1 = (const mkldnn::memory::desc &) @0x10: <error reading variable>
    >     (gdb) p dtype
    >     $2 = 0
    >     (gdb) p shape
    >     $3 = (const mxnet::TShape &) @0x7f3905a58b50: {<nnvm::Tuple<long>> =
    >     {static kStackCache = <optimized out>, ndim_ = 2, num_heap_allocated_ = 0,
    >         data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ = 0x0},
    >     <No data fields>}
    >     (gdb)
    >
    >
    >     On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dzzhen@amazon.com> wrote:
    >
    >     > There are a few problems with valgrind, which make it not an ideal
    >     > tool for mxnet with the Python interface.
    >     >
    >     > First, valgrind generates a huge number of irrelevant messages, most of
    >     > them from Python itself.
    >     >
    >     > Second, valgrind can't emulate all CPU instructions. I remember that when
    >     > I ran valgrind with mxnet, valgrind exited with a strange error. I later
    >     > found that it was caused by an unsupported CPU instruction.
    >     >
    >     > Third, valgrind doesn't support multithreading well. As far as I know,
    >     > valgrind runs everything in a single thread even if the program uses
    >     > multithreading. An error like this, which is likely caused by a race
    >     > condition, can't be caught by valgrind.
    >     >
    >     > I used to use Address Sanitizer for memory errors. This tool is much
    >     > faster and can work with multiple threads. However, it doesn't work with
    >     > Python for some reason.
    >     >
    >     > One thing we can potentially do is use a memory checker for the C++ unit
    >     > tests. I'm not sure it'll cover all the memory errors we want.
    >     >
    >     > Best,
    >     > Da
    >     >
    >     > On 5/3/18, 6:50 AM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
    >     >
    >     >     It's very difficult to reproduce, non-deterministic. We were also
    >     >     running without signal handlers in CI so there are no stack traces
    >     >     unfortunately.
    >     >
    >     >     Care to elaborate why valgrind doesn't work with Python?
    >     >
    >     >
    >     >
    >     >     On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1936@gmail.com> wrote:
    >     >
    >     >     > Can we build it in CI? The segfault doesn't happen infrequently.
    >     >     >
    >     >     > On May 2, 2018 at 11:34 PM, "Chris Olivier" <cjolivier01@gmail.com> wrote:
    >     >     >
    >     >     > > You can try Intel Inspector, which is like an enhanced version of
    >     >     > > valgrind with a GUI and whatnot.
    >     >     > >
    >     >     > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1936@gmail.com> wrote:
    >     >     > >
    >     >     > > > valgrind doesn't work with Python. Also, valgrind doesn't support
    >     >     > > > some CPU instructions used by MXNet (I think some instructions
    >     >     > > > related to the random generator).
    >     >     > > >
    >     >     > > >
    >     >     > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavinthaker@gmail.com> wrote:
    >     >     > > > > Have you tried running with valgrind to get some clues on the
    >     >     > > > > root cause?
    >     >     > > > >
    >     >     > > > > Bhavin Thaker.
    >     >     > > > >
    >     >     > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1936@gmail.com> wrote:
    >     >     > > > >
    >     >     > > > >> It might also be possible that this isn't an MKLDNN bug.
    >     >     > > > >> I just saw a similar memory error in a build without MKLDNN.
    >     >     > > > >>
    >     >     > > > >>
    >     >     > > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
    >     >     > > > >>
    >     >     > > > >> Best,
    >     >     > > > >> Da
    >     >     > > > >>
    >     >     > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzzhen@amazon.com> wrote:
    >     >     > > > >> > There might be a race condition that causes the memory error.
    >     >     > > > >> > It might be caused by this PR:
    >     >     > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
    >     >     > > > >> > This PR removes MKLDNN memory from NDArray.
    >     >     > > > >> > However, I don't know why this causes the memory error. If someone
    >     >     > > > >> > is using the memory, it should still hold the memory with a shared
    >     >     > > > >> > pointer.
    >     >     > > > >> > But I do see the memory error occur more often after this PR was merged.
    >     >     > > > >> >
    >     >     > > > >> > Best,
    >     >     > > > >> > Da
    >     >     > > > >> >
    >     >     > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
    >     >     > > > >> >
    >     >     > > > >> >     I couldn't reproduce locally with:
    >     >     > > > >> >
    >     >     > > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && \
    >     >     > > > >> >     ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
    >     >     > > > >> >
    >     >     > > > >> >
    >     >     > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
    >     >     > > > >> >
    >     >     > > > >> >     > Hi
    >     >     > > > >> >     >
    >     >     > > > >> >     > Seems master is not running anymore; there's a segmentation
    >     >     > > > >> >     > fault using MKLDNN-CPU.
    >     >     > > > >> >     >
    >     >     > > > >> >     >
    >     >     > > > >> >     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
    >     >     > > > >> >     >
    >     >     > > > >> >     >
    >     >     > > > >> >     > I see my PRs failing with a similar error.
    >     >     > > > >> >     >
    >     >     > > > >> >     > Pedro
    >     >     > > > >> >     >
