mxnet-dev mailing list archives

From Da Zheng <zhengda1...@gmail.com>
Subject Re: segmentation fault in master using mkdlnn
Date Fri, 04 May 2018 19:17:38 GMT
I have come up with a temporary solution for this memory error.
https://github.com/apache/incubator-mxnet/pull/10812
I tested with Anirudh's command. It works fine.

I call it a temporary solution because it only fixes the segfault. It
seems to me that the race condition can potentially corrupt data in
the input array even without MKLDNN. Please see the description in my
PR for more details.

Best,
Da
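
Because the crash is non-deterministic, one way to gain confidence in a fix like this is to rerun the failing test many times, as the loop in Pedro's patch quoted below does. A minimal sketch of that idea, assuming bash; the helper name `run_until_failure` and the round cap are hypothetical and not part of the MXNet CI scripts:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MXNet CI): rerun a command until it
# fails or until a maximum number of rounds has passed.
run_until_failure() {
    local max_rounds="$1"; shift
    local round=0
    while [ "$round" -lt "$max_rounds" ]; do
        round=$((round + 1))
        # "$@" is the command under test; a non-zero exit status counts as a failure.
        if ! "$@"; then
            echo "failed on round ${round}"
            return 1
        fi
        echo "round ${round} passed"
    done
    echo "no failure in ${max_rounds} rounds"
    return 0
}

# Possible use with the test named in the thread (the cap of 1000 is arbitrary):
#   ulimit -c unlimited   # allow core dumps, as in the quoted patch
#   run_until_failure 1000 nosetests-2.7 --verbose \
#       tests/python/unittest/test_module.py:test_forward_reshape
```

A capped loop only shows that no failure occurred within that many rounds; it cannot prove the race is gone.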

On Fri, May 4, 2018 at 12:14 PM, Zheng, Da <dzzhen@amazon.com> wrote:
> Hello Pedro,
>
> I did exactly what you said in your previous email.
>
> I edited ci/docker/runtime_functions.sh based on your patch, and here is the history of
> running your commands:
>  2004  vim ci/docker/runtime_functions.sh
>  2005  ci/docker/runtime_functions.sh clean_repo
>  2006  ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>
> Best,
> Da
>
> On 5/4/18, 4:32 AM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
>
>     Hi Da. I ran it both on my Ubuntu 16.04 workstation and on a p3 instance with
>     DLAMI. I'm pretty confident it runs in most Linux environments.
>
>     Can you post the exact commands that you ran? It is not clear to me from your
>     paste what the problem is. Please make sure your repo is clean and all your
>     subrepos are clean before starting the docker build.
>
>     ci/docker/runtime_functions.sh clean_repo
>
>     Pedro.
>
>     On Thu, May 3, 2018 at 7:17 PM, Zheng, Da <dzzhen@amazon.com> wrote:
>
>     > Hello Pedro,
>     >
>     > I tried your instructions. It seems I can't run the docker in EC2
>     > instances.
>     > Where did you reproduce the error?
>     >
>     > Thanks,
>     > Da
>     >
>     > + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
>     > + gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
>     > gpg: directory `/root/.gnupg' created
>     > gpg: new configuration file `/root/.gnupg/gpg.conf' created
>     > gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active during
>     > this run
>     > gpg: keyring `/root/.gnupg/secring.gpg' created
>     > gpg: keyring `/root/.gnupg/pubring.gpg' created
>     > gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
>     > gpg: keyserver timed out
>     > gpg: keyserver receive failed: keyserver error
>     > The command '/bin/sh -c /work/ubuntu_r.sh' returned a non-zero code: 2
>     > Traceback (most recent call last):
>     >   File "ci/build.py", line 263, in <module>
>     >     sys.exit(main())
>     >   File "ci/build.py", line 197, in main
>     >     build_docker(platform, docker_binary)
>     >   File "ci/build.py", line 73, in build_docker
>     >     check_call(cmd)
>     >   File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
>     >     raise CalledProcessError(retcode, cmd)
>     > subprocess.CalledProcessError: Command '['docker', 'build', '-f',
>     > 'docker/Dockerfile.build.ubuntu_cpu', '--build-arg', 'USER_ID=1000',
>     > '-t', 'mxnet/build.ubuntu_cpu', 'docker']' returned non-zero exit status 2
>     >
>     >
>     > On 5/3/18, 8:01 AM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
>     >
>     >     Hi Da
>     >
>     >     Reproduction instructions:
>     >
>     >     On the host:
>     >
>     >     Adjust core pattern:
>     >
>     >     $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
>     >
>     >
>     >     Use the following patch:
>     >
>     >     ===============
>     >
>     >     diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
>     >     --- a/3rdparty/mkldnn
>     >     +++ b/3rdparty/mkldnn
>     >     @@ -1 +1 @@
>     >     -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
>     >     +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
>     >     diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
>     >     index 027e287..62649c9 100755
>     >     --- a/ci/docker/runtime_functions.sh
>     >     +++ b/ci/docker/runtime_functions.sh
>     >     @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
>     >          # https://github.com/apache/incubator-mxnet/issues/10026
>     >          #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
>     >          export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
>     >     -    nosetests-2.7 --verbose tests/python/unittest
>     >     -    nosetests-2.7 --verbose tests/python/train
>     >     -    nosetests-2.7 --verbose tests/python/quantization
>     >     +    export MXNET_TEST_SEED=11
>     >     +    export MXNET_MODULE_SEED=812478194
>     >     +    pwd
>     >     +    export MXNET_TEST_COUNT=10000
>     >     +    ulimit -c unlimited
>     >     +    ulimit -c
>     >     +    while nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape; do echo round; done
>     >     +    #nosetests-2.7 --verbose tests/python/train
>     >     +    #nosetests-2.7 --verbose tests/python/quantization
>     >      }
>     >
>     >      unittest_ubuntu_python3_cpu() {
>     >
>     >
>     >
>     >     ==============
>     >
>     >     Build and execute the test, make sure the repo is clean
>     >
>     >     $ ci/docker/runtime_functions.sh clean_repo
>     >
>     >     $ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
>     >     build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
>     >     /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>     >
>     >
>     >     Once it crashes it will stop.
>     >
>     >     Then go in the container:
>     >
>     >
>     >     $ ci/build.py -p ubuntu_cpu --into-container --print-docker-run
>     >
>     >     A core should be there.
>     >
>     >     You might need to install gdb as root by executing the previous command
>     >     without uid so you can use apt-get.
>     >
>     >
>     >
>     >
>     >     Good luck.
>     >
>     >
>     >
>     >
>     >
>     >
>     >
>     >     On Thu, May 3, 2018 at 4:51 PM, Zheng, Da <dzzhen@amazon.com> wrote:
>     >
>     >     > Thanks a lot for locating the error.
>     >     > Could you tell me how you reproduced the error?
>     >     >
>     >     > On 5/3/18, 7:45 AM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
>     >     >
>     >     >     Looks like a problem in mkl's same_shape
>     >     >
>     >     >     the pointer to mkldnn::memory::desc &desc  looks invalid.
>     >     >
>     >     >     (More stack frames follow...)
>     >     >     (gdb) p desc
>     >     >     $1 = (const mkldnn::memory::desc &) @0x10: <error reading variable>
>     >     >     (gdb) p dtype
>     >     >     $2 = 0
>     >     >     (gdb) p shape
>     >     >     $3 = (const mxnet::TShape &) @0x7f3905a58b50: {<nnvm::Tuple<long>> =
>     >     >     {static kStackCache = <optimized out>, ndim_ = 2, num_heap_allocated_ = 0,
>     >     >         data_stack_ = {20, 1, 139878025134112, 28}, data_heap_ = 0x0},
>     >     >     <No data fields>}
>     >     >     (gdb)
>     >     >
>     >     >
>     >     >     On Thu, May 3, 2018 at 4:36 PM, Zheng, Da <dzzhen@amazon.com> wrote:
>     >     >
>     >     >     > There are a few problems with valgrind, which make it not an ideal
>     >     >     > tool for mxnet with the Python interface.
>     >     >     >
>     >     >     > First, valgrind generates a huge number of irrelevant messages, most
>     >     >     > of them from Python itself.
>     >     >     >
>     >     >     > Second, valgrind can't emulate all CPU instructions. I remember that
>     >     >     > when I ran valgrind with mxnet, valgrind exited with a strange error.
>     >     >     > I later found that it was caused by an unsupported CPU instruction.
>     >     >     >
>     >     >     > Third, valgrind doesn't support multithreading well. As far as I know,
>     >     >     > valgrind runs everything in a single thread even if the program uses
>     >     >     > multi-threading. An error like this, which is likely caused by a race
>     >     >     > condition, can't be caught by valgrind.
>     >     >     >
>     >     >     > I used to use Address Sanitizer for memory errors. This tool is much
>     >     >     > faster and can work with multiple threads. However, it doesn't work
>     >     >     > with Python for some reason.
>     >     >     >
>     >     >     > One thing we potentially can do is to use a memory checker for C++
>     >     >     > unit tests. Not sure it'll cover all the memory errors we want.
>     >     >     >
>     >     >     > Best,
>     >     >     > Da
>     >     >     >
>     >     >     > On 5/3/18, 6:50 AM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
>     >     >     >
>     >     >     >     It's very difficult to reproduce, non-deterministic. We were also
>     >     >     >     running without signal handlers in CI so there are no stack traces
>     >     >     >     unfortunately.
>     >     >     >
>     >     >     >     Care to elaborate why valgrind doesn't work with Python?
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >     >     On Thu, May 3, 2018 at 3:32 PM, Da Zheng <zhengda1936@gmail.com> wrote:
>     >     >     >
>     >     >     >     > can we build it in CI? segfault doesn't happen infrequently.
>     >     >     >     >
>     >     >     >     > On May 2, 2018 at 11:34 PM, "Chris Olivier" <cjolivier01@gmail.com> wrote:
>     >     >     >     >
>     >     >     >     > > you can try Intel Inspector, which is like an enhanced version of
>     >     >     >     > > valgrind with a GUI and whatnot.
>     >     >     >     > >
>     >     >     >     > > On Wed, May 2, 2018 at 9:42 PM Da Zheng <zhengda1936@gmail.com> wrote:
>     >     >     >     > >
>     >     >     >     > > > valgrind doesn't work with Python. also, valgrind doesn't support some
>     >     >     >     > > > CPU instructions used by MXNet (I think some instructions related to
>     >     >     >     > > > random generator).
>     >     >     >     > > >
>     >     >     >     > > >
>     >     >     >     > > > On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker <bhavinthaker@gmail.com>
>     >     >     >     > > > wrote:
>     >     >     >     > > > > Have you tried running with valgrind to get some clues on the
>     >     >     >     > > > > root-cause?
>     >     >     >     > > > >
>     >     >     >     > > > > Bhavin Thaker.
>     >     >     >     > > > >
>     >     >     >     > > > > On Wed, May 2, 2018 at 8:55 PM Da Zheng <zhengda1936@gmail.com> wrote:
>     >     >     >     > > > >
>     >     >     >     > > > >> It might also be possible that this isn't an MKLDNN bug.
>     >     >     >     > > > >> I just saw a similar memory error without MKLDNN build.
>     >     >     >     > > > >>
>     >     >     >     > > > >> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
>     >     >     >     > > > >>
>     >     >     >     > > > >> Best,
>     >     >     >     > > > >> Da
>     >     >     >     > > > >>
>     >     >     >     > > > >> On Wed, May 2, 2018 at 2:14 PM, Zheng, Da <dzzhen@amazon.com> wrote:
>     >     >     >     > > > >> > There might be a race condition that causes the memory error.
>     >     >     >     > > > >> > It might be caused by this PR:
>     >     >     >     > > > >> > https://github.com/apache/incubator-mxnet/pull/10706/files
>     >     >     >     > > > >> > This PR removes MKLDNN memory from NDArray.
>     >     >     >     > > > >> > However, I don't know why this causes memory error. If someone is
>     >     >     >     > > > >> > using the memory, it should still hold the memory with shared pointer.
>     >     >     >     > > > >> > But I do see the memory error increase after this PR is merged.
>     >     >     >     > > > >> >
>     >     >     >     > > > >> > Best,
>     >     >     >     > > > >> > Da
>     >     >     >     > > > >> >
>     >     >     >     > > > >> > On 5/2/18, 12:26 PM, "Pedro Larroy" <pedro.larroy.lists@gmail.com> wrote:
>     >     >     >     > > > >> >
>     >     >     >     > > > >> >     I couldn't reproduce locally with:
>     >     >     >     > > > >> >
>     >     >     >     > > > >> >     ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
>     >     >     >     > > > >> >     build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
>     >     >     >     > > > >> >     /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>     >     >     >     > > > >> >
>     >     >     >     > > > >> >
>     >     >     >     > > > >> >     On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy <pedro.larroy.lists@gmail.com>
>     >     >     >     > > > >> >     wrote:
>     >     >     >     > > > >> >
>     >     >     >     > > > >> >     > Hi
>     >     >     >     > > > >> >     >
>     >     >     >     > > > >> >     > Seems master is not running anymore, there's a segmentation fault
>     >     >     >     > > > >> >     > using MKLDNN-CPU
>     >     >     >     > > > >> >     >
>     >     >     >     > > > >> >     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/801/pipeline/662
>     >     >     >     > > > >> >     >
>     >     >     >     > > > >> >     >
>     >     >     >     > > > >> >     > I see my PRs failing with a similar error.
>     >     >     >     > > > >> >     >
>     >     >     >     > > > >> >     > Pedro
>     >     >     >     > > > >> >     >
>     >     >     >     > > > >> >
>     >     >     >     > > > >> >
>     >     >     >     > > > >>
>     >     >     >     > > >
>     >     >     >     > >
>     >     >     >     >
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >
>     >     >
>     >     >
>     >
>     >
>     >
>
>

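Pedro's instructions in the quoted thread set the core pattern to `/tmp/core.%h.%e.%t` (hostname, executable name, time of dump) and then say "a core should be there". A small sketch of how one might locate the newest such file before opening it in gdb, assuming bash and GNU coreutils; `newest_core` is a hypothetical helper, and the `python2.7` binary name in the usage comment is an assumption about the container:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the most recently modified core file in a
# directory (default /tmp), matching the core.<host>.<exe>.<time> naming
# produced by the core_pattern set in the thread.
newest_core() {
    local dir="${1:-/tmp}"
    # ls -t sorts by modification time, newest first; the error output is
    # silenced so that nothing is printed when no core file exists.
    ls -t "$dir"/core.* 2>/dev/null | head -n 1
}

# Possible use inside the container, once gdb is installed:
#   core="$(newest_core /tmp)"
#   [ -n "$core" ] && gdb "$(command -v python2.7)" "$core"
```

Passing the interpreter binary and the core file to gdb is the standard `gdb <executable> <core>` invocation; which interpreter actually crashed depends on how the tests were run.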