mxnet-dev mailing list archives

From "Lausen, Leonard" <lau...@amazon.com.INVALID>
Subject Re: [VOTE] Release Apache MXNet (incubating) version 1.6.0.rc2
Date Tue, 04 Feb 2020 23:04:59 GMT
Using latest upstream jemalloc 
https://github.com/leezu/mxnet/commit/fd4c78a635087f6164344da53a55ba2b67da2fd2
fixes the issue. 

However, there were concerns that this commit relies on unreleased development
features of jemalloc (its CMake build system support), so we will not merge it
until upstream ships CMake build system support in a release.

In the meantime anyone is welcome to work on an equivalent patch based on the
custom build system in latest stable jemalloc. 
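
A rough sketch for anyone retrying the reproducer quoted below, which only
fails on some runs: a small POSIX shell helper that reruns a command and stops
at the first failure. The helper name `run_until_failure` and the attempt count
are illustrative assumptions, not part of MXNet or of the original reproducer.

```shell
# Hypothetical helper: run a command up to N times, stopping at the first failure.
# Usage: run_until_failure <attempts> <command> [args...]
run_until_failure() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    # A crash (e.g. a segfault) gives a non-zero exit status, which ends the loop.
    if ! "$@"; then
      echo "run $i failed" >&2
      return 1
    fi
    i=$((i + 1))
  done
  echo "all $attempts runs passed"
}

# Example, from the build directory of the reproducer (not executed here):
#   run_until_failure 5 ./cpp-package/example/test_regress_label
```

A non-zero return from the helper means at least one run failed; a clean pass
over a handful of runs does not prove that the intermittent crash is gone.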

On Tue, 2020-02-04 at 22:46 +0000, Lausen, Leonard wrote:
> Bisect identifies 
> https://github.com/apache/incubator-mxnet/commit/425319cb59904573bd3fe1b6fe0a7381eceb9bbd
> 
> Thus this is an issue with jemalloc + LLVM OpenMP.
> 
> The correct reproducer for latest master branch is
> 
> 
>   git clone --recursive https://github.com/apache/incubator-mxnet/ mxnet
>   cd mxnet
>   git checkout a726c406964b9cd17efa826738a662e09d973972  # workaround https://github.com/apache/incubator-mxnet/issues/17514
>   mkdir build; cd build;
>   cmake -DUSE_CPP_PACKAGE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -GNinja -DUSE_CUDA=OFF -DUSE_JEMALLOC=ON ..
>   ninja
>   ./cpp-package/example/test_regress_label  # run 2-3 times to reproduce
> 
> Let's move the discussion about fixing the jemalloc/OpenMP incompatibility
> to https://github.com/apache/incubator-mxnet/issues/17043
> 
> 
> 
> @Chris, could you look into this issue as it only happens with LLVM OpenMP?
> 
> 
> 
> @Przemek: For the 1.6.0 release notes, I suggest including a recommendation to
> set USE_JEMALLOC=OFF when compiling from source.
> 
> This note should probably be added in any case, as building with
> USE_JEMALLOC=ON is broken on Ubuntu 18.10 and higher, as well as Debian Stable.
> 
> Given these release notes, +1 for the release.
> 
> 
> Best regards
> Leonard
> 
> On Tue, 2020-02-04 at 22:26 +0000, Lausen, Leonard wrote:
> > Actually, the reproducer below is wrong. The issue was apparently fixed on master
> > recently. I'm running an automated bisect and will report the result later.
> > 
> > On Tue, 2020-02-04 at 21:44 +0000, Lausen, Leonard wrote:
> > > Hi Chris,
> > > 
> > > > you previously found and fixed an OMP race condition during fork at
> > > https://github.com/apache/incubator-mxnet/pull/17039
> > > 
> > > This time no forks are involved. Could you run the following reproducer on
> > > master branch:
> > > 
> > >   git clone --recursive https://github.com/apache/incubator-mxnet/ mxnet
> > >   cd mxnet
> > > >   git checkout a726c406964b9cd17efa826738a662e09d973972  # workaround https://github.com/apache/incubator-mxnet/issues/17514
> > > >   mkdir build; cd build;
> > > >   cmake -DUSE_CPP_PACKAGE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -GNinja -DUSE_CUDA=OFF ..
> > > >   ninja
> > > >   ./cpp-package/example/test_regress_label  # run 2-3 times to reproduce
> > > 
> > > 
> > > > As you are an OpenMP expert, you may be able to identify the root cause with
> > > > relative ease.
> > > 
> > > Thank you,
> > > 
> > > Leonard
> > > 
> > > On Tue, 2020-02-04 at 11:06 -0800, Chris Olivier wrote:
> > > > When "fixing", please "fix" through actual root-cause analysis (use gdb,
> > > > for instance) and not simply by guesswork and cutting out things which
> > > > probably aren't actually at fault (blaming an OMP library that's in
> > > > > worldwide distribution in the billions should be treated with great
> > > > skepticism).
> > > > 
> > > > On Tue, Feb 4, 2020 at 10:44 AM Lin Yuan <apeforest@gmail.com> wrote:
> > > > 
> > > > > Pedro,
> > > > > 
> > > > > > While I agree with you we need to fix this usability issue, I don't think
> > > > > > this is a release blocker as Przemek mentioned above. Could we fix this in
> > > > > > the next minor release?
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Lin
> > > > > 
> > > > > > On Tue, Feb 4, 2020 at 10:38 AM Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > > > > wrote:
> > > > > 
> > > > > > > Right. Would it be possible to have the CMake build also use libgomp for
> > > > > > > consistency with the releases until these issues are resolved?
> > > > > > > This can affect anyone compiling the distribution with CMake and also
> > > > > > > happens randomly in CI, worsening the contributor experience due to CI
> > > > > > > failures.
> > > > > > 
> > > > > > > On Tue, Feb 4, 2020 at 9:33 AM Przemysław Trędak <ptrendx@apache.org>
> > > > > > > wrote:
> > > > > > 
> > > > > > > Hi Pedro,
> > > > > > > 
> > > > > > > > From the issue that you linked it seems that you are using the LLVM
> > > > > > > > OpenMP, whereas I believe the actual release uses libgomp (at least that's
> > > > > > > > what seems to be the conclusion from this issue:
> > > > > > > > https://github.com/apache/incubator-mxnet/issues/16891)?
> > > > > > > 
> > > > > > > Przemek
> > > > > > > 
> > > > > > > > On 2020/02/04 03:42:30, Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > > > > > > wrote:
> > > > > > > > -1
> > > > > > > > 
> > > > > > > > Unit tests passed in CPU build.
> > > > > > > > 
> > > > > > > > > I observe crashes related to OpenMP using cpp unit tests:
> > > > > > > > 
> > > > > > > > https://github.com/apache/incubator-mxnet/issues/17043
> > > > > > > > 
> > > > > > > > Pedro.
> > > > > > > > 
> > > > > > > > > On Mon, Feb 3, 2020 at 6:44 PM Chaitanya Bapat <chai.bapat@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > +1
> > > > > > > > > Successfully built MXNet 1.6.0rc2 on Linux
> > > > > > > > > Tested for OpPerf utility
> > > > > > > > > > For CPU -
> > > > > > > > > > https://gist.github.com/ChaiBapchya/d5ecc3e971c5a3c558d672477b4b6b9c
> > > > > > > > > Works well!
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > On Mon, 3 Feb 2020 at 15:43, Lin Yuan <apeforest@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > 
> > > > > > > > > > +1
> > > > > > > > > > 
> > > > > > > > > > > Tested Horovod with mnist example. My compiler flags are below:
> > > > > > > > > > > 
> > > > > > > > > > > [✔ CUDA, ✔ CUDNN, ✔ NCCL, ✔ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE,
> > > > > > > > > > > ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A,
> > > > > > > > > > > ✔ CPU_AVX, ✖ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC,
> > > > > > > > > > > ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK,
> > > > > > > > > > > ✖ MKLDNN, ✔ OPENCV, ✖ CAFFE, ✖ PROFILER, ✔ DIST_KVSTORE, ✖ CXX14,
> > > > > > > > > > > ✖ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✖ DEBUG, ✖ TVM_OP]
> > > > > > > > > > 
> > > > > > > > > > Lin
> > > > > > > > > > 
> > > > > > > > > > On Sat, Feb 1, 2020 at 9:55 PM Tao Lv <taolv@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > > 
> > > > > > > > > > > +1
> > > > > > > > > > > 
> > > > > > > > > > > I tested below items:
> > > > > > > > > > > > 1. download artifacts from Apache dist repo;
> > > > > > > > > > > > 2. the signature looks good;
> > > > > > > > > > > > 3. build from source code with MKL-DNN and MKL on centos;
> > > > > > > > > > > > 4. run fp32 and int8 inference of ResNet50 under
> > > > > > > > > > > >    /example/quantization/.
> > > > > > > > > > > thanks,
> > > > > > > > > > > -tao
> > > > > > > > > > > 
> > > > > > > > > > > > On Sun, Feb 2, 2020 at 11:00 AM Tao Lv <taolv@apache.org> wrote:
> > > > > > > > > > > > > I see. I was looking at this page:
> > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
> > > > > > > > > > > > > On Sun, Feb 2, 2020 at 4:54 AM Przemysław Trędak <ptrendx@apache.org>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > > Hi Tao,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Could you tell me where you looked for it and did not find it? I just
> > > > > > > > > > > > > > checked and both
> > > > > > > > > > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc2/
> > > > > > > > > > > > > > and the draft of the release on GitHub have them.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Thank you
> > > > > > > > > > > > > Przemek
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > On 2020/02/01 14:23:11, Tao Lv <taolv@apache.org> wrote:
> > > > > > > > > > > > > > > It seems the src tar and signature are missing from the tag.
> > > > > > > > > > > > > > > On Fri, Jan 31, 2020 at 11:09 AM Przemysław Trędak <ptrendx@apache.org>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Dear MXNet community,
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > This is the vote to release Apache MXNet (incubating) version 1.6.0.
> > > > > > > > > > > > > > > > Voting starts today and will close on Monday 2/3/2020 23:59 PST.
> > > > > > > > > > > > > > > > Link to release notes:
> > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/MXNET/1.6.0+Release+notes
> > > > > > > > > > > > > > > > Link to release candidate:
> > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.6.0.rc2
> > > > > > > > > > > > > > > > Link to source and signatures on apache dist server:
> > > > > > > > > > > > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.6.0.rc2/
> > > > > > > > > > > > > > > > The differences compared to the previous release candidate 1.6.0.rc1:
> > > > > > > > > > > > > > > >  * Fixes for license issues (#17361, #17375, #17370, #17460)
> > > > > > > > > > > > > > > >  * Bugfix for saving LSTM layer parameter (#17288)
> > > > > > > > > > > > > > > >  * Bugfix for downloading the model from model zoo from multiple
> > > > > > > > > > > > > > > >    processes (#17372)
> > > > > > > > > > > > > > > >  * Fixed a symbol.py in AMP for GluonNLP (#17408)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Please remember to TEST first before voting accordingly:
> > > > > > > > > > > > > > > > +1 = approve
> > > > > > > > > > > > > > > > +0 = no opinion
> > > > > > > > > > > > > > > > -1 = disapprove (provide reason)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > Przemyslaw Tredak
> > > > > > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > --
> > > > > > > > > *Chaitanya Prakash Bapat*
> > > > > > > > > *+1 (973) 953-6299*
> > > > > > > > > 