mxnet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lausen, Leonard" <lau...@amazon.com.INVALID>
Subject Re: Please remove conflicting Open MP version from CMake builds
Date Mon, 09 Dec 2019 03:00:19 GMT
The assertion failure in the MXNet DEBUG build goes away by updating LLVM OpenMP
to the latest released version. All evidence I have points to the assertion
failure being due to a bug in the 2 years old UNRELEASED version of LLVM OpenMP.
that we are using currently in CMake builds.

Thus I'm requesting 3 commiters to approve 
https://github.com/apache/incubator-mxnet/pull/17012 to update to a released
version of LLVM OpenMP.

As described in the PR, the assertion is still part of LLVM OpenMP 9.0 codebase.
In particular look at lines 
https://github.com/llvm-mirror/openmp/blob/release_90/runtime/src/kmp_runtime.cpp#L6616
and 
https://github.com/llvm-mirror/openmp/blob/37c72127e90360a020f351f18d9cccfc30e5145a/runtime/src/kmp_runtime.cpp#L6481
where the latter is the line that currently fails in MXNet DEBUG build and the
former is the equivalent line that doesn't fail in MXNet DEBUG builds after
updating LLVM OpenMP.



There is also a crash with Intel OpenMP as well as both the old UNRELEASED and
the new, released version LLVM OpenMP that happens after forking. That crash
doesn't go away and needs to be root-caused 
https://github.com/apache/incubator-mxnet/issues/14979
 

On Sun, 2019-12-08 at 16:27 -0800, Pedro Larroy wrote:
> Hi Leonard.
> 
> Are you saying that you have updated this library and the problems desribed
> in the related tickets are no longer present?
> 
> P.
> 
> On Sunday, December 8, 2019, Lausen, Leonard <lausen@amazon.com.invalid>
> wrote:
> > Thanks Pedro and Chris for your responses.
> > 
> > After further investigation I find:
> > 
> > 1) I don't think https://github.com/apache/incubator-mxnet/issues/14979 is
> > caused by any incompatibility between gomp and llvm / intel omp. Rather
> it's
> > simply a problem of llvm / intel omp. See my comment to the issue for the
> > methodology to arrive at this claim.
> > 
> > 2) Regarding the assertion failure when compiling with (llvm)
> 3rdparty/openmp,
> > it can be fixed by updating the by now 2 years old llvm openmp code to the
> > newest released version. I went ahead and opened a PR
> > https://github.com/apache/incubator-mxnet/pull/17012
> > 
> > Based on the investigation described in 1), I think Chris is right that
> the
> > assertion failure is not due to some interaction between gomp and llvm
> omp.
> > However, I'm not sure about Chris's suggestion that the assertion fa

> > ilure
> is due
> > to a bug in MXNet. In fact, the failure goes away when updating the llvm
> openmp
> > code. So I think it's just due to a bug in the 2 years old code.
> > 
> > @Chris, I think updating 3rdparty/openmp to fix the assertion issue is not
> > contentious. Thus let's do it via lazy consensus (72 hours) or just
> approve the
> > PR and merge it.
> > 
> > Please also take a look at my comment at #14979 and let everyone know if
> you see
> > any option to fix the bug while keeping 3rdparty/openmp. As this bug
> affects an
> > important use-case, I beleive we need to remove 3rdparty/openmp from the
> CMake
> > build as long as we don't find a solution for making #14979 work with
> > 3rdparty/openmp.
> > 
> > In fact, removing 3rdparty/openmp will then match the current Makefile
> setup
> > that according to my understanding is used to build the nightly releases
> used by
> > the majority of developers. Ie. most users actually don't use the CMake
> build
> > with 3rdparty/openmp. You can consider rescinding your veto on removing
> > 3rdparty/openmp after reading through the evidence in that issue. If you
> don't
> > provide any evidence for why the methodology/conclusion in #14979 is
> flawed, I
> > will assume your previous veto is void based on Apache Voting rule as it
> lacks
> > technical justification and in any case was motivated by the assertion
> issue,
> > which I agree with you, is likely not due to gomp / omp interaction.
> > 
> > Thank you
> > Leonard
> > 
> > 
> > On Sat, 2019-12-07 at 15:40 -0800, Pedro Larroy wrote:
> > > Stop disseminating false information:
> > > 
> > > https://github.com/apache/incubator-mxnet/issues/14979
> > > 
> > > 
> > > On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier <cjolivier01@gmail.com>
> wrote:
> > > > -1
> > > > 
> > > > mkldnn removed omp5 for licencing issues
> > > > no bugs have actually been traced to the use of llvm openmp. only an
> assert
> > > > caused by an actual bug in mxnet code. there are suitable workarounds.
> > > > 
> > > > over time llvm omp has simply been used as a “catch all” for random
> > > > problems that aren’t related at all (such as getenv race condition in
> an
> > > > atfork call that isn’t even part of an omp parallel region).
> > > > 
> > > > proposal is now and has always been roughly equivalent to the idea of
> > > > “comment out an assert rather than fix the bug it’s reporting”.
> > > > 
> > > > Up until very recently, Makefile version of mxnet used libomp5 for
> YEARS
> > > > and not libgomp, with no issue reported (omp not built in debug mode),
> so
> > > > the equivalent configuration from CMake mysteriously causing myriads if
> > > > problems has questionable merit and smells more like a hubris
> situation.
> > > > I use tensorflow as well and it links to libomp5 rather than libgomp.
> > > > 
> > > > if the assert problem is really a problem, the bug being reported
> would be
> > > > prioritized and fixed. it should be fixed regardless. all the time
> spent by
> > > > some CI people trying to remove this could have simply fixed the
> actual bug
> > > > in a small fraction of the time.
> > > > 
> > > > 
> > > > On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard
> <lausen@amazon.com.invalid>
> > > > wrote:
> > > > 
> > > > > I think it's reasonable to assume that the Intel MKLDNN team is an
> > > > > "authorative"
> > > > > source about the issue of compilation with OpenMP and the OpenMP
> runtime
> > > > > library
> > > > > related issues. Thus I suggest we follow the recommendation of Intel
> > > > > MKLDNN team
> > > > > within the MXNet project.
> > > > > 
> > > > > Looking through the Intel MKLDNN documentation, I find [1]:
> > > > > 
> > > > > > DNNL uses OpenMP runtime library provided by the compiler.
> > > > > 
> > > > > as well as
> > > > > 
> > > > > > it's important to ensure that only one OpenMP runtime is used
> > > > throughout
> > > > > the
> > > > > > application. Having more than one OpenMP runtime linked to an
> > > > executable
> > > > > may
> > > > > > lead to undefined behavior including incorrect results or crashes.
> > > > > 
> > > > > To keep our project maintainable and error free, I thus suggest we
> follow
> > > > > DNNL
> > > > > and use the OpenMP runtime library provided by the compiler.
> > > > > We have limited ressources and finding the root cause for any bugs
> > > > > resulting
> > > > > from linking multiple OpenMP libraries as currently done is, in my
> > > > > opinion. not
> > > > > a good use of time. We know it's due to undefined behavior and we
> know
> > > > > it's best
> > > > > practice to use OpenMP runtime library provided by the compiler.
So
> let's
> > > > > just
> > > > > do that.
> > > > > 
> > > > > I think given that MKL-DNN has also adopted the "OpenMP runtime
> library
> > > > > provided
> > > > > by the compiler" approach, this issue is not contentious anymore
and
> > > > > qualifies
> > > > > for lazy consensus.
> > > > > 
> > > > > Thus if there is no objection within 72 hours (lazy consensus), let's
> > > > drop
> > > > > bundled LLVM OpenMP from master [2]. If we find any issues due to
> > > > > droppeing the
> > > > > bundled LLVM OpenMP, we can always add it back prior to the next
> release.
> > > > > Best regards
> > > > > Leonard
> > > > > 
> > > > > [1]:
> > > > > 
> > > > > 
> https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
> > > > > (This is the updated reference from Anton's previous comment, based
> on
> > > > the
> > > > > changes in MKLDNN done in the meantime
> > > > > 
> https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066
> > > > > )
> > > > > [2]: Alike https://github.com/apache/incubator-mxnet/pull/12160
> > > > > 
> > > > > 
> > > > > On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
> > > > > > I will try to stay on the sidelines for now since previous
> > > > conversations
> > > > > > about OMP have not been productive here and I have spent way
too
> much
> > > > > time
> > > > > > on this already, I'm not the first one giving up on trying to
help
> with
> > > > > > this topic.
> > > > > > 
> > > > > > I would be glad if you guys can work together and find a solution.
> I
> > > > will
> > > > > > just put my understanding of the big picture hoping that it
helps
> move
> > > > it
> > > > > > forward.
> > > > > > 
> > > > > > 
> > > > > > Recently the intel omp library which seemed to have the best
> > > > performance
> > > > > of
> > > > > > the 3 was removed from MKL.
> > > > > > 
> > > > > > - There's 3 libraries in play, GNU Omp which is shipped with
gcc
> > > > (gomp),
> > > > > > LLVM openmp in 3rdparty (llvm-omp), Intel OMP when using MKL,
> which is
> > > > > > recently removed (iomp)
> > > > > > 
> > > > > > - IOMP seems to have the best performance, there's stability
issues
> > > > > > producing crashes sometimes but the impact seems relatively
small
> for
> > > > > users
> > > > > > and developers. In general seems linking with a different OMP
> version
> > > > > that
> > > > > > the one shipped with the compiler is known to cause stability
> issues
> > > > but
> > > > > > it's done anyway.
> > > > > > 
> > > > > > - LLVM-OMP used when building with CMake, not used in the PIP
> releases
> > > > or
> > > > > > when building with Make. Has stability issues, hangs when running
> in
> > > > > debug
> > > > > > mode during test execution and produces tons of assertions in
debug
> > > > mode.
> > > > > > Might have some small performance gains but there is no clear
cut
> data
> > > > > that
> > > > > > showcases significant performance gains.
> > > > > > 
> > > > > > - GOMP is the version shipped with GCC and the PIP wheels without
> MKL,
> > > > > has
> > > > > > no stability problems.
> > > > > > 
> > > > > > As a ballpark, IOMP might give 10% performance improvement in
some
> > > > cases.
> > > > > > We need to document well how users should tune and configure
MXNet
> when
> > > > > > using OMP.
> > > > > > 
> > > > > > As a developer, the safest bet is to use GOMP to be able to
debug
> and
> > > > > > develop without issues. As a user of CPU inference / training
you
> want
> > > > to
> > > > > > run MKL so depends on how the Intel guys want to do things.
My
> > > > preference
> > > > > > as an engineer is always stability > speed.
> > > > > > 
> > > > > > Related tickets:
> > > > > > 
> > > > > > https://github.com/apache/incubator-mxnet/issues/16891
> > > > > > 
> > > > > > 
> https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-562637931
> > > > > > https://github.com/apache/incubator-mxnet/issues/11417
> > > > > > 
> > > > > > https://github.com/apache/incubator-mxnet/issues/15690
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > On Fri, Dec 6, 2019 at 12:39 AM Lausen, Leonard
> > > > > <lausen@amazon.com.invalid>
> > > > > > wrote:
> > > > > > 
> > > > > > > Is this related to
> > > > > https://github.com/apache/incubator-mxnet/issues/10856?
> > > > > > > I unlocked that Github issue based on the Apache Code of
Conduct
> > > > > > > 
> > > > https://www.apache.org/foundation/policies/conduct#specific-guidelines
> > > > > > > On Sat, 2019-11-30 at 02:47 -0800, Pedro Larroy wrote:
> > > > > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6
> (upstream_master)+$
> > > > ldd
> > > > > > > > build/libmxnet.so| grep -i openmp
> > > > > > > >         libomp.so =>
> > > > > > > > 
> /home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so
> > > > > > > > (0x00007fde0991d000)
> > > > > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6
> (upstream_master)+$
> > > > > python
> > > > > > > > ~/deeplearning-benchmark/image_classification/infer_imagenet.py
> > > > > --use-rec
> > > > > > > > --batch-size 256 --dtype float32 --num-data-workers
40 --mode
> > > > hybrid
> > > > > > > > --model resnet50_v2 --use-pretrained --kvstore local
> > > > --log-interval 1
> > > > > > > > --rec-val ~/data/val-passthrough.rec --rec-val-idx
> > > > > > > > ~/data/val-passthrough.idx
> > > > > > > > INFO:root:Namespace(batch_norm=False, batch_size=256,
> > > > > > > > data_dir='~/.mxnet/datasets/imagenet', dataset_size=32,
> > > > > dtype='float32',
> > > > > > > > kvstore='local', last_gamma=False, log_interval=1,
> > > > > logging_dir='logs',
> > > > > > > > lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60', lr_mode='step',
> > > > > > > > lr_poly_power=2, mode='hybrid', model='resnet50_v2',
> momentum=0.9,
> > > > > > > > num_epochs=3, num_gpus=0, num_workers=40,
> > > > > > > > rec_val='/home/piotr/data/val-passthrough.rec',
> > > > > > > > rec_val_idx='/home/piotr/data/val-passthrough.idx',
> > > > > save_dir='params',
> > > > > > > > save_frequency=0, top_k=0, use_pretrained=True, use_rec=True,
> > > > > > > use_se=False,
> > > > > > > > warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
> > > > > > > > [10:42:02] ../src/io/iter_image_recordio_2.cc:178:
> > > > > ImageRecordIOParser2:
> > > > > > > > /home/piotr/data/val-passthrough.rec, use 36 threads
for
> decoding..
> > > > > > > > INFO:root:Batch [0]
> > > > > > > > INFO:root:Top 1 accuracy: 0
> > > > > > > > INFO:root:warmup_throughput: 5 samples/sec warmup_time
> 43.150922
> > > > > > > > INFO:root:Batch [1]
> > > > > > > > INFO:root:Top 1 accuracy: 0
> > > > > > > > INFO:root:warmup_throughput: 6 samples/sec warmup_time
> 37.971927
> > > > > > > > INFO:root:Batch [2]
> > > > > > > > INFO:root:Top 1 accuracy: 0
> > > > > > > > INFO:root:warmup_throughput: 7 samples/sec warmup_time
> 35.755363
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp
> > > > > > > (upstream_master)+$
> > > > > > > > git st
> > > > > > > > On branch upstream_master
> > > > > > > > Your branch is up to date with 'origin/upstream_master'.
> > > > > > > > 
> > > > > > > > Changes not staged for commit:
> > > > > > > >   (use "git add/rm <file>..." to update what
will be committed)
> > > > > > > >   (use "git checkout -- <file>..." to discard
changes in
> working
> > > > > > > directory)
> > > > > > > >         deleted:    3rdparty/openmp
> > > > > > > > 
> > > > > > > > no changes added to commit (use "git add" and/or "git
commit
> -a")
> > > > > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6_plat_omp
> > > > > > > (upstream_master)+$
> > > > > > > > ldd build/libmxnet.so | grep -i omp
> > > > > > > >         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > > > > > (0x00007f941241c000)
> > > > > > > > 
> > > > > > > > (py3_venv) piotr@34-215-197-42:130:~/mxnet_1.6_plat_omp
> > > > > > > (upstream_master)+$
> > > > > > > > python
> > > > > ~/deeplearning-benchmark/image_classification/infer_imagenet.py
> > > > > > > > --use-rec --batch-size 256 --dtype float32 --num-data-workers
> 40
> > > > > --mode
> > > > > > > > hybrid --model resnet50_v2 --use-pretrained --kvstore
local
> > > > > > > --log-interval
> > > > > > > > 1 --rec-val ~/data/val-passthrough.rec --rec-val-idx
> > > > > > > > ~/data/val-passthrough.idx
> > > > > > > > INFO:root:warmup_throughput: 147 samples/sec warmup_time
> 1.735117
> > > > > > > > INFO:root:Batch [16]
> > > > > > > > INFO:root:Top 1 accuracy: 0
> > > > > > > > INFO:root:warmup_throughput: 143 samples/sec warmup_time
> 1.785760
> > > > > > > > INFO:root:Batch [17]
> > > > > > > > INFO:root:Top 1 accuracy: 0
> > > > > > > > INFO:root:warmup_throughput: 148 samples/sec warmup_time
> 1.729033
Mime
View raw message