mxnet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pedro Larroy <pedro.larroy.li...@gmail.com>
Subject Re: Please remove conflicting Open MP version from CMake builds
Date Sat, 07 Dec 2019 23:40:30 GMT
Stop disseminating false information:

https://github.com/apache/incubator-mxnet/issues/14979


On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier <cjolivier01@gmail.com> wrote:

> -1
>
> mkldnn removed omp5 for licencing issues
> no bugs have actually been traced to the use of llvm openmp. only an assert
> caused by an actual bug in mxnet code. there are suitable workarounds.
>
> over time llvm omp has simply been used as a “catch all” for random
> problems that aren’t related at all (such as getenv race condition in an
> atfork call that isn’t even part of an omp parallel region).
>
> proposal is now and has always been roughly equivalent to the idea of
> “comment out an assert rather than fix the bug it’s reporting”.
>
> Up until very recently, Makefile version of mxnet used libomp5 for YEARS
> and not libgomp, with no issue reported (omp not built in debug mode), so
> the equivalent configuration from CMake mysteriously causing myriads if
> problems has questionable merit and smells more like a hubris situation.
>
> I use tensorflow as well and it links to libomp5 rather than libgomp.
>
> if the assert problem is really a problem, the bug being reported would be
> prioritized and fixed. it should be fixed regardless. all the time spent by
> some CI people trying to remove this could have simply fixed the actual bug
> in a small fraction of the time.
>
>
> On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard <lausen@amazon.com.invalid>
> wrote:
>
> > I think it's reasonable to assume that the Intel MKLDNN team is an
> > "authorative"
> > source about the issue of compilation with OpenMP and the OpenMP runtime
> > library
> > related issues. Thus I suggest we follow the recommendation of Intel
> > MKLDNN team
> > within the MXNet project.
> >
> > Looking through the Intel MKLDNN documentation, I find [1]:
> >
> > > DNNL uses OpenMP runtime library provided by the compiler.
> >
> > as well as
> >
> > > it's important to ensure that only one OpenMP runtime is used
> throughout
> > the
> > > application. Having more than one OpenMP runtime linked to an
> executable
> > may
> > > lead to undefined behavior including incorrect results or crashes.
> >
> > To keep our project maintainable and error free, I thus suggest we follow
> > DNNL
> > and use the OpenMP runtime library provided by the compiler.
> > We have limited ressources and finding the root cause for any bugs
> > resulting
> > from linking multiple OpenMP libraries as currently done is, in my
> > opinion. not
> > a good use of time. We know it's due to undefined behavior and we know
> > it's best
> > practice to use OpenMP runtime library provided by the compiler. So let's
> > just
> > do that.
> >
> > I think given that MKL-DNN has also adopted the "OpenMP runtime library
> > provided
> > by the compiler" approach, this issue is not contentious anymore and
> > qualifies
> > for lazy consensus.
> >
> > Thus if there is no objection within 72 hours (lazy consensus), let's
> drop
> > bundled LLVM OpenMP from master [2]. If we find any issues due to
> > droppeing the
> > bundled LLVM OpenMP, we can always add it back prior to the next release.
> >
> > Best regards
> > Leonard
> >
> > [1]:
> >
> >
> https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
> > (This is the updated reference from Anton's previous comment, based on
> the
> > changes in MKLDNN done in the meantime
> >
> https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066
> > )
> > [2]: Alike https://github.com/apache/incubator-mxnet/pull/12160
> >
> >
> > On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
> > > I will try to stay on the sidelines for now since previous
> conversations
> > > about OMP have not been productive here and I have spent way too much
> > time
> > > on this already, I'm not the first one giving up on trying to help with
> > > this topic.
> > >
> > > I would be glad if you guys can work together and find a solution. I
> will
> > > just put my understanding of the big picture hoping that it helps move
> it
> > > forward.
> > >
> > >
> > > Recently the intel omp library which seemed to have the best
> performance
> > of
> > > the 3 was removed from MKL.
> > >
> > > - There's 3 libraries in play, GNU Omp which is shipped with gcc
> (gomp),
> > > LLVM openmp in 3rdparty (llvm-omp), Intel OMP when using MKL, which is
> > > recently removed (iomp)
> > >
> > > - IOMP seems to have the best performance, there's stability issues
> > > producing crashes sometimes but the impact seems relatively small for
> > users
> > > and developers. In general seems linking with a different OMP version
> > that
> > > the one shipped with the compiler is known to cause stability issues
> but
> > > it's done anyway.
> > >
> > > - LLVM-OMP used when building with CMake, not used in the PIP releases
> or
> > > when building with Make. Has stability issues, hangs when running in
> > debug
> > > mode during test execution and produces tons of assertions in debug
> mode.
> > > Might have some small performance gains but there is no clear cut data
> > that
> > > showcases significant performance gains.
> > >
> > > - GOMP is the version shipped with GCC and the PIP wheels without MKL,
> > has
> > > no stability problems.
> > >
> > > As a ballpark, IOMP might give 10% performance improvement in some
> cases.
> > >
> > > We need to document well how users should tune and configure MXNet when
> > > using OMP.
> > >
> > > As a developer, the safest bet is to use GOMP to be able to debug and
> > > develop without issues. As a user of CPU inference / training you want
> to
> > > run MKL so depends on how the Intel guys want to do things. My
> preference
> > > as an engineer is always stability > speed.
> > >
> > > Related tickets:
> > >
> > > https://github.com/apache/incubator-mxnet/issues/16891
> > >
> > >
> >
> https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-562637931
> > >
> > >
> > > https://github.com/apache/incubator-mxnet/issues/11417
> > >
> > > https://github.com/apache/incubator-mxnet/issues/15690
> > >
> > >
> > >
> > > On Fri, Dec 6, 2019 at 12:39 AM Lausen, Leonard
> > <lausen@amazon.com.invalid>
> > > wrote:
> > >
> > > > Is this related to
> > https://github.com/apache/incubator-mxnet/issues/10856?
> > > >
> > > > I unlocked that Github issue based on the Apache Code of Conduct
> > > >
> https://www.apache.org/foundation/policies/conduct#specific-guidelines
> > > >
> > > >
> > > > On Sat, 2019-11-30 at 02:47 -0800, Pedro Larroy wrote:
> > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6 (upstream_master)+$
> ldd
> > > > > build/libmxnet.so| grep -i openmp
> > > > >         libomp.so =>
> > > > > /home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so
> > > > > (0x00007fde0991d000)
> > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6 (upstream_master)+$
> > python
> > > > > ~/deeplearning-benchmark/image_classification/infer_imagenet.py
> > --use-rec
> > > > > --batch-size 256 --dtype float32 --num-data-workers 40 --mode
> hybrid
> > > > > --model resnet50_v2 --use-pretrained --kvstore local
> --log-interval 1
> > > > > --rec-val ~/data/val-passthrough.rec --rec-val-idx
> > > > > ~/data/val-passthrough.idx
> > > > > INFO:root:Namespace(batch_norm=False, batch_size=256,
> > > > > data_dir='~/.mxnet/datasets/imagenet', dataset_size=32,
> > dtype='float32',
> > > > > kvstore='local', last_gamma=False, log_interval=1,
> > logging_dir='logs',
> > > > > lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60', lr_mode='step',
> > > > > lr_poly_power=2, mode='hybrid', model='resnet50_v2', momentum=0.9,
> > > > > num_epochs=3, num_gpus=0, num_workers=40,
> > > > > rec_val='/home/piotr/data/val-passthrough.rec',
> > > > > rec_val_idx='/home/piotr/data/val-passthrough.idx',
> > save_dir='params',
> > > > > save_frequency=0, top_k=0, use_pretrained=True, use_rec=True,
> > > > use_se=False,
> > > > > warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
> > > > > [10:42:02] ../src/io/iter_image_recordio_2.cc:178:
> > ImageRecordIOParser2:
> > > > > /home/piotr/data/val-passthrough.rec, use 36 threads for decoding..
> > > > > INFO:root:Batch [0]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 5 samples/sec warmup_time 43.150922
> > > > > INFO:root:Batch [1]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 6 samples/sec warmup_time 37.971927
> > > > > INFO:root:Batch [2]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 7 samples/sec warmup_time 35.755363
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp
> > > > (upstream_master)+$
> > > > > git st
> > > > > On branch upstream_master
> > > > > Your branch is up to date with 'origin/upstream_master'.
> > > > >
> > > > > Changes not staged for commit:
> > > > >   (use "git add/rm <file>..." to update what will be committed)
> > > > >   (use "git checkout -- <file>..." to discard changes in working
> > > > directory)
> > > > >         deleted:    3rdparty/openmp
> > > > >
> > > > > no changes added to commit (use "git add" and/or "git commit -a")
> > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6_plat_omp
> > > > (upstream_master)+$
> > > > > ldd build/libmxnet.so | grep -i omp
> > > > >         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > > (0x00007f941241c000)
> > > > >
> > > > > (py3_venv) piotr@34-215-197-42:130:~/mxnet_1.6_plat_omp
> > > > (upstream_master)+$
> > > > > python
> > ~/deeplearning-benchmark/image_classification/infer_imagenet.py
> > > > > --use-rec --batch-size 256 --dtype float32 --num-data-workers 40
> > --mode
> > > > > hybrid --model resnet50_v2 --use-pretrained --kvstore local
> > > > --log-interval
> > > > > 1 --rec-val ~/data/val-passthrough.rec --rec-val-idx
> > > > > ~/data/val-passthrough.idx
> > > > > INFO:root:warmup_throughput: 147 samples/sec warmup_time 1.735117
> > > > > INFO:root:Batch [16]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 143 samples/sec warmup_time 1.785760
> > > > > INFO:root:Batch [17]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 148 samples/sec warmup_time 1.729033
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message