mxnet-dev mailing list archives

From "Lausen, Leonard" <lau...@amazon.com.INVALID>
Subject Re: Please remove conflicting Open MP version from CMake builds
Date Sat, 07 Dec 2019 16:32:52 GMT
Chris, if you can fix this in a small fraction of the time, please go ahead and
do so. Could you clarify why you think Intel's statement is nonsense or not
applicable? "Because different OpenMP runtimes may not be binary-compatible,
it's important to ensure that only one OpenMP runtime is used throughout the
application."
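
As a quick sanity check (a sketch only; the `ldd` output below is a
hypothetical sample, modeled on Pedro's logs later in this thread), one can
count how many OpenMP runtimes a binary pulls in:

```shell
# Count distinct OpenMP runtimes among a shared library's dependencies.
# On a real build you would instead run:
#   ldd build/libmxnet.so | grep -Ei 'omp'
# Here we parse a hypothetical sample of such output.
sample_ldd='libomp.so => /usr/local/lib/libomp.so (0x00007f...)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f...)'

# Match the common runtime names: libomp (LLVM), libiomp (Intel), libgomp (GNU).
count=$(printf '%s\n' "$sample_ldd" | grep -Eci 'lib(i?omp|gomp)')
echo "OpenMP runtimes linked: $count"
if [ "$count" -gt 1 ]; then
    echo "WARNING: more than one OpenMP runtime; behavior is undefined"
fi
```

A count above one is exactly the situation Intel's statement warns about.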

Why do we need to build LLVM OpenMP with GCC? If we want LLVM OpenMP, why not
just use the LLVM compiler? The LLVM OpenMP copy in the MXNet repo has not been
updated in two years. If we rely on the compiler's version, we don't have to
maintain OpenMP in our repo.
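
For reference, "rely on the compiler's version" is what stock CMake already
supports. A minimal sketch (the project and target names here are placeholders,
not MXNet's actual build files):

```cmake
# Minimal sketch: link the toolchain's own OpenMP runtime
# (libgomp for GCC, libomp for Clang) instead of a bundled copy.
# "mylib" and "example.c" are hypothetical placeholders.
cmake_minimum_required(VERSION 3.9)
project(omp_example C)

find_package(OpenMP REQUIRED)
add_library(mylib example.c)
target_link_libraries(mylib PRIVATE OpenMP::OpenMP_C)
```

With this, whichever compiler configures the build supplies the single OpenMP
runtime, so no copy needs to be maintained in 3rdparty.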

Thank you
Leonard

On Sat, 2019-12-07 at 07:03 -0800, Chris Olivier wrote:
> -1
> 
> mkldnn removed omp5 for licensing issues
> no bugs have actually been traced to the use of llvm openmp. only an assert
> caused by an actual bug in mxnet code. there are suitable workarounds.
> 
> over time llvm omp has simply been used as a “catch all” for random
> problems that aren’t related at all (such as getenv race condition in an
> atfork call that isn’t even part of an omp parallel region).
> 
> proposal is now and has always been roughly equivalent to the idea of
> “comment out an assert rather than fix the bug it’s reporting”.
> 
> Up until very recently, the Makefile version of mxnet used libomp5 for YEARS
> and not libgomp, with no issue reported (omp not built in debug mode), so
> the equivalent configuration from CMake mysteriously causing myriads of
> problems has questionable merit and smells more like a hubris situation.
> 
> I use tensorflow as well and it links to libomp5 rather than libgomp.
> 
> if the assert problem is really a problem, the bug being reported would be
> prioritized and fixed. it should be fixed regardless. all the time spent by
> some CI people trying to remove this could have simply fixed the actual bug
> in a small fraction of the time.
> 
> 
> On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard <lausen@amazon.com.invalid>
> wrote:
> 
> > I think it's reasonable to assume that the Intel MKLDNN team is an
> > "authoritative" source on compilation with OpenMP and OpenMP runtime
> > library issues. Thus I suggest we follow the recommendation of the Intel
> > MKLDNN team within the MXNet project.
> > 
> > Looking through the Intel MKLDNN documentation, I find [1]:
> > 
> > > DNNL uses OpenMP runtime library provided by the compiler.
> > 
> > as well as
> > 
> > > it's important to ensure that only one OpenMP runtime is used throughout
> > > the application. Having more than one OpenMP runtime linked to an
> > > executable may lead to undefined behavior including incorrect results or
> > > crashes.
> > 
> > To keep our project maintainable and error free, I thus suggest we follow
> > DNNL and use the OpenMP runtime library provided by the compiler.
> > We have limited resources, and finding the root cause of any bugs
> > resulting from linking multiple OpenMP libraries as currently done is, in
> > my opinion, not a good use of time. We know it's due to undefined behavior
> > and we know it's best practice to use the OpenMP runtime library provided
> > by the compiler. So let's just do that.
> > 
> > I think given that MKL-DNN has also adopted the "OpenMP runtime library
> > provided by the compiler" approach, this issue is not contentious anymore
> > and qualifies for lazy consensus.
> > 
> > Thus if there is no objection within 72 hours (lazy consensus), let's drop
> > the bundled LLVM OpenMP from master [2]. If we find any issues due to
> > dropping the bundled LLVM OpenMP, we can always add it back prior to the
> > next release.
> > 
> > Best regards
> > Leonard
> > 
> > [1]:
> > https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
> > (This is the updated reference from Anton's previous comment, based on the
> > changes made in MKLDNN in the meantime:
> > https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066
> > )
> > [2]: Similar to https://github.com/apache/incubator-mxnet/pull/12160
> > 
> > 
> > On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
> > > I will try to stay on the sidelines for now, since previous conversations
> > > about OMP have not been productive here and I have spent way too much
> > > time on this already. I'm not the first one to give up on trying to help
> > > with this topic.
> > > 
> > > I would be glad if you guys can work together and find a solution. I will
> > > just put my understanding of the big picture hoping that it helps move it
> > > forward.
> > > 
> > > 
> > > Recently the Intel OMP library, which seemed to have the best
> > > performance of the three, was removed from MKL.
> > > 
> > > - There are 3 libraries in play: GNU OpenMP, which is shipped with gcc
> > > (gomp); LLVM OpenMP in 3rdparty (llvm-omp); and Intel OMP, used with
> > > MKL and recently removed (iomp).
> > > 
> > > - IOMP seems to have the best performance; there are stability issues
> > > producing crashes sometimes, but the impact seems relatively small for
> > > users and developers. In general, linking with a different OMP version
> > > than the one shipped with the compiler is known to cause stability
> > > issues, but it's done anyway.
> > > 
> > > - LLVM-OMP is used when building with CMake, but not in the PIP releases
> > > or when building with Make. It has stability issues: it hangs during test
> > > execution in debug mode and produces many assertion failures in debug
> > > mode. It might bring some small performance gains, but there is no
> > > clear-cut data showcasing significant performance gains.
> > > 
> > > - GOMP is the version shipped with GCC and in the PIP wheels without
> > > MKL; it has no stability problems.
> > > 
> > > As a ballpark, IOMP might give 10% performance improvement in some cases.
> > > 
> > > We need to document well how users should tune and configure MXNet when
> > > using OMP.
> > > 
> > > As a developer, the safest bet is to use GOMP, to be able to debug and
> > > develop without issues. As a user of CPU inference / training, you want
> > > to run MKL, so it depends on how the Intel guys want to do things. My
> > > preference as an engineer is always stability > speed.
> > > 
> > > Related tickets:
> > > 
> > > https://github.com/apache/incubator-mxnet/issues/16891
> > > 
> > > 
> > > https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-562637931
> > > 
> > > https://github.com/apache/incubator-mxnet/issues/11417
> > > 
> > > https://github.com/apache/incubator-mxnet/issues/15690
> > > 
> > > 
> > > 
> > > On Fri, Dec 6, 2019 at 12:39 AM Lausen, Leonard <lausen@amazon.com.invalid>
> > > wrote:
> > > 
> > > > Is this related to https://github.com/apache/incubator-mxnet/issues/10856?
> > > > I unlocked that Github issue based on the Apache Code of Conduct
> > > > https://www.apache.org/foundation/policies/conduct#specific-guidelines
> > > > 
> > > > 
> > > > On Sat, 2019-11-30 at 02:47 -0800, Pedro Larroy wrote:
> > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6 (upstream_master)+$ ldd
> > > > > build/libmxnet.so | grep -i openmp
> > > > >         libomp.so =>
> > > > > /home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so
> > > > > (0x00007fde0991d000)
> > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6 (upstream_master)+$
> > > > > python ~/deeplearning-benchmark/image_classification/infer_imagenet.py
> > > > > --use-rec --batch-size 256 --dtype float32 --num-data-workers 40
> > > > > --mode hybrid --model resnet50_v2 --use-pretrained --kvstore local
> > > > > --log-interval 1 --rec-val ~/data/val-passthrough.rec --rec-val-idx
> > > > > ~/data/val-passthrough.idx
> > > > > INFO:root:Namespace(batch_norm=False, batch_size=256,
> > > > > data_dir='~/.mxnet/datasets/imagenet', dataset_size=32,
> > > > > dtype='float32', kvstore='local', last_gamma=False, log_interval=1,
> > > > > logging_dir='logs', lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60',
> > > > > lr_mode='step', lr_poly_power=2, mode='hybrid', model='resnet50_v2',
> > > > > momentum=0.9, num_epochs=3, num_gpus=0, num_workers=40,
> > > > > rec_val='/home/piotr/data/val-passthrough.rec',
> > > > > rec_val_idx='/home/piotr/data/val-passthrough.idx', save_dir='params',
> > > > > save_frequency=0, top_k=0, use_pretrained=True, use_rec=True,
> > > > > use_se=False, warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
> > > > > [10:42:02] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2:
> > > > > /home/piotr/data/val-passthrough.rec, use 36 threads for decoding..
> > > > > INFO:root:Batch [0]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 5 samples/sec warmup_time 43.150922
> > > > > INFO:root:Batch [1]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 6 samples/sec warmup_time 37.971927
> > > > > INFO:root:Batch [2]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 7 samples/sec warmup_time 35.755363
> > > > > 
> > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp (upstream_master)+$
> > > > > git st
> > > > > On branch upstream_master
> > > > > Your branch is up to date with 'origin/upstream_master'.
> > > > > 
> > > > > Changes not staged for commit:
> > > > >   (use "git add/rm <file>..." to update what will be committed)
> > > > >   (use "git checkout -- <file>..." to discard changes in working
> > > > directory)
> > > > >         deleted:    3rdparty/openmp
> > > > > 
> > > > > no changes added to commit (use "git add" and/or "git commit -a")
> > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6_plat_omp (upstream_master)+$
> > > > > ldd build/libmxnet.so | grep -i omp
> > > > >         libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > > (0x00007f941241c000)
> > > > > 
> > > > > (py3_venv) piotr@34-215-197-42:130:~/mxnet_1.6_plat_omp (upstream_master)+$
> > > > > python ~/deeplearning-benchmark/image_classification/infer_imagenet.py
> > > > > --use-rec --batch-size 256 --dtype float32 --num-data-workers 40
> > > > > --mode hybrid --model resnet50_v2 --use-pretrained --kvstore local
> > > > > --log-interval 1 --rec-val ~/data/val-passthrough.rec --rec-val-idx
> > > > > ~/data/val-passthrough.idx
> > > > > INFO:root:warmup_throughput: 147 samples/sec warmup_time 1.735117
> > > > > INFO:root:Batch [16]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 143 samples/sec warmup_time 1.785760
> > > > > INFO:root:Batch [17]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 148 samples/sec warmup_time 1.729033