From: Pedro Larroy
Date: Sat, 7 Dec 2019 15:40:30 -0800
Subject: Re: Please remove conflicting Open MP version from CMake builds
To: dev@mxnet.incubator.apache.org

Stop disseminating false information:
https://github.com/apache/incubator-mxnet/issues/14979

On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier wrote:

> -1
>
> mkldnn removed omp5 for licensing issues.
> No bugs have actually been traced to the use of LLVM OpenMP, only an
> assert caused by an actual bug in mxnet code. There are suitable
> workarounds.
>
> Over time LLVM OpenMP has simply been used as a "catch all" for random
> problems that aren't related to it at all (such as a getenv race
> condition in an atfork call that isn't even part of an OpenMP parallel
> region).
>
> The proposal is now, and has always been, roughly equivalent to the
> idea of "comment out an assert rather than fix the bug it's reporting".
>
> Up until very recently, the Makefile version of mxnet used libomp5 for
> YEARS, not libgomp, with no issue reported (OpenMP is not built in
> debug mode), so the claim that the equivalent configuration from CMake
> mysteriously causes myriads of problems has questionable merit and
> smells more like a hubris situation.
>
> I use tensorflow as well and it links to libomp5 rather than libgomp.
>
> If the assert problem is really a problem, the bug being reported would
> be prioritized and fixed; it should be fixed regardless. All the time
> spent by some CI people trying to remove this could have simply fixed
> the actual bug in a small fraction of the time.
>
>
> On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard wrote:
>
> > I think it's reasonable to assume that the Intel MKLDNN team is an
> > "authoritative" source on compilation with OpenMP and OpenMP runtime
> > library related issues. Thus I suggest we follow the recommendation
> > of the Intel MKLDNN team within the MXNet project.
> >
> > Looking through the Intel MKLDNN documentation, I find [1]:
> >
> > > DNNL uses the OpenMP runtime library provided by the compiler.
> >
> > as well as
> >
> > > it's important to ensure that only one OpenMP runtime is used
> > > throughout the application. Having more than one OpenMP runtime
> > > linked to an executable may lead to undefined behavior including
> > > incorrect results or crashes.
> >
> > To keep our project maintainable and error free, I thus suggest we
> > follow DNNL and use the OpenMP runtime library provided by the
> > compiler. We have limited resources, and finding the root cause of
> > any bugs resulting from linking multiple OpenMP libraries, as
> > currently done, is in my opinion not a good use of time. We know it's
> > due to undefined behavior, and we know it's best practice to use the
> > OpenMP runtime library provided by the compiler. So let's just do
> > that.
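
For concreteness, "use the OpenMP runtime library provided by the compiler"
in a CMake build boils down to something like the minimal sketch below. This
is illustrative only, not MXNet's actual CMakeLists.txt; the project name,
the "demo" target, and demo.cc are placeholders.

    # Minimal sketch, assuming CMake >= 3.9 (which provides the
    # OpenMP::OpenMP_CXX imported target). Names are placeholders.
    cmake_minimum_required(VERSION 3.9)
    project(omp_demo CXX)

    # Picks whatever runtime the chosen compiler ships:
    # libgomp for gcc, LLVM libomp for clang, libiomp5 for icc.
    find_package(OpenMP REQUIRED)

    add_library(demo SHARED demo.cc)
    # The imported target adds the right -fopenmp flag and runtime library,
    # so no second runtime needs to be vendored or linked by hand.
    target_link_libraries(demo PRIVATE OpenMP::OpenMP_CXX)

Which runtime ends up in the resulting library is then decided entirely by
the compiler, never by a bundled copy.
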
> >
> > I think given that MKL-DNN has also adopted the "OpenMP runtime
> > library provided by the compiler" approach, this issue is not
> > contentious anymore and qualifies for lazy consensus.
> >
> > Thus, if there is no objection within 72 hours (lazy consensus),
> > let's drop the bundled LLVM OpenMP from master [2]. If we find any
> > issues due to dropping the bundled LLVM OpenMP, we can always add it
> > back prior to the next release.
> >
> > Best regards
> > Leonard
> >
> > [1]:
> > https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
> > (This is the updated reference from Anton's previous comment, based
> > on the changes done in MKLDNN in the meantime:
> > https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066 )
> > [2]: Similar to https://github.com/apache/incubator-mxnet/pull/12160
> >
> >
> > On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
> > > I will try to stay on the sidelines for now, since previous
> > > conversations about OMP have not been productive here and I have
> > > spent way too much time on this already. I'm not the first one
> > > giving up on trying to help with this topic.
> > >
> > > I would be glad if you guys can work together and find a solution.
> > > I will just lay out my understanding of the big picture, hoping
> > > that it helps move it forward.
> > >
> > >
> > > Recently the Intel OMP library, which seemed to have the best
> > > performance of the three, was removed from MKL.
> > >
> > > - There are three libraries in play: GNU OpenMP, shipped with gcc
> > > (gomp); LLVM OpenMP in 3rdparty (llvm-omp); and Intel OMP when
> > > using MKL, which was recently removed (iomp).
> > >
> > > - IOMP seems to have the best performance. There are stability
> > > issues that sometimes produce crashes, but the impact seems
> > > relatively small for users and developers. In general, linking
> > > against a different OMP version than the one shipped with the
> > > compiler is known to cause stability issues, but it's done anyway.
> > >
> > > - LLVM-OMP is used when building with CMake; it is not used in the
> > > PIP releases or when building with Make. It has stability issues:
> > > it hangs during test execution in debug mode and produces tons of
> > > assertions in debug mode. It might bring some small performance
> > > gains, but there is no clear-cut data showcasing significant
> > > performance gains.
> > >
> > > - GOMP is the version shipped with GCC and used in the PIP wheels
> > > without MKL; it has no stability problems.
> > >
> > > As a ballpark, IOMP might give a 10% performance improvement in
> > > some cases.
> > >
> > > We need to document well how users should tune and configure MXNet
> > > when using OMP.
> > >
> > > As a developer, the safest bet is to use GOMP to be able to debug
> > > and develop without issues. As a user of CPU inference / training
> > > you want to run MKL, so that depends on how the Intel guys want to
> > > do things. My preference as an engineer is always stability > speed.
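
The CMake / Make / pip split described above comes down to whether the build
compiles and links the vendored runtime or defers to the toolchain. Roughly,
and only as a sketch (the option name and the "demo" target are invented; the
name of the library target exported by 3rdparty/openmp is assumed to be
"omp"; MXNet's real CMakeLists.txt differs in detail):

    # Sketch only, assuming the 3rdparty/openmp submodule layout.
    cmake_minimum_required(VERSION 3.9)
    project(omp_demo CXX)

    option(DEMO_BUNDLED_LLVM_OPENMP "Build and link 3rdparty/openmp" ON)

    add_library(demo SHARED demo.cc)

    if(DEMO_BUNDLED_LLVM_OPENMP AND EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/3rdparty/openmp/CMakeLists.txt")
      # CMake-build style: compile the vendored LLVM runtime and link it
      # explicitly (this is the .../3rdparty/openmp/runtime/src/libomp.so
      # that shows up in the ldd output further down).
      add_subdirectory(3rdparty/openmp)          # assumed to define 'omp'
      target_compile_options(demo PRIVATE -fopenmp)
      target_link_libraries(demo PRIVATE omp)
    else()
      # Make / pip-wheel style: rely on the compiler's own runtime
      # (libgomp when building with gcc).
      find_package(OpenMP REQUIRED)
      target_link_libraries(demo PRIVATE OpenMP::OpenMP_CXX)
    endif()

In the first branch, any dependency built against a different runtime can
bring a second OpenMP library into the same process, which is the
multi-runtime situation the DNNL documentation warns about.
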
> > >
> > > Related tickets:
> > >
> > > https://github.com/apache/incubator-mxnet/issues/16891
> > >
> > > https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-562637931
> > >
> > > https://github.com/apache/incubator-mxnet/issues/11417
> > >
> > > https://github.com/apache/incubator-mxnet/issues/15690
> > >
> > >
> > >
> > > On Fri, Dec 6, 2019 at 12:39 AM Lausen, Leonard wrote:
> > >
> > > > Is this related to
> > > > https://github.com/apache/incubator-mxnet/issues/10856?
> > > >
> > > > I unlocked that Github issue based on the Apache Code of Conduct
> > > > https://www.apache.org/foundation/policies/conduct#specific-guidelines
> > > >
> > > >
> > > > On Sat, 2019-11-30 at 02:47 -0800, Pedro Larroy wrote:
> > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6 (upstream_master)+$
> > > > > ldd build/libmxnet.so | grep -i openmp
> > > > > libomp.so => /home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so
> > > > > (0x00007fde0991d000)
> > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6 (upstream_master)+$
> > > > > python ~/deeplearning-benchmark/image_classification/infer_imagenet.py
> > > > > --use-rec --batch-size 256 --dtype float32 --num-data-workers 40
> > > > > --mode hybrid --model resnet50_v2 --use-pretrained --kvstore local
> > > > > --log-interval 1 --rec-val ~/data/val-passthrough.rec
> > > > > --rec-val-idx ~/data/val-passthrough.idx
> > > > > INFO:root:Namespace(batch_norm=False, batch_size=256,
> > > > > data_dir='~/.mxnet/datasets/imagenet', dataset_size=32, dtype='float32',
> > > > > kvstore='local', last_gamma=False, log_interval=1, logging_dir='logs',
> > > > > lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60', lr_mode='step',
> > > > > lr_poly_power=2, mode='hybrid', model='resnet50_v2', momentum=0.9,
> > > > > num_epochs=3, num_gpus=0, num_workers=40,
> > > > > rec_val='/home/piotr/data/val-passthrough.rec',
> > > > > rec_val_idx='/home/piotr/data/val-passthrough.idx', save_dir='params',
> > > > > save_frequency=0, top_k=0, use_pretrained=True, use_rec=True,
> > > > > use_se=False, warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
> > > > > [10:42:02] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2:
> > > > > /home/piotr/data/val-passthrough.rec, use 36 threads for decoding..
> > > > > INFO:root:Batch [0]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 5 samples/sec warmup_time 43.150922
> > > > > INFO:root:Batch [1]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 6 samples/sec warmup_time 37.971927
> > > > > INFO:root:Batch [2]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 7 samples/sec warmup_time 35.755363
> > > > >
> > > > >
> > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp (upstream_master)+$
> > > > > git st
> > > > > On branch upstream_master
> > > > > Your branch is up to date with 'origin/upstream_master'.
> > > > >
> > > > > Changes not staged for commit:
> > > > >   (use "git add/rm ..." to update what will be committed)
> > > > >   (use "git checkout -- ..."
> > > > >    to discard changes in working directory)
> > > > >
> > > > >   deleted: 3rdparty/openmp
> > > > >
> > > > > no changes added to commit (use "git add" and/or "git commit -a")
> > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6_plat_omp (upstream_master)+$
> > > > > ldd build/libmxnet.so | grep -i omp
> > > > > libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > > (0x00007f941241c000)
> > > > >
> > > > > (py3_venv) piotr@34-215-197-42:130:~/mxnet_1.6_plat_omp (upstream_master)+$
> > > > > python ~/deeplearning-benchmark/image_classification/infer_imagenet.py
> > > > > --use-rec --batch-size 256 --dtype float32 --num-data-workers 40
> > > > > --mode hybrid --model resnet50_v2 --use-pretrained --kvstore local
> > > > > --log-interval 1 --rec-val ~/data/val-passthrough.rec
> > > > > --rec-val-idx ~/data/val-passthrough.idx
> > > > > INFO:root:warmup_throughput: 147 samples/sec warmup_time 1.735117
> > > > > INFO:root:Batch [16]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 143 samples/sec warmup_time 1.785760
> > > > > INFO:root:Batch [17]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 148 samples/sec warmup_time 1.729033