From: Pedro Larroy
Date: Sat, 7 Dec 2019 15:40:30 -0800
Subject: Re: Please remove conflicting Open MP version from CMake builds
To: dev@mxnet.incubator.apache.org

Stop disseminating false information:
https://github.com/apache/incubator-mxnet/issues/14979

On Sat, Dec 7, 2019 at 7:04 AM Chris Olivier wrote:

> -1
>
> mkldnn removed omp5 for licensing issues.
> No bugs have actually been traced to the use of LLVM OpenMP, only an
> assert caused by an actual bug in mxnet code. There are suitable
> workarounds.
>
> Over time LLVM OpenMP has simply been used as a "catch all" for random
> problems that aren't related to it at all (such as a getenv race
> condition in an atfork call that isn't even part of an OpenMP parallel
> region).
>
> The proposal is now, and has always been, roughly equivalent to the
> idea of "comment out an assert rather than fix the bug it's reporting".
>
> Up until very recently, the Makefile version of mxnet used libomp5 for
> YEARS, not libgomp, with no issue reported (OpenMP is not built in
> debug mode), so the claim that the equivalent configuration from CMake
> mysteriously causes myriads of problems has questionable merit and
> smells more like a hubris situation.
>
> I use tensorflow as well and it links to libomp5 rather than libgomp.
>
> If the assert problem is really a problem, the bug being reported would
> be prioritized and fixed; it should be fixed regardless. All the time
> spent by some CI people trying to remove this could have simply fixed
> the actual bug in a small fraction of the time.
>
>
> On Fri, Dec 6, 2019 at 8:44 PM Lausen, Leonard wrote:
>
> > I think it's reasonable to assume that the Intel MKLDNN team is an
> > "authoritative" source on compilation with OpenMP and OpenMP runtime
> > library related issues. Thus I suggest we follow the recommendation
> > of the Intel MKLDNN team within the MXNet project.
> >
> > Looking through the Intel MKLDNN documentation, I find [1]:
> >
> > > DNNL uses the OpenMP runtime library provided by the compiler.
> >
> > as well as
> >
> > > it's important to ensure that only one OpenMP runtime is used
> > > throughout the application. Having more than one OpenMP runtime
> > > linked to an executable may lead to undefined behavior including
> > > incorrect results or crashes.
> >
> > To keep our project maintainable and error free, I thus suggest we
> > follow DNNL and use the OpenMP runtime library provided by the
> > compiler. We have limited resources, and finding the root cause of
> > any bugs resulting from linking multiple OpenMP libraries, as
> > currently done, is in my opinion not a good use of time. We know it's
> > due to undefined behavior, and we know it's best practice to use the
> > OpenMP runtime library provided by the compiler. So let's just do
> > that.
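
For concreteness, "use the OpenMP runtime library provided by the compiler"
in a CMake build boils down to something like the minimal sketch below. This
is illustrative only, not MXNet's actual CMakeLists.txt; the project name,
the "demo" target, and demo.cc are placeholders.

    # Minimal sketch, assuming CMake >= 3.9 (which provides the
    # OpenMP::OpenMP_CXX imported target). Names are placeholders.
    cmake_minimum_required(VERSION 3.9)
    project(omp_demo CXX)

    # Picks whatever runtime the chosen compiler ships:
    # libgomp for gcc, LLVM libomp for clang, libiomp5 for icc.
    find_package(OpenMP REQUIRED)

    add_library(demo SHARED demo.cc)
    # The imported target adds the right -fopenmp flag and runtime library,
    # so no second runtime needs to be vendored or linked by hand.
    target_link_libraries(demo PRIVATE OpenMP::OpenMP_CXX)

Which runtime ends up in the resulting library is then decided entirely by
the compiler, never by a bundled copy.
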
> >
> > I think given that MKL-DNN has also adopted the "OpenMP runtime
> > library provided by the compiler" approach, this issue is not
> > contentious anymore and qualifies for lazy consensus.
> >
> > Thus, if there is no objection within 72 hours (lazy consensus),
> > let's drop the bundled LLVM OpenMP from master [2]. If we find any
> > issues due to dropping the bundled LLVM OpenMP, we can always add it
> > back prior to the next release.
> >
> > Best regards
> > Leonard
> >
> > [1]:
> > https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp
> > (This is the updated reference from Anton's previous comment, based
> > on the changes done in MKLDNN in the meantime:
> > https://github.com/apache/incubator-mxnet/pull/12160#issuecomment-415078066 )
> > [2]: Similar to https://github.com/apache/incubator-mxnet/pull/12160
> >
> >
> > On Fri, 2019-12-06 at 12:16 -0800, Pedro Larroy wrote:
> > > I will try to stay on the sidelines for now, since previous
> > > conversations about OMP have not been productive here and I have
> > > spent way too much time on this already. I'm not the first one
> > > giving up on trying to help with this topic.
> > >
> > > I would be glad if you guys can work together and find a solution.
> > > I will just lay out my understanding of the big picture, hoping
> > > that it helps move it forward.
> > >
> > >
> > > Recently the Intel OMP library, which seemed to have the best
> > > performance of the three, was removed from MKL.
> > >
> > > - There are three libraries in play: GNU OpenMP, shipped with gcc
> > > (gomp); LLVM OpenMP in 3rdparty (llvm-omp); and Intel OMP when
> > > using MKL, which was recently removed (iomp).
> > >
> > > - IOMP seems to have the best performance. There are stability
> > > issues that sometimes produce crashes, but the impact seems
> > > relatively small for users and developers. In general, linking
> > > against a different OMP version than the one shipped with the
> > > compiler is known to cause stability issues, but it's done anyway.
> > >
> > > - LLVM-OMP is used when building with CMake; it is not used in the
> > > PIP releases or when building with Make. It has stability issues:
> > > it hangs during test execution in debug mode and produces tons of
> > > assertions in debug mode. It might bring some small performance
> > > gains, but there is no clear-cut data showcasing significant
> > > performance gains.
> > >
> > > - GOMP is the version shipped with GCC and used in the PIP wheels
> > > without MKL; it has no stability problems.
> > >
> > > As a ballpark, IOMP might give a 10% performance improvement in
> > > some cases.
> > >
> > > We need to document well how users should tune and configure MXNet
> > > when using OMP.
> > >
> > > As a developer, the safest bet is to use GOMP to be able to debug
> > > and develop without issues. As a user of CPU inference / training
> > > you want to run MKL, so that depends on how the Intel guys want to
> > > do things. My preference as an engineer is always stability > speed.
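
The CMake / Make / pip split described above comes down to whether the build
compiles and links the vendored runtime or defers to the toolchain. Roughly,
and only as a sketch (the option name and the "demo" target are invented; the
name of the library target exported by 3rdparty/openmp is assumed to be
"omp"; MXNet's real CMakeLists.txt differs in detail):

    # Sketch only, assuming the 3rdparty/openmp submodule layout.
    cmake_minimum_required(VERSION 3.9)
    project(omp_demo CXX)

    option(DEMO_BUNDLED_LLVM_OPENMP "Build and link 3rdparty/openmp" ON)

    add_library(demo SHARED demo.cc)

    if(DEMO_BUNDLED_LLVM_OPENMP AND EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/3rdparty/openmp/CMakeLists.txt")
      # CMake-build style: compile the vendored LLVM runtime and link it
      # explicitly (this is the .../3rdparty/openmp/runtime/src/libomp.so
      # that shows up in the ldd output further down).
      add_subdirectory(3rdparty/openmp)          # assumed to define 'omp'
      target_compile_options(demo PRIVATE -fopenmp)
      target_link_libraries(demo PRIVATE omp)
    else()
      # Make / pip-wheel style: rely on the compiler's own runtime
      # (libgomp when building with gcc).
      find_package(OpenMP REQUIRED)
      target_link_libraries(demo PRIVATE OpenMP::OpenMP_CXX)
    endif()

In the first branch, any dependency built against a different runtime can
bring a second OpenMP library into the same process, which is the
multi-runtime situation the DNNL documentation warns about.
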
> > >
> > > Related tickets:
> > >
> > > https://github.com/apache/incubator-mxnet/issues/16891
> > >
> > > https://github.com/apache/incubator-mxnet/issues/10856#issuecomment-562637931
> > >
> > > https://github.com/apache/incubator-mxnet/issues/11417
> > >
> > > https://github.com/apache/incubator-mxnet/issues/15690
> > >
> > >
> > >
> > > On Fri, Dec 6, 2019 at 12:39 AM Lausen, Leonard wrote:
> > >
> > > > Is this related to
> > > > https://github.com/apache/incubator-mxnet/issues/10856?
> > > >
> > > > I unlocked that Github issue based on the Apache Code of Conduct
> > > > https://www.apache.org/foundation/policies/conduct#specific-guidelines
> > > >
> > > >
> > > > On Sat, 2019-11-30 at 02:47 -0800, Pedro Larroy wrote:
> > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6 (upstream_master)+$
> > > > > ldd build/libmxnet.so | grep -i openmp
> > > > > libomp.so => /home/piotr/mxnet_1.6/build/3rdparty/openmp/runtime/src/libomp.so
> > > > > (0x00007fde0991d000)
> > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6 (upstream_master)+$
> > > > > python ~/deeplearning-benchmark/image_classification/infer_imagenet.py
> > > > > --use-rec --batch-size 256 --dtype float32 --num-data-workers 40
> > > > > --mode hybrid --model resnet50_v2 --use-pretrained --kvstore local
> > > > > --log-interval 1 --rec-val ~/data/val-passthrough.rec
> > > > > --rec-val-idx ~/data/val-passthrough.idx
> > > > > INFO:root:Namespace(batch_norm=False, batch_size=256,
> > > > > data_dir='~/.mxnet/datasets/imagenet', dataset_size=32, dtype='float32',
> > > > > kvstore='local', last_gamma=False, log_interval=1, logging_dir='logs',
> > > > > lr=0.1, lr_decay=0.1, lr_decay_epoch='40,60', lr_mode='step',
> > > > > lr_poly_power=2, mode='hybrid', model='resnet50_v2', momentum=0.9,
> > > > > num_epochs=3, num_gpus=0, num_workers=40,
> > > > > rec_val='/home/piotr/data/val-passthrough.rec',
> > > > > rec_val_idx='/home/piotr/data/val-passthrough.idx', save_dir='params',
> > > > > save_frequency=0, top_k=0, use_pretrained=True, use_rec=True,
> > > > > use_se=False, warmup_epochs=0, warmup_lr=0.0, wd=0.0001)
> > > > > [10:42:02] ../src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2:
> > > > > /home/piotr/data/val-passthrough.rec, use 36 threads for decoding..
> > > > > INFO:root:Batch [0]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 5 samples/sec warmup_time 43.150922
> > > > > INFO:root:Batch [1]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 6 samples/sec warmup_time 37.971927
> > > > > INFO:root:Batch [2]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 7 samples/sec warmup_time 35.755363
> > > > >
> > > > >
> > > > > (py3_venv) piotr@34-215-197-42:0:~/mxnet_1.6_plat_omp (upstream_master)+$
> > > > > git st
> > > > > On branch upstream_master
> > > > > Your branch is up to date with 'origin/upstream_master'.
> > > > >
> > > > > Changes not staged for commit:
> > > > >   (use "git add/rm ..." to update what will be committed)
> > > > >   (use "git checkout -- ..."
> > > > >    to discard changes in working directory)
> > > > >
> > > > >   deleted: 3rdparty/openmp
> > > > >
> > > > > no changes added to commit (use "git add" and/or "git commit -a")
> > > > > (py3_venv) piotr@34-215-197-42:1:~/mxnet_1.6_plat_omp (upstream_master)+$
> > > > > ldd build/libmxnet.so | grep -i omp
> > > > > libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > > > (0x00007f941241c000)
> > > > >
> > > > > (py3_venv) piotr@34-215-197-42:130:~/mxnet_1.6_plat_omp (upstream_master)+$
> > > > > python ~/deeplearning-benchmark/image_classification/infer_imagenet.py
> > > > > --use-rec --batch-size 256 --dtype float32 --num-data-workers 40
> > > > > --mode hybrid --model resnet50_v2 --use-pretrained --kvstore local
> > > > > --log-interval 1 --rec-val ~/data/val-passthrough.rec
> > > > > --rec-val-idx ~/data/val-passthrough.idx
> > > > > INFO:root:warmup_throughput: 147 samples/sec warmup_time 1.735117
> > > > > INFO:root:Batch [16]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 143 samples/sec warmup_time 1.785760
> > > > > INFO:root:Batch [17]
> > > > > INFO:root:Top 1 accuracy: 0
> > > > > INFO:root:warmup_throughput: 148 samples/sec warmup_time 1.729033