From dev-return-4923-archive-asf-public=cust-asf.ponee.io@mxnet.incubator.apache.org  Thu Nov 22 16:23:20 2018
Return-Path: <dev-return-4923-archive-asf-public=cust-asf.ponee.io@mxnet.incubator.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by mx-eu-01.ponee.io (Postfix) with SMTP id 1D56E180645
	for <archive-asf-public@cust-asf.ponee.io>; Thu, 22 Nov 2018 16:23:19 +0100 (CET)
Received: (qmail 32164 invoked by uid 500); 22 Nov 2018 15:23:19 -0000
Mailing-List: contact dev-help@mxnet.incubator.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:dev-help@mxnet.incubator.apache.org>
List-Unsubscribe: <mailto:dev-unsubscribe@mxnet.incubator.apache.org>
List-Post: <mailto:dev@mxnet.incubator.apache.org>
List-Id: <dev.mxnet.incubator.apache.org>
Reply-To: dev@mxnet.incubator.apache.org
Delivered-To: mailing list dev@mxnet.incubator.apache.org
Received: (qmail 32140 invoked by uid 99); 22 Nov 2018 15:23:18 -0000
Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142)
    by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 22 Nov 2018 15:23:18 +0000
Received: from localhost (localhost [127.0.0.1])
	by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 1AEDBC21FC
	for <dev@mxnet.apache.org>; Thu, 22 Nov 2018 15:23:18 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org
X-Spam-Flag: NO
X-Spam-Score: 0.337
X-Spam-Level:
X-Spam-Status: No, score=0.337 tagged_above=-999 required=6.31
	tests=[DKIMWL_WL_HIGH=-1.462, DKIM_SIGNED=0.1, DKIM_VALID=-0.1,
	DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=2,
	KAM_SHORT=0.001, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001,
	SPF_PASS=-0.001] autolearn=disabled
Authentication-Results: spamd4-us-west.apache.org (amavisd-new);
	dkim=pass (1024-bit key) header.d=airbnb.com
Received: from mx1-lw-us.apache.org ([10.40.0.8])
	by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024)
	with ESMTP id oRfQ906MjJcW for <dev@mxnet.apache.org>;
	Thu, 22 Nov 2018 15:23:16 +0000 (UTC)
Received: from mail-oi1-f194.google.com (mail-oi1-f194.google.com [209.85.167.194])
	by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id B905C61097
	for <dev@mxnet.incubator.apache.org>; Thu, 22 Nov 2018 15:23:15 +0000 (UTC)
Received: by mail-oi1-f194.google.com with SMTP id h25so7748161oig.1
        for <dev@mxnet.incubator.apache.org>; Thu, 22 Nov 2018 07:23:15 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to;
        bh=OSNYyTCBt0SKmHmn5l+o6M8vUDtKDDZ4fSh3j1zGx7E=;
        b=pa/m8QLZTkAFZo8QMA3b3LA3AaUmHpCPhaZD8VQZulgymfQB35RpmvedJHuD7JrKve
         9kPdJGXuFHyoImR/ym4FhBwUA72c3IUWxQgkMyYDF+V8gjxeMXeOKsCC2lVyEB//gOEC
         8egHypqpYG/T5GbC3TcLEJPH4kY7XkYQZOFKgU5+VAdUNOINmbPXxsakJjaa2xFI3x/v
         q2XuoNHqIfCgUR3REO7JhXcvwlYhDakev/aQPpUtxaTtfP6Cl1G/LZzbBbnztiZKkhxC
         7qwWkdWCeNWyetaIlx+5kU7TkCqYpaMefjOPhZj6a0j+JYeDH8kCZ6oX+upnmOnBOC4g
         8rpg==
X-Gm-Message-State: AGRZ1gIZZ1yRF7g5zebwXo73suVVL4pwiEgdJI2Nsi53LNUNZiu25sQH
	Mw2OoUa6JyBZzVcXVqMRyG5iMt1Aobq67I6gfpzFUaGi
X-Google-Smtp-Source: AJdET5ffwKNmcHEiGne6GJOV7ybtrNERu4FGcjwxbd39xQwlNRkriaND86kDyg6PwC0vcjPpMon8H783KjtsgqLllrs=
X-Received: by 2002:aca:8d1:: with SMTP id 200-v6mr6015662oii.125.1542900194559;
 Thu, 22 Nov 2018 07:23:14 -0800 (PST)
MIME-Version: 1.0
References: <CAEsJgS=bC1XtmGWORgBMcR5BVRQVRTgD1nGUFm20t8wZU0AcGg@mail.gmail.com>
 <CABgAAfdJcHd-TWUz-nnQ7VM-3j3x46XGQX_iB-+LAFb+vjxyuA@mail.gmail.com>
 <CABgAAffcpMsaCcxoqdb15faXT2ARo+2Yt12ZJ3uEHruH6Z-tfQ@mail.gmail.com> <CAEsJgSmKOBby9jb1Smg7bvXWv7a-w4Uy3j7ngTqRbKGD7nSpVQ@mail.gmail.com>
In-Reply-To: <CAEsJgSmKOBby9jb1Smg7bvXWv7a-w4Uy3j7ngTqRbKGD7nSpVQ@mail.gmail.com>
From: Alfredo Luque <alfredo.luque@airbnb.com.INVALID>
Date: Thu, 22 Nov 2018 10:23:03 -0500
Message-ID: <CAM2Q420GhsEy_KqotE6JKSzVYXZCiCD4Eo6Khut5Ex3+hNQ9hA@mail.gmail.com>
Subject: Re: [Discussion] Remove bundled llvm OpenMP
To: dev@mxnet.incubator.apache.org
Content-Type: multipart/alternative; boundary="000000000000057037057b4272a0"

--000000000000057037057b4272a0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

The proposal here is not to eliminate the use of OpenMP but rather to use
the compiler's OpenMP implementation rather than a bundled one. I've been
bitten by issues with having multiple linked OpenMP implementations before
in another library and it was extremely difficult to debug.


It seems to me that tackling the issue with the assert is an orthogonal
issue altogether.

--Alfredo Luque

Software Engineer
Airbnb
Machine Learning Infrastructure

On Thu, Nov 22, 2018 at 10:12 AM Anton Chernov <mechernov@gmail.com> wrote:

> Hi Chris,
>
> Thank you for your answer. If you have noticed the initial email comes fr=
om
> me, Anton Chernov (@lebeg on Github) and thus the proposal is not from an=
y
> 'Ci' team that you've mentioned, but from me personally.
>
> You are writing:
>
> > someone is doing something unhealthy when they fork ...
>
> I'm missing any context to understand what you mean.
>
> > we get a lot of performance gain from OMP ...
>
> There is no data that would prove this statement and therefore it is a
> random guess.
>
> > in many months, no investigation has occurred as to WHY the assertion i=
s
> failing.
>
> The investigation has concluded that this is happening due to undefined
> behaviour which is, in my opinion, a suffient answer that does not requir=
e
> to go any deeper.
>
> > the pr is vetoed until such a time that the actual root cause of the
> problem is known.
>
> And considering the statements above there is no valid reason to veto the
> PR.
>
>
> Best
> Anton
>
> =D1=87=D1=82, 22 =D0=BD=D0=BE=D1=8F=D0=B1. 2018 =D0=B3. =D0=B2 15:38, Chr=
is Olivier <cjolivier01@gmail.com>:
>
> > 3x less overhead*
> >
> > On Thu, Nov 22, 2018 at 6:25 AM Chris Olivier <cjolivier01@gmail.com>
> > wrote:
> >
> > > someone is doing something unhealthy when they fork, which is causing
> an
> > > assertion in the openmp library. the same assertion that would fire i=
n
> > mkl,
> > > which is linked to libiomp5 (exact same omp library). this is new
> > behavior
> > > and most likely due to an error or suboptimal approach in the forking
> > logic
> > > in mxnet.
> > >
> > > in order to circumvent the assert, the Ci team is proposing to remove
> the
> > > library completely which is equivalent to cutting off your leg to mak=
e
> > the
> > > pain from stubbing your toe go away.
> > >
> > > we get a lot of performance gain from OMP. is has about a 1/3 less
> > > overhead for entering omp regions and also supports omp regions after=
 a
> > > fork, which libgomp does not.
> > >
> > > in many months, no investigation has occurred as to WHY the assertion
> is
> > > failing.
> > >
> > > the pr is vetoed until such a time that the actual root cause of the
> > > problem is known.
> > >
> > >
> > > thanks,
> > >
> > > -Chris.
> > >
> > >
> > >
> > >
> > > On Thu, Nov 22, 2018 at 4:36 AM Anton Chernov <mechernov@gmail.com>
> > wrote:
> > >
> > >> Dear MXNet community,
> > >>
> > >> I would like to drive attention to an important issue that is presen=
t
> in
> > >> the MXNet CMake build: usage of bundled llvm OpenMP library.
> > >>
> > >> I have opened a PR to remove it:
> > >> https://github.com/apache/incubator-mxnet/pull/12160
> > >>
> > >> The issue was closed, but I am strong in my oppinion that it's the
> right
> > >> thing to do.
> > >>
> > >> *Background*
> > >> If you want to use OpenMP pragmas in your code for parallelization y=
ou
> > >> would supply a special flag to the compiler:
> > >>
> > >> - Clang / -fopenmp
> > >> https://openmp.llvm.org/
> > >>
> > >> - GCC / -fopenmp
> > >> https://gcc.gnu.org/onlinedocs/libgomp/Enabling-OpenMP.html
> > >>
> > >> - Intel / [Q]openmp
> > >>
> > >>
> >
> https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD8=
1C7008D1
> > >>
> > >> - Visual Studio: /openmp (Enable OpenMP 2.0 Support)
> > >> https://msdn.microsoft.com/en-us/library/tt15eb9t.aspx
> > >>
> > >> Each of the compilers would enable the '#pragma omp' directive durin=
g
> > >> C/C++
> > >> compilation and arrange for automatic linking of the OpenMP runtime
> > >> library
> > >> supplied by each complier separately.
> > >>
> > >> Thus, to use the advantages of an OpenMP implementation one has to
> > compile
> > >> the code with the corresponding compiler.
> > >>
> > >> Currently, in MXNet CMake build scripts a bundled version of llvm
> OpenMP
> > >> is
> > >> used ([1] and [2]) to replace the OpenMP library supplied by the
> > compiler.
> > >>
> > >> I will quote here the README from the MKL-DNN (Intel(R) Math Kernel
> > >> Library
> > >> for Deep Neural Networks):
> > >>
> > >> "Intel MKL-DNN uses OpenMP* for parallelism and requires an OpenMP
> > runtime
> > >> library to work. As different OpenMP runtimes may not be binary
> > compatible
> > >> it's important to ensure that only one OpenMP runtime is used
> throughout
> > >> the application. Having more than one OpenMP runtime initialized may
> > lead
> > >> to undefined behavior resulting in incorrect results or crashes." [3=
]
> > >>
> > >> And:
> > >>
> > >> "Using GNU compiler with -fopenmp and -liomp5 options will link the
> > >> application with both Intel and GNU OpenMP runtime libraries. This
> will
> > >> lead to undefined behavior of the application." [4]
> > >>
> > >> As can be seen from ldd for MXNet:
> > >>
> > >> $ ldd build/tests/mxnet_unit_tests | grep omp
> > >>     libomp.so =3D>
> /.../mxnet/build/3rdparty/openmp/runtime/src/libomp.so
> > >> (0x00007f697bc55000)
> > >>     libgomp.so.1 =3D> /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > >> (0x00007f69660cd000)
> > >>
> > >> *Performance*
> > >>
> > >> The only performance data related to OpenMP in MXNet I was able to
> find
> > is
> > >> here:
> > >>
> > >>
> >
> https://github.com/apache/incubator-mxnet/issues/9744#issuecomment-367711=
172
> > >>
> > >> Which in my understanding is testing imact of different environment
> > >> variables for the same setup (using same bundled OpenMP library).
> > >>
> > >> The libraries may differ in implementation and the Thread Affinity
> > >> Interface [5] may have significant impact on performance.
> > >>
> > >> All compliers support it:
> > >>
> > >> - Clang / KMP_AFFINITY
> > >>
> > >>
> >
> https://github.com/clang-ykt/openmp/blob/master/runtime/src/kmp_affinity.=
cpp
> > >>
> > >> - GCC / GOMP_CPU_AFFINITY
> > >>
> > >>
> >
> https://gcc.gnu.org/onlinedocs/gcc-4.7.1/libgomp/GOMP_005fCPU_005fAFFINIT=
Y.html
> > >>
> > >> - Intel / KMP_AFFINITY
> > >>
> > >>
> >
> https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD8=
1C7008D1
> > >>
> > >> - Visual Studio / SetThreadAffinityMask
> > >>
> > >>
> >
> https://docs.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-s=
etthreadaffinitymask
> > >>
> > >> *Issues*
> > >>
> > >> Failed OpenMP assertion when loading MXNet compiled with DEBUG=3D1
> > >> https://github.com/apache/incubator-mxnet/issues/10856
> > >>
> > >> libomp.so dependency (need REAL fix)
> > >> https://github.com/apache/incubator-mxnet/issues/11417
> > >>
> > >> mxnet-mkl (v0.12.0) crash when using (conda-installed) numpy with MK=
L
> > >> https://github.com/apache/incubator-mxnet/issues/8532
> > >>
> > >> Performance regression when OMP_NUM_THREADS environment variable is
> not
> > >> set
> > >> https://github.com/apache/incubator-mxnet/issues/9744
> > >>
> > >> Poor concat CPU performance on CUDA builds
> > >> https://github.com/apache/incubator-mxnet/issues/11905
> > >>
> > >> I would appreciate hearing your thoughts.
> > >>
> > >>
> > >> Best
> > >> Anton
> > >>
> > >> [1]
> > >>
> > >>
> >
> https://github.com/apache/incubator-mxnet/blob/master/CMakeLists.txt#L400=
-L405
> > >> [2] https://github.com/apache/incubator-mxnet/tree/master/3rdparty
> > >> [3] https://github.com/intel/mkl-dnn/blame/master/README.md#L261-L26=
5
> > >> [4] https://github.com/intel/mkl-dnn/blame/master/README.md#L278-L28=
0
> > >> [5] https://software.intel.com/en-us/node/522691
> > >>
> > >
> >
>

--000000000000057037057b4272a0--