mxnet-dev mailing list archives

From Junru Shao <junrushao1...@gmail.com>
Subject Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0
Date Tue, 18 Jun 2019 21:10:09 GMT
Dear community,

I am happy to share some results with regard to commit 83d2c2d0e (PR
#14192, link: https://github.com/apache/incubator-mxnet/pull/14192), which
Pedro identified as causing a regression.

First, using the exact model that Pedro provided, we did rigorous profiling
and found that PR #14192 slows it down by 7.26 ms (from 235.65 ms
to 242.91 ms).
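
The measurement behind figures like these can be sketched with a simple
latency harness (a minimal sketch; `run_inference` is a hypothetical
stand-in for the actual model forward pass, which is not public):

```python
import time
import statistics

def mean_latency_ms(run_inference, warmup=10, iters=100):
    """Call `run_inference` untimed `warmup` times, then return the mean
    wall-clock latency in milliseconds over `iters` timed calls."""
    for _ in range(warmup):
        run_inference()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(samples)

def regression_ms(baseline_ms, candidate_ms):
    """Absolute slowdown of the candidate relative to the baseline."""
    return candidate_ms - baseline_ms
```

With the numbers above, `regression_ms(235.65, 242.91)` gives the 7.26 ms
slowdown (about 3% of the baseline).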

Then, we submitted a follow-up PR #15262 (link:
https://github.com/apache/incubator-mxnet/pull/15262) to fix the
regression. By applying the patch to commit 83d2c2d0e, we verified that
we get comparable performance. Please refer to the PR if you are interested
in our experiment.

That is to say, the regression caused by commit 83d2c2d0e should now be
addressed. Please let me know if there are any further issues.

Thank you so much,
Junru

On Thu, Jun 13, 2019 at 3:05 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
wrote:

> I reached out to you in private; the model is not public. We should be able
> to see this problem in a public model using LSTM, I think.
>
>
> On Thu, Jun 13, 2019 at 11:15 AM Junru Shao <junrushao1994@gmail.com>
> wrote:
> >
> > Hi Pedro,
> >
> > Thanks for bringing this up!
> >
> > Could you provide your model so that we can dig into this?
> >
> > Thanks,
> > Junru
> >
> > On Thu, Jun 13, 2019 at 10:33 Pedro Larroy <pedro.larroy.lists@gmail.com>
> > wrote:
> >
> > > I have isolated some of the commits that are causing performance
> > > regressions in wavenet like models:
> > >
> > > Title: 83d2c2d0e:[MXNET-1324] Add NaiveRunGraph to imperative utils
> > > (#14192)
> > >
> > > Causes a regression that makes hybridize with static_alloc slower in
> > > GPU inference.
> > >
> > > [0f63659be5070af218095a6a460427d2a1b67aba] add a compiler flag to use
> > > int64 as tensor size (#14570)
> > >
> > > Causes overall regressions in CPU inference.
> > >
> > >
> > > Pedro.
> > >
> > > On Wed, Jun 12, 2019 at 11:52 AM Lai Wei <royweilai@gmail.com> wrote:
> > > >
> > > > Hi @dev,
> > > >
> > > > I am canceling the vote as the issue Lin discovered requires a fix [1]
> > > > and the solution is not ready yet.
> > > > It's a general problem when building MXNet from source, not only
> > > > impacting Horovod use cases. Any help is appreciated.
> > > >
> > > > Other issues we are tracking:
> > > > 1. Regression on hybridize with static_alloc. (not a blocker for now)
> > > > 2. Scala doc issue [2], already merged in master, needs to be backported
> > > > to 1.5.x
> > > >
> > > > Thanks for everyone's help! Please let us know if there are any other
> > > > issues with 1.5.0.
> > > >
> > > > [1] https://github.com/apache/incubator-mxnet/pull/15213
> > > > [2] https://github.com/apache/incubator-mxnet/pull/15216
> > > >
> > > >
> > > >
> > > > Best Regards
> > > >
> > > > Lai
> > > >
> > > >
> > > > On Tue, Jun 11, 2019 at 5:04 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > > wrote:
> > > >
> > > > > Tested with CPU: 2.6x slower, comparing master vs 1.4.1.
> > > > >
> > > > > Looks like a general regression.
> > > > >
> > > > >
> > > > > On Tue, Jun 11, 2019 at 2:31 PM Lai Wei <royweilai@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > Hi guys,
> > > > > >
> > > > > > Thanks for the updates. Currently, we are able to confirm Lin's
> > > > > > issue with Horovod, and there is a fix pending. [1]
> > > > > > Will update later today to see if we need to cancel this vote for
> > > > > > the fix.
> > > > > >
> > > > > > As for the hybridize with static_alloc performance regression: IMO
> > > > > > it does not need to be a blocker if we have the following speed order:
> > > > > > 1.5.0 w/o static > 1.5.0 w/ static > 1.4.1 w/ static > 1.4.1 w/o static
> > > > > > It would be great to know the following to better decide whether
> > > > > > this should block the release:
> > > > > > 1) if this is a model-specific or a general regression.
> > > > > > 2) if this is platform-specific or general (w/ or w/o CUDA, w/ or
> > > > > > w/o MKLDNN)
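
The "speed order" condition can be encoded as a quick check (a sketch; the
throughput numbers below are hypothetical, and higher means faster):

```python
def meets_speed_order(throughput):
    """Return True when the non-blocking condition holds:
    1.5.0 w/o static > 1.5.0 w/ static > 1.4.1 w/ static > 1.4.1 w/o static,
    where throughput is samples/sec (higher = faster)."""
    order = ["1.5.0 w/o static", "1.5.0 w/ static",
             "1.4.1 w/ static", "1.4.1 w/o static"]
    return all(throughput[a] > throughput[b] for a, b in zip(order, order[1:]))

# Hypothetical throughputs, purely illustrative:
ok = meets_speed_order({"1.5.0 w/o static": 120, "1.5.0 w/ static": 115,
                        "1.4.1 w/ static": 110, "1.4.1 w/o static": 100})
```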
> > > > > >
> > > > > >
> > > > > > [1]https://github.com/apache/incubator-mxnet/pull/15213
> > > > > >
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > Best Regards
> > > > > >
> > > > > > Lai
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 11, 2019 at 1:46 PM Zhi Zhang <zhreshold@apache.org>
> > > wrote:
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 2019/06/11 18:53:56, Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > > > > > wrote:
> > > > > > > > The stack trace doesn't seem to come from MXNet; do you have more
> > > > > > > > info?
> > > > > > > >
> > > > > > > > On Tue, Jun 11, 2019 at 11:46 AM Zhi Zhang <zhreshold@apache.org>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 2019/06/11 17:36:09, Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > A bit more background into this:
> > > > > > > > > >
> > > > > > > > > > While tuning a model using LSTM and convolutions, we found that
> > > > > > > > > > hybridize with static_alloc and static_shape is 15% slower in the
> > > > > > > > > > latest revision than in version 1.4.1, in which hybridize with
> > > > > > > > > > static_alloc and static_shape is 10% faster than without.
> > > > > > > > > >
> > > > > > > > > > Overall we are still 33% faster when comparing master to 1.5.
> > > > > > > > > >
> > > > > > > > > > Let me know if you think this is a release blocker or not.
> > > > > > > > > >
> > > > > > > > > > Pedro.
> > > > > > > > > >
> > > > > > > > > > On Mon, Jun 10, 2019 at 4:51 PM Pedro Larroy
> > > > > > > > > > <pedro.larroy.lists@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > -1
> > > > > > > > > > >
> > > > > > > > > > > We found a performance regression vs 1.4 related to CachedOp
> > > > > > > > > > > which affects Hybrid forward, which we are looking into.
> > > > > > > > > > >
> > > > > > > > > > > Pedro.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Jun 10, 2019 at 4:33 PM Lin Yuan <apeforest@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > -1 (Tentatively until resolved)
> > > > > > > > > > > >
> > > > > > > > > > > > I tried to build MXNet 1.5.0 from source and pip install
> > > > > > > > > > > > horovod but got the following error:
> > > > > > > > > > > >
> > > > > > > > > > > > Reproduce:
> > > > > > > > > > > > 1) cp make/config.mk .
> > > > > > > > > > > > 2) turn on USE_CUDA, USE_CUDNN,
USE_NCCL
> > > > > > > > > > > > 3) make -j
> > > > > > > > > > > >
> > > > > > > > > > > > MXNet can build successfully.
> > > > > > > > > > > >
> > > > > > > > > > > > 4) pip install horovod
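
Step 2) above corresponds to flipping these flags in the copied
`make/config.mk` (a sketch of the relevant fragment; only the three flags
named come from the report, everything else is left at its default):

```make
# make/config.mk (fragment) -- flags enabled for this build
USE_CUDA = 1
USE_CUDNN = 1
USE_NCCL = 1
```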
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > /home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
> > > > > > > > > > > > fatal error: mkldnn_version.h: No such file or directory
> > > > > > > > > > > >     compilation terminated.
> > > > > > > > > > > >     INFO: Unable to build MXNet plugin, will skip it.
> > > > > > > > > > > >
> > > > > > > > > > > > I did not change any setting of MKLDNN in my config.mk. I am
> > > > > > > > > > > > building on DLAMI base 18.0, which is Ubuntu 16.04 with CUDA 10.0.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > Lin
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <yajiedesign@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > +1
> > > > > > > > > > > > >
> > > > > > > > > > > > > Lai Wei <royweilai@gmail.com> wrote on Sun, Jun 9, 2019 at 4:12 AM:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Dear MXNet community,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This is the 3-day vote to release Apache MXNet (incubating)
> > > > > > > > > > > > > > version 1.5.0.
> > > > > > > > > > > > > > Voting on dev@ will start June 8, 23:59:59 (PST) and close
> > > > > > > > > > > > > > on June 11, 23:59:59.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1) Link to release notes:
> > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2) Link to release candidate:
> > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 3) Link to source and signatures on apache dist server:
> > > > > > > > > > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Please remember to TEST first before voting accordingly:
> > > > > > > > > > > > > > +1 = approve
> > > > > > > > > > > > > > +0 = no opinion
> > > > > > > > > > > > > > -1 = disapprove (provide reason)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best Regards
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Lai
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > -1. Built from source; importing mxnet in Python causes a segfault.
> > > > > > > > >
> > > > > > > > > back trace:
> > > > > > > > >
> > > > > > > > > Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
> > > > > > > > > 0x00007fff3e8a9f20 in ?? ()
> > > > > > > > > (gdb) bt
> > > > > > > > > #0  0x00007fff3e8a9f20 in ?? ()
> > > > > > > > > #1  0x00007fffebbf440c in ReadConfigFile(Configuration&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool const&, unsigned int const&) () from /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > > > > > > #2  0x00007fffebbf3d97 in ReadConfigDir(Configuration&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool const&, unsigned int const&) () from /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > > > > > > #3  0x00007fffebc5e9aa in pkgInitConfig(Configuration&) () from /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > > > > > > #4  0x00007ffff29d5c48 in ?? () from /usr/lib/python3/dist-packages/apt_pkg.cpython-35m-x86_64-linux-gnu.so
> > > > > > > > > #5  0x00000000004ea10f in PyCFunction_Call ()
> > > > > > > > > #6  0x0000000000536d94 in PyEval_EvalFrameEx ()
> > > > > > > > > #7  0x000000000053fc97 in ?? ()
> > > > > > > > > #8  0x00000000005409bf in PyEval_EvalCode ()
> > > > > > > > > #9  0x000000000054a328 in ?? ()
> > > > > > > > > #10 0x00000000004ea1c6 in PyCFunction_Call ()
> > > > > > > > > #11 0x000000000053d353 in PyEval_EvalFrameEx ()
> > > > > > > > > #12 0x000000000053fc97 in ?? ()
> > > > > > > > > #13 0x000000000053bc93 in PyEval_EvalFrameEx ()
> > > > > > > > > #14 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > > > > > > #15 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > > > > > > #16 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > > > > > > #17 0x0000000000540b0b in PyEval_EvalCodeEx ()
> > > > > > > > > #18 0x00000000004ec2e3 in ?? ()
> > > > > > > > > #19 0x00000000005c20e7 in PyObject_Call ()
> > > > > > > > >
> > > > > > > > > I was using a fresh DLAMI Ubuntu 18.0 with CUDA 10.0, built with
> > > > > > > > > USE_CUDA=1, USE_CUDNN=1; the rest are default values.
> > > > > > > > >
> > > > > > > > > -Zhi
> > > > > > > >
> > > > > > >
> > > > > > > Change to +1. I figured out that it was due to the dependencies. I
> > > > > > > still have an issue using the DL base AMI with python3, but I will
> > > > > > > not regard it as a blocker to the 1.5 release.
> > > > > > > Tested Gluon-CV training and it works fine.
> > > > > > >
> > > > > > > -Zhi
> > > > > > >
> > > > >
> > >
>
