mxnet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anirudh <anirudh2...@gmail.com>
Subject Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2
Date Fri, 04 May 2018 01:10:23 GMT
Hi all,

I have added the CMake issue with USE_MKLDNN=1 to known issues in release
notes: https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2

Anirudh

On Thu, May 3, 2018 at 2:32 PM, Naveen Swamy <mnnaveen@gmail.com> wrote:

> + 0
>
> On Thu, May 3, 2018 at 12:44 PM, Anirudh <anirudh2290@gmail.com> wrote:
>
> > Hi Naveen,
> >
> > You raise a good point and I agree that by default MKLDNN by default
> should
> > be switched off.
> > Because of a bug in Cmakelists.txt which has been fixed as part of
> #10731,
> > which is merged to master (but not on the release branch),
> > the users won't have MKLDNN enabled even though MKLDNN is set to ON by
> > default.
> > Since cmake install instructions have not been published in mxnet.io and
> > is
> > lesser used compared to pip package and make from what I have seen,
> >  would it be acceptable for you if this is added as a known issue and a
> > workaround is provided in the release notes ?
> > The impacted users(who will have to use the workaround) here would be the
> > customers who are interested in MKLDNN and use cmake, users who aren't
> > interested in using MKLDNN feature won't be impacted.
> >
> > Anirudh
> >
> > On Thu, May 3, 2018 at 12:16 PM, Marco de Abreu <
> > marco.g.abreu@googlemail.com> wrote:
> >
> > > The MKLDNN tests are not really less stable than the other tests. It's
> > > pretty much the same across all tests we have. So I wouldn't say
> there's
> > a
> > > need to fix them in a separate branch.
> > >
> > > On Thu, May 3, 2018 at 9:00 PM, Naveen Swamy <mnnaveen@gmail.com>
> wrote:
> > >
> > > > I also meant(but forgot to send), we stabilize it on a separate
> branch
> > > and
> > > > then bring in the changes instead of blocking the PRs.
> > > >
> > > > On Thu, May 3, 2018 at 11:57 AM, Marco de Abreu <
> > > > marco.g.abreu@googlemail.com> wrote:
> > > >
> > > > > I think the failing tests are really getting an issue. We now got
> > > roughly
> > > > > 50 test failure related issues [1], leading to a average failure
> rate
> > > of
> > > > > 50%. Considering the costs in terms of money and time per run, this
> > is
> > > > > adding up quite significantly.
> > > > >
> > > > > Didn't we just remove MKLML from our codebase to replace it with
> > > MKLDNN?
> > > > I
> > > > > think removing something and marking the replacement as
> experimental
> > > > could
> > > > > be difficult from a user perspective. Personally, I don't really
> feel
> > > > > comfortable solving the problem of known issues by marking
> something
> > as
> > > > > experimental. We're basically shifting the responsibility to our
> > users
> > > > that
> > > > > way.
> > > > >
> > > > > I don't think we should stop testing MKLDNN in our CI. We already
> had
> > > the
> > > > > situation a few months ago where the solution to failed tests was
> to
> > > > > disable them. We shouldn't go back to that.
> > > > >
> > > > > -Marco
> > > > >
> > > > > [1]:
> > > > > https://github.com/apache/incubator-mxnet/issues?q=is%
> > > > > 3Aopen+is%3Aissue+label%3ATest
> > > > >
> > > > > On Thu, May 3, 2018 at 8:46 PM, Naveen Swamy <mnnaveen@gmail.com>
> > > wrote:
> > > > >
> > > > > > USE_MKLDNN is set to ON in the cmake file by default, since
its
> > > > > > experimental can we turn OFF  so there is some determinism when
> > users
> > > > > build
> > > > > > and test.
> > > > > >
> > > > > > https://github.com/apache/incubator-mxnet/blob/
> > > > > > 60641ef1183bb4584c9356e84b6ca6d5fce58d6d/CMakeLists.txt#L23
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On a separate note, since MKLDNN is experimental can we stop
> > building
> > > > on
> > > > > > master and cause PR's to queue up.
> > > > > >
> > > > > >
> > > > > > On Thu, May 3, 2018 at 11:33 AM, Anirudh <anirudh2290@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Correction: I was able to reproduce the issue with MKLDNN
> enabled
> > > on
> > > > > > > master, but not on 1.2 branch.
> > > > > > >
> > > > > > > On Thu, May 3, 2018 at 11:33 AM, Anirudh <
> anirudh2290@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Pedro and Naveen,
> > > > > > > >
> > > > > > > > I am unable to reproduce this issue with MKLDNN on
the master
> > but
> > > > not
> > > > > > on
> > > > > > > > the 1.2.RC2 branch.
> > > > > > > >
> > > > > > > > Did the following on 1.2.RC2 branch:
> > > > > > > >
> > > > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas
> > > USE_DIST_KVSTORE=0
> > > > > > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > > > > > export MXNET_TEST_SEED=11
> > > > > > > > export MXNET_MODULE_SEED=812478194
> > > > > > > > export MXNET_TEST_COUNT=10000
> > > > > > > > nosetests-2.7 -v tests/python/unittest/test_
> > > > > > > module.py:test_forward_reshape
> > > > > > > >
> > > > > > > > Was able to do the 10k runs successfully.
> > > > > > > >
> > > > > > > > Anirudh
> > > > > > > >
> > > > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh <
> anirudh2290@gmail.com
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > >> Hi Pedro and Naveen,
> > > > > > > >>
> > > > > > > >> Is this issue reproducible when MXNet is built
with
> > > USE_MKLDNN=0?
> > > > > > > >> Also, there are a bunch of MKLDNN fixes that didn't
go into
> > the
> > > > > > release
> > > > > > > >> branch. Is this issue reproducible on the release
branch ?
> > > > > > > >> In my opinion, since we have marked MKLDNN as
experimental
> > > feature
> > > > > for
> > > > > > > >> the release, if it is confirmed to be a MKLDNN
issue
> > > > > > > >> we don't need to block the release on it.
> > > > > > > >>
> > > > > > > >> Anirudh
> > > > > > > >>
> > > > > > > >> On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy <
> > > mnnaveen@gmail.com>
> > > > > > > wrote:
> > > > > > > >>
> > > > > > > >>> Thanks for raising this issue Pedro.
> > > > > > > >>>
> > > > > > > >>> -1(binding)
> > > > > > > >>>
> > > > > > > >>> We were in a similar state for a while a year
ago, a lot of
> > > > effort
> > > > > > went
> > > > > > > >>> to
> > > > > > > >>> stabilize the tests and the CI. I have seen
the PR builds
> are
> > > > > > > >>> non-deterministic and you have to retry over
and over
> > (wasting
> > > > > > > resources
> > > > > > > >>> and time) and hope you get lucky.
> > > > > > > >>>
> > > > > > > >>> Look at the dashboard for master build
> > > > > > > >>> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-
> > > > > > mxnet/job/master/
> > > > > > > >>>
> > > > > > > >>> -Naveen
> > > > > > > >>>
> > > > > > > >>> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy
<
> > > > > > > >>> pedro.larroy.lists@gmail.com>
> > > > > > > >>> wrote:
> > > > > > > >>>
> > > > > > > >>> > -1  nondeterminisitc failures on CI master:
> > > > > > > >>> > https://issues.apache.org/jira/browse/MXNET-396
> > > > > > > >>> >
> > > > > > > >>> > Was able to reproduce once in a fresh
p3 instance with
> > DLAMI
> > > > > can't
> > > > > > > >>> > reproduce consistently.
> > > > > > > >>> >
> > > > > > > >>> > On Wed, May 2, 2018 at 9:51 PM, Anirudh
<
> > > anirudh2290@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >>> >
> > > > > > > >>> > > Hi all,
> > > > > > > >>> > >
> > > > > > > >>> > > As part of RC2 release, we have
addressed bugs and some
> > > > > concerns
> > > > > > > that
> > > > > > > >>> > were
> > > > > > > >>> > > raised.
> > > > > > > >>> > >
> > > > > > > >>> > > I would like to propose a vote to
release Apache MXNet
> > > > > > (incubating)
> > > > > > > >>> > version
> > > > > > > >>> > > 1.2.0.RC2. Voting will start now
(Wednesday, May 2nd)
> and
> > > end
> > > > > at
> > > > > > > >>> 12:50 PM
> > > > > > > >>> > > PDT, Sunday, May 6th.
> > > > > > > >>> > >
> > > > > > > >>> > > Link to release notes:
> > > > > > > >>> > > https://cwiki.apache.org/confluence/display/MXNET/
> > > > > > > >>> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > > > > > > >>> > >
> > > > > > > >>> > > Link to release candidate 1.2.0.rc2:
> > > > > > > >>> > > https://github.com/apache/
> incubator-mxnet/releases/tag/
> > > > > 1.2.0.rc2
> > > > > > > >>> > >
> > > > > > > >>> > > Voting results for 1.2.0.rc2:
> > > > > > > >>> > > https://lists.apache.org/thread.html/
> > > > > > > ebe561c609a8e32351dfe4aafc8876
> > > > > > > >>> > > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > > > > > > >>> > >
> > > > > > > >>> > > View this page, click on "Build
from Source", and use
> the
> > > > > source
> > > > > > > code
> > > > > > > >>> > > obtained from 1.2.0.rc2 tag:
> > > > > > > >>> > > https://mxnet.incubator.apache.org/install/index.html
> > > > > > > >>> > >
> > > > > > > >>> > > (Note: The README.md points to the
1.2.0 tag and does
> not
> > > > work
> > > > > at
> > > > > > > the
> > > > > > > >>> > > moment.)
> > > > > > > >>> > >
> > > > > > > >>> > > Please remember to test first before
voting
> accordingly:
> > > > > > > >>> > >
> > > > > > > >>> > > +1 = approve
> > > > > > > >>> > > +0 = no opinion
> > > > > > > >>> > > -1 = disapprove (provide reason)
> > > > > > > >>> > >
> > > > > > > >>> > > Anirudh
> > > > > > > >>> > >
> > > > > > > >>> >
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message