mxnet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Naveen Swamy <mnnav...@gmail.com>
Subject Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2
Date Thu, 03 May 2018 21:32:56 GMT
+ 0

On Thu, May 3, 2018 at 12:44 PM, Anirudh <anirudh2290@gmail.com> wrote:

> Hi Naveen,
>
> You raise a good point and I agree that by default MKLDNN by default should
> be switched off.
> Because of a bug in Cmakelists.txt which has been fixed as part of #10731,
> which is merged to master (but not on the release branch),
> the users won't have MKLDNN enabled even though MKLDNN is set to ON by
> default.
> Since cmake install instructions have not been published in mxnet.io and
> is
> lesser used compared to pip package and make from what I have seen,
>  would it be acceptable for you if this is added as a known issue and a
> workaround is provided in the release notes ?
> The impacted users(who will have to use the workaround) here would be the
> customers who are interested in MKLDNN and use cmake, users who aren't
> interested in using MKLDNN feature won't be impacted.
>
> Anirudh
>
> On Thu, May 3, 2018 at 12:16 PM, Marco de Abreu <
> marco.g.abreu@googlemail.com> wrote:
>
> > The MKLDNN tests are not really less stable than the other tests. It's
> > pretty much the same across all tests we have. So I wouldn't say there's
> a
> > need to fix them in a separate branch.
> >
> > On Thu, May 3, 2018 at 9:00 PM, Naveen Swamy <mnnaveen@gmail.com> wrote:
> >
> > > I also meant(but forgot to send), we stabilize it on a separate branch
> > and
> > > then bring in the changes instead of blocking the PRs.
> > >
> > > On Thu, May 3, 2018 at 11:57 AM, Marco de Abreu <
> > > marco.g.abreu@googlemail.com> wrote:
> > >
> > > > I think the failing tests are really getting an issue. We now got
> > roughly
> > > > 50 test failure related issues [1], leading to a average failure rate
> > of
> > > > 50%. Considering the costs in terms of money and time per run, this
> is
> > > > adding up quite significantly.
> > > >
> > > > Didn't we just remove MKLML from our codebase to replace it with
> > MKLDNN?
> > > I
> > > > think removing something and marking the replacement as experimental
> > > could
> > > > be difficult from a user perspective. Personally, I don't really feel
> > > > comfortable solving the problem of known issues by marking something
> as
> > > > experimental. We're basically shifting the responsibility to our
> users
> > > that
> > > > way.
> > > >
> > > > I don't think we should stop testing MKLDNN in our CI. We already had
> > the
> > > > situation a few months ago where the solution to failed tests was to
> > > > disable them. We shouldn't go back to that.
> > > >
> > > > -Marco
> > > >
> > > > [1]:
> > > > https://github.com/apache/incubator-mxnet/issues?q=is%
> > > > 3Aopen+is%3Aissue+label%3ATest
> > > >
> > > > On Thu, May 3, 2018 at 8:46 PM, Naveen Swamy <mnnaveen@gmail.com>
> > wrote:
> > > >
> > > > > USE_MKLDNN is set to ON in the cmake file by default, since its
> > > > > experimental can we turn OFF  so there is some determinism when
> users
> > > > build
> > > > > and test.
> > > > >
> > > > > https://github.com/apache/incubator-mxnet/blob/
> > > > > 60641ef1183bb4584c9356e84b6ca6d5fce58d6d/CMakeLists.txt#L23
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On a separate note, since MKLDNN is experimental can we stop
> building
> > > on
> > > > > master and cause PR's to queue up.
> > > > >
> > > > >
> > > > > On Thu, May 3, 2018 at 11:33 AM, Anirudh <anirudh2290@gmail.com>
> > > wrote:
> > > > >
> > > > > > Correction: I was able to reproduce the issue with MKLDNN enabled
> > on
> > > > > > master, but not on 1.2 branch.
> > > > > >
> > > > > > On Thu, May 3, 2018 at 11:33 AM, Anirudh <anirudh2290@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hi Pedro and Naveen,
> > > > > > >
> > > > > > > I am unable to reproduce this issue with MKLDNN on the
master
> but
> > > not
> > > > > on
> > > > > > > the 1.2.RC2 branch.
> > > > > > >
> > > > > > > Did the following on 1.2.RC2 branch:
> > > > > > >
> > > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas
> > USE_DIST_KVSTORE=0
> > > > > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > > > > export MXNET_TEST_SEED=11
> > > > > > > export MXNET_MODULE_SEED=812478194
> > > > > > > export MXNET_TEST_COUNT=10000
> > > > > > > nosetests-2.7 -v tests/python/unittest/test_
> > > > > > module.py:test_forward_reshape
> > > > > > >
> > > > > > > Was able to do the 10k runs successfully.
> > > > > > >
> > > > > > > Anirudh
> > > > > > >
> > > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh <anirudh2290@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > >> Hi Pedro and Naveen,
> > > > > > >>
> > > > > > >> Is this issue reproducible when MXNet is built with
> > USE_MKLDNN=0?
> > > > > > >> Also, there are a bunch of MKLDNN fixes that didn't
go into
> the
> > > > > release
> > > > > > >> branch. Is this issue reproducible on the release branch
?
> > > > > > >> In my opinion, since we have marked MKLDNN as experimental
> > feature
> > > > for
> > > > > > >> the release, if it is confirmed to be a MKLDNN issue
> > > > > > >> we don't need to block the release on it.
> > > > > > >>
> > > > > > >> Anirudh
> > > > > > >>
> > > > > > >> On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy <
> > mnnaveen@gmail.com>
> > > > > > wrote:
> > > > > > >>
> > > > > > >>> Thanks for raising this issue Pedro.
> > > > > > >>>
> > > > > > >>> -1(binding)
> > > > > > >>>
> > > > > > >>> We were in a similar state for a while a year ago,
a lot of
> > > effort
> > > > > went
> > > > > > >>> to
> > > > > > >>> stabilize the tests and the CI. I have seen the
PR builds are
> > > > > > >>> non-deterministic and you have to retry over and
over
> (wasting
> > > > > > resources
> > > > > > >>> and time) and hope you get lucky.
> > > > > > >>>
> > > > > > >>> Look at the dashboard for master build
> > > > > > >>> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-
> > > > > mxnet/job/master/
> > > > > > >>>
> > > > > > >>> -Naveen
> > > > > > >>>
> > > > > > >>> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> > > > > > >>> pedro.larroy.lists@gmail.com>
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > >>> > -1  nondeterminisitc failures on CI master:
> > > > > > >>> > https://issues.apache.org/jira/browse/MXNET-396
> > > > > > >>> >
> > > > > > >>> > Was able to reproduce once in a fresh p3 instance
with
> DLAMI
> > > > can't
> > > > > > >>> > reproduce consistently.
> > > > > > >>> >
> > > > > > >>> > On Wed, May 2, 2018 at 9:51 PM, Anirudh <
> > anirudh2290@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >>> >
> > > > > > >>> > > Hi all,
> > > > > > >>> > >
> > > > > > >>> > > As part of RC2 release, we have addressed
bugs and some
> > > > concerns
> > > > > > that
> > > > > > >>> > were
> > > > > > >>> > > raised.
> > > > > > >>> > >
> > > > > > >>> > > I would like to propose a vote to release
Apache MXNet
> > > > > (incubating)
> > > > > > >>> > version
> > > > > > >>> > > 1.2.0.RC2. Voting will start now (Wednesday,
May 2nd) and
> > end
> > > > at
> > > > > > >>> 12:50 PM
> > > > > > >>> > > PDT, Sunday, May 6th.
> > > > > > >>> > >
> > > > > > >>> > > Link to release notes:
> > > > > > >>> > > https://cwiki.apache.org/confluence/display/MXNET/
> > > > > > >>> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > > > > > >>> > >
> > > > > > >>> > > Link to release candidate 1.2.0.rc2:
> > > > > > >>> > > https://github.com/apache/incubator-mxnet/releases/tag/
> > > > 1.2.0.rc2
> > > > > > >>> > >
> > > > > > >>> > > Voting results for 1.2.0.rc2:
> > > > > > >>> > > https://lists.apache.org/thread.html/
> > > > > > ebe561c609a8e32351dfe4aafc8876
> > > > > > >>> > > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > > > > > >>> > >
> > > > > > >>> > > View this page, click on "Build from
Source", and use the
> > > > source
> > > > > > code
> > > > > > >>> > > obtained from 1.2.0.rc2 tag:
> > > > > > >>> > > https://mxnet.incubator.apache.org/install/index.html
> > > > > > >>> > >
> > > > > > >>> > > (Note: The README.md points to the 1.2.0
tag and does not
> > > work
> > > > at
> > > > > > the
> > > > > > >>> > > moment.)
> > > > > > >>> > >
> > > > > > >>> > > Please remember to test first before
voting accordingly:
> > > > > > >>> > >
> > > > > > >>> > > +1 = approve
> > > > > > >>> > > +0 = no opinion
> > > > > > >>> > > -1 = disapprove (provide reason)
> > > > > > >>> > >
> > > > > > >>> > > Anirudh
> > > > > > >>> > >
> > > > > > >>> >
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message