mxnet-dev mailing list archives

From Lai Wei <roywei...@gmail.com>
Subject Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2
Date Mon, 07 May 2018 20:22:46 GMT
Hi,

I would like to raise an issue with mxnet-mkl. The keras-mxnet package was
working fine with mxnet-mkl 1.1.0 for training on CPU. However, weights are
not updated when I use mxnet-mkl 1.2.0b20180507. I tried both 'pip install
mxnet-mkl --pre' and building from source from the release branch (v1.2.0)
with the mkl flag.

Please refer to this issue for more details:
https://github.com/awslabs/keras-apache-mxnet/issues/75

There is no code change on the keras-mxnet side, so I guess some API broke
in the latest mxnet-mkl. I am still working on finding the root cause.
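
A minimal, framework-agnostic sketch of the check behind the "weights are
not updated" symptom: snapshot each layer's weights before and after one
training step and list the layers that did not move. The helper name
`unchanged_layers` and the layer names are hypothetical, and the weights are
assumed to be flattened to plain sequences of floats (for example from the
arrays returned by Keras `model.get_weights()`). This is not the
reproduction script from the linked issue.

```python
from typing import Dict, List, Sequence


def unchanged_layers(
    before: Dict[str, Sequence[float]],
    after: Dict[str, Sequence[float]],
    tol: float = 0.0,
) -> List[str]:
    """Return names of layers whose weights moved by at most `tol`."""
    stuck = []
    for name, old in before.items():
        new = after[name]
        # Largest absolute per-element change for this layer.
        max_delta = max((abs(a - b) for a, b in zip(old, new)), default=0.0)
        if max_delta <= tol:
            stuck.append(name)
    return stuck


# Hypothetical snapshots taken before and after one training step.
before = {"dense_1": [0.10, -0.20], "dense_2": [0.05]}
after = {"dense_1": [0.10, -0.20], "dense_2": [0.04]}
print(unchanged_layers(before, after))
```

If every layer shows up in the result after a step with a non-zero learning
rate, the update is being dropped somewhere between the optimizer and the
backend, which matches the behaviour described above.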

Thanks


Best Regards

Lai Wei

https://www.linkedin.com/pub/lai-wei/2b/731/52b

On Mon, May 7, 2018 at 10:38 AM, Haibin Lin <haibin.lin.aws@gmail.com>
wrote:

> +1 (binding). Built from source with CUDA and ran the linear classification
> example; it works fine.
>
> Best.
> Haibin
>
>
> On Sun, May 6, 2018 at 10:08 PM, Steffen Rochel <steffenrochel@gmail.com>
> wrote:
>
> > +1 (non-binding). Tested with selected notebooks from The Straight Dope.
> > Everybody contributed so many important enhancements that our users are
> > waiting for. I hope we will see more votes.
> > Steffen
> > On Mon, May 7, 2018 at 1:07 AM Anirudh <anirudh2290@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Since we don't have enough binding votes yet, I am extending the vote
> > > till tomorrow (Monday May 7th), 12:50 PM PDT.
> > >
> > > Anirudh
> > >
> > > On Sun, May 6, 2018 at 4:05 PM, Anirudh <anirudh2290@gmail.com> wrote:
> > >
> > > > Hi Pedro,
> > > >
> > > > Thanks for the clarification. I was able to reproduce the issue with
> > > > USE_OPENMP=OFF. I wasn't able to reproduce the issue with make. Since
> > > > the issue is not reproducible with make and the number of customers
> > > > using USE_OPENMP=OFF with cmake should be small, I agree with you that
> > > > this should not be a blocker. I have added the issue to the known
> > > > issues in the release notes:
> > > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > > >
> > > > Anirudh
> > > >
> > > > On Sun, May 6, 2018 at 9:03 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > >
> > > > >> Agreed, I was not aware that the problems were not present in the
> > > > >> release branch.
> > > >>
> > > > >> On Fri, May 4, 2018 at 8:32 PM, Haibin Lin <haibin.lin.aws@gmail.com> wrote:
> > > >>
> > > > >> > I agree with Anirudh that the focus of the discussion should be
> > > > >> > limited to the release branch, not the master branch. Anything that
> > > > >> > breaks on master but works on the release branch should not block
> > > > >> > the release itself.
> > > >> >
> > > >> >
> > > >> > Best,
> > > >> >
> > > >> > Haibin
> > > >> >
> > > > >> > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > >> >
> > > >> > > I see your point.
> > > >> > >
> > > > >> > > I checked the failures on the v1.2.0 branch and I don't see
> > > > >> > > segfaults, just minor failures due to flaky tests.
> > > >> > >
> > > >> > > I will trigger it repeatedly a few times until Sunday to
have a
> > and
> > > >> > change
> > > >> > > my vote accordingly.
> > > >> > >
> > > >> > >
> > > > >> > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
> > > > >> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/17/pipeline
> > > > >> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/15/pipeline/
> > > >> > >
> > > >> > >
> > > >> > > Pedro.
> > > >> > >
> > > > >> > > On Fri, May 4, 2018 at 7:16 PM, Anirudh <anirudh2290@gmail.com> wrote:
> > > >> > >
> > > >> > > > Hi Pedro,
> > > >> > > >
> > > > >> > > > Thank you for the suggestions. I will try to reproduce this
> > > > >> > > > without fixed seeds and also run it for a longer duration.
> > > > >> > > > Having said that, running unit tests over and over for a couple
> > > > >> > > > of days will likely cause problems because there are around 42
> > > > >> > > > open issues for flaky tests:
> > > > >> > > > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> > > > >> > > > Also, the release branch diverged from master around 3 weeks
> > > > >> > > > back and it doesn't have many of the changes merged to master.
> > > > >> > > > So, my question essentially is, what will be your benchmark to
> > > > >> > > > accept the release?
> > > > >> > > > Is it that we run the test which you provided on 1.2 without
> > > > >> > > > fixed seeds and for a longer duration without failures?
> > > > >> > > > Or is it that all unit tests should pass over a period of 2 days
> > > > >> > > > without issues? This may require fixing all of the flaky tests,
> > > > >> > > > which would delay the release by a considerable amount of time.
> > > > >> > > > Or is it something else?
> > > >> > > >
> > > >> > > > Anirudh
> > > >> > > >
> > > >> > > >
> > > > >> > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > >> > > >
> > > > >> > > > > Could you remove the fixed seeds and run it for a couple of
> > > > >> > > > > hours with an additional loop?  Also I would suggest running
> > > > >> > > > > the unit tests over and over for a couple of days if possible.
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > Pedro.
> > > >> > > > >
> > > > >> > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh <anirudh2290@gmail.com> wrote:
> > > >> > > > >
> > > >> > > > > > Hi Pedro and Naveen,
> > > >> > > > > >
> > > > >> > > > > > I am able to reproduce this issue with MKLDNN on master,
> > > > >> > > > > > but not on the 1.2.RC2 branch.
> > > >> > > > > >
> > > >> > > > > > Did the following on 1.2.RC2 branch:
> > > >> > > > > >
> > > > >> > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > >> > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > >> > > > > > export MXNET_TEST_SEED=11
> > > > >> > > > > > export MXNET_MODULE_SEED=812478194
> > > > >> > > > > > export MXNET_TEST_COUNT=10000
> > > > >> > > > > > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
> > > >> > > > > >
> > > >> > > > > > Was able to do the 10k runs successfully.
> > > >> > > > > >
> > > >> > > > > > Anirudh
> > > >> > > > > >
> > > > >> > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh <anirudh2290@gmail.com> wrote:
> > > >> > > > > >
> > > >> > > > > > > Hi Pedro and Naveen,
> > > >> > > > > > >
> > > > >> > > > > > > Is this issue reproducible when MXNet is built with
> > > > >> > > > > > > USE_MKLDNN=0?
> > > > >> > > > > > > Also, there are a bunch of MKLDNN fixes that didn't go into
> > > > >> > > > > > > the release branch. Is this issue reproducible on the
> > > > >> > > > > > > release branch?
> > > > >> > > > > > > In my opinion, since we have marked MKLDNN as an
> > > > >> > > > > > > experimental feature for the release, if it is confirmed to
> > > > >> > > > > > > be an MKLDNN issue we don't need to block the release on it.
> > > >> > > > > > >
> > > >> > > > > > > Anirudh
> > > >> > > > > > >
> > > > >> > > > > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy <mnnaveen@gmail.com> wrote:
> > > >> > > > > > >
> > > >> > > > > > >> Thanks for raising this issue Pedro.
> > > >> > > > > > >>
> > > >> > > > > > >> -1(binding)
> > > >> > > > > > >>
> > > > >> > > > > > >> We were in a similar state for a while a year ago, and a lot
> > > > >> > > > > > >> of effort went into stabilizing the tests and the CI. I have
> > > > >> > > > > > >> seen that the PR builds are non-deterministic and you have
> > > > >> > > > > > >> to retry over and over (wasting resources and time) and hope
> > > > >> > > > > > >> you get lucky.
> > > >> > > > > > >>
> > > > >> > > > > > >> Look at the dashboard for master build
> > > > >> > > > > > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> > > >> > > > > > >>
> > > >> > > > > > >> -Naveen
> > > >> > > > > > >>
> > > > >> > > > > > >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > >> > > > > > >>
> > > > >> > > > > > >> > -1  nondeterministic failures on CI master:
> > > > >> > > > > > >> > https://issues.apache.org/jira/browse/MXNET-396
> > > > >> > > > > > >> >
> > > > >> > > > > > >> > Was able to reproduce once in a fresh p3 instance with
> > > > >> > > > > > >> > DLAMI, but can't reproduce consistently.
> > > >> > > > > > >> >
> > > > >> > > > > > >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh <anirudh2290@gmail.com> wrote:
> > > >> > > > > > >> >
> > > >> > > > > > >> > > Hi all,
> > > >> > > > > > >> > >
> > > > >> > > > > > >> > > As part of RC2 release, we have addressed bugs and some
> > > > >> > > > > > >> > > concerns that were raised.
> > > > >> > > > > > >> > >
> > > > >> > > > > > >> > > I would like to propose a vote to release Apache MXNet
> > > > >> > > > > > >> > > (incubating) version 1.2.0.RC2. Voting will start now
> > > > >> > > > > > >> > > (Wednesday, May 2nd) and end at 12:50 PM PDT, Sunday,
> > > > >> > > > > > >> > > May 6th.
> > > >> > > > > > >> > >
> > > > >> > > > > > >> > > Link to release notes:
> > > > >> > > > > > >> > > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > > > >> > > > > > >> > >
> > > > >> > > > > > >> > > Link to release candidate 1.2.0.rc2:
> > > > >> > > > > > >> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > > > >> > > > > > >> > >
> > > > >> > > > > > >> > > Voting results for 1.2.0.rc2:
> > > > >> > > > > > >> > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > > >> > > > > > >> > >
> > > > >> > > > > > >> > > View this page, click on "Build from Source", and use
> > > > >> > > > > > >> > > the source code obtained from 1.2.0.rc2 tag:
> > > > >> > > > > > >> > > https://mxnet.incubator.apache.org/install/index.html
> > > > >> > > > > > >> > >
> > > > >> > > > > > >> > > (Note: The README.md points to the 1.2.0 tag and does
> > > > >> > > > > > >> > > not work at the moment.)
> > > > >> > > > > > >> > >
> > > > >> > > > > > >> > > Please remember to test first before voting accordingly:
> > > > >> > > > > > >> > >
> > > > >> > > > > > >> > > +1 = approve
> > > > >> > > > > > >> > > +0 = no opinion
> > > > >> > > > > > >> > > -1 = disapprove (provide reason)
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > Anirudh
> > > >> > > > > > >> > >
> > > >> > > > > > >> >
> > > >> > > > > > >>
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>
