mxnet-dev mailing list archives

From Lai Wei <roywei...@gmail.com>
Subject Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2
Date Tue, 08 May 2018 00:39:00 GMT
Hi Anirudh,

Update: I did an install on a fresh instance with USE_MKLDNN=1, and it works
fine now. Pip install with --pre is also working fine.
The problem was the mkl-dnn I had installed on the old instance.
Closing the issue <https://github.com/awslabs/keras-apache-mxnet/issues/75>.

Thanks!

Best Regards

Lai Wei

https://www.linkedin.com/pub/lai-wei/2b/731/52b

On Mon, May 7, 2018 at 2:48 PM, Lai Wei <royweilai@gmail.com> wrote:

> Hi Anirudh,
>
> Yes, I also tried that, but it didn't resolve the issue. I am looking into the
> root cause and will update.
>
> Best Regards
>
> Lai Wei
>
> https://www.linkedin.com/pub/lai-wei/2b/731/52b
>
> On Mon, May 7, 2018 at 2:15 PM, Anirudh <anirudh2290@gmail.com> wrote:
>
>> Hi Lai,
>>
>> I see that you used USE_MKL2017_EXPERIMENTAL=1. I am not sure if this is
>> the right flag. Did you try USE_MKLDNN=1?
>>
>> Anirudh
>>
>> On Mon, May 7, 2018 at 1:22 PM, Lai Wei <royweilai@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I would like to raise an issue with mxnet-mkl. The keras-mxnet package was
>> > working fine with mxnet-mkl 1.1.0 for training on CPU. However, weights are
>> > not updated when I use mxnet-mkl 1.2.0b20180507. I tried both 'pip install
>> > mxnet-mkl --pre' and built from source from release branch (v1.2.0) with
>> > mkl flag.
>> >
>> > Please refer to this issue for more details:
>> > https://github.com/awslabs/keras-apache-mxnet/issues/75
>> >
>> > There is no code change on keras-mxnet side, so I guess some API broke when
>> > using latest mxnet-mkl. Still working on finding the root cause.
>> >
>> > Thanks
>> >
>> >
>> > Best Regards
>> >
>> > Lai Wei
>> >
>> > https://www.linkedin.com/pub/lai-wei/2b/731/52b
>> >
>> > On Mon, May 7, 2018 at 10:38 AM, Haibin Lin <haibin.lin.aws@gmail.com>
>> > wrote:
>> >
>> > > +1 binding. Built from source with CUDA, ran the linear classification
>> > > example, and it works fine.
>> > >
>> > > Best.
>> > > Haibin
>> > >
>> > >
>> > > On Sun, May 6, 2018 at 10:08 PM, Steffen Rochel <steffenrochel@gmail.com>
>> > > wrote:
>> > >
>> > > > +1 (non-binding). Tested with selected notebooks from The Straight Dope.
>> > > > So many important enhancements everybody contributed and our users are
>> > > > waiting for. Hope we will see more votes.
>> > > > Steffen
>> > > > On Mon, May 7, 2018 at 1:07 AM Anirudh <anirudh2290@gmail.com> wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > Since we don't have enough binding votes yet, I am extending the vote
>> > > > > till tomorrow (Monday May 7th), 12:50 PM PDT.
>> > > > >
>> > > > > Anirudh
>> > > > >
>> > > > > On Sun, May 6, 2018 at 4:05 PM, Anirudh <anirudh2290@gmail.com> wrote:
>> > > > >
>> > > > > > Hi Pedro,
>> > > > > >
>> > > > > > Thanks for the clarification. I was able to reproduce the issue with
>> > > > > > USE_OPENMP=OFF. I wasn't able to reproduce the issue with Make. Since
>> > > > > > the issue is not reproducible with make and the number of customers
>> > > > > > using USE_OPENMP=OFF with cmake should be small, I agree with you that
>> > > > > > this should not be a blocker. I have added the issue to known issues in
>> > > > > > release notes:
>> > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
>> > > > > >
>> > > > > > Anirudh
>> > > > > >
>> > > > > > On Sun, May 6, 2018 at 9:03 AM, Pedro Larroy <pedro.larroy.lists@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > >> Agreed, I was not aware that the problems were not present in the
>> > > > > >> release branch.
>> > > > > >>
>> > > > > >> On Fri, May 4, 2018 at 8:32 PM, Haibin Lin <haibin.lin.aws@gmail.com>
>> > > > > >> wrote:
>> > > > > >>
>> > > > > >> > I agree with Anirudh that the focus of the discussion should be
>> > > > > >> > limited to the release branch, not the master branch. Anything that
>> > > > > >> > breaks on master but works on release branch should not block the
>> > > > > >> > release itself.
>> > > > > >> >
>> > > > > >> >
>> > > > > >> > Best,
>> > > > > >> >
>> > > > > >> > Haibin
>> > > > > >> >
>> > > > > >> > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <
>> > > > > >> > pedro.larroy.lists@gmail.com>
>> > > > > >> > wrote:
>> > > > > >> >
>> > > > > >> > > I see your point.
>> > > > > >> > >
>> > > > > >> > > I checked the failures on the v1.2.0 branch and I don't see segfaults,
>> > > > > >> > > just minor failures due to flaky tests.
>> > > > > >> > >
>> > > > > >> > > I will trigger it repeatedly a few times until Sunday to have a and
>> > > > > >> > > change my vote accordingly.
>> > > > > >> > >
>> > > > > >> > >
>> > > > > >> > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
>> > > > > >> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/17/pipeline
>> > > > > >> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/15/pipeline/
>> > > > > >> > >
>> > > > > >> > >
>> > > > > >> > > Pedro.
>> > > > > >> > >
>> > > > > >> > > On Fri, May 4, 2018 at 7:16 PM, Anirudh <anirudh2290@gmail.com> wrote:
>> > > > > >> > >
>> > > > > >> > > > Hi Pedro,
>> > > > > >> > > >
>> > > > > >> > > > Thank you for the suggestions. I will try to reproduce this without
>> > > > > >> > > > fixed seeds and also run it for a longer time duration.
>> > > > > >> > > > Having said that, running unit tests over and over for a couple of
>> > > > > >> > > > days will likely cause problems because there are around 42 open
>> > > > > >> > > > issues for flaky tests:
>> > > > > >> > > > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
>> > > > > >> > > > Also, the release branch diverged from master around 3 weeks back
>> > > > > >> > > > and it doesn't have many of the changes merged to the master.
>> > > > > >> > > > So, my question essentially is, what will be your benchmark to
>> > > > > >> > > > accept the release?
>> > > > > >> > > > Is it that we run the test which you provided on 1.2 without fixed
>> > > > > >> > > > seeds and for a longer duration without failures?
>> > > > > >> > > > Or is it that all unit tests should pass over a period of 2 days
>> > > > > >> > > > without issues? This may require fixing all of the flaky tests, which
>> > > > > >> > > > would delay the release by a considerable amount of time.
>> > > > > >> > > > Or is it something else?
>> > > > > >> > > >
>> > > > > >> > > > Anirudh
>> > > > > >> > > >
>> > > > > >> > > >
>> > > > > >> > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <pedro.larroy.lists@gmail.com>
>> > > > > >> > > > wrote:
>> > > > > >> > > >
>> > > > > >> > > > > Could you remove the fixed seeds and run it for a couple of hours
>> > > > > >> > > > > with an additional loop? Also I would suggest running the unit tests
>> > > > > >> > > > > over and over for a couple of days if possible.
>> > > > > >> > > > >
>> > > > > >> > > > >
>> > > > > >> > > > > Pedro.
>> > > > > >> > > > >
>> > > > > >> > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh <anirudh2290@gmail.com> wrote:
>> > > > > >> > > > >
>> > > > > >> > > > > > Hi Pedro and Naveen,
>> > > > > >> > > > > >
>> > > > > >> > > > > > I am able to reproduce this issue with MKLDNN on the master but not
>> > > > > >> > > > > > on the 1.2.RC2 branch.
>> > > > > >> > > > > >
>> > > > > >> > > > > > Did the following on 1.2.RC2
branch:
>> > > > > >> > > > > >
>> > > > > >> > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
>> > > > > >> > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
>> > > > > >> > > > > > export MXNET_TEST_SEED=11
>> > > > > >> > > > > > export MXNET_MODULE_SEED=812478194
>> > > > > >> > > > > > export MXNET_TEST_COUNT=10000
>> > > > > >> > > > > > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
>> > > > > >> > > > > >
>> > > > > >> > > > > > Was able to do the 10k runs successfully.
>> > > > > >> > > > > >
>> > > > > >> > > > > > Anirudh
>> > > > > >> > > > > >
>> > > > > >> > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh <anirudh2290@gmail.com> wrote:
>> > > > > >> > > > > >
>> > > > > >> > > > > > > Hi Pedro and Naveen,
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
>> > > > > >> > > > > > > Also, there are a bunch of MKLDNN fixes that didn't go into the
>> > > > > >> > > > > > > release branch. Is this issue reproducible on the release branch?
>> > > > > >> > > > > > > In my opinion, since we have marked MKLDNN as an experimental feature
>> > > > > >> > > > > > > for the release, if it is confirmed to be a MKLDNN issue
>> > > > > >> > > > > > > we don't need to block the release on it.
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > Anirudh
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy <mnnaveen@gmail.com> wrote:
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >> Thanks for raising this issue Pedro.
>> > > > > >> > > > > > >>
>> > > > > >> > > > > > >> -1(binding)
>> > > > > >> > > > > > >>
>> > > > > >> > > > > > >> We were in a similar state for a while a year ago, a lot of effort
>> > > > > >> > > > > > >> went to stabilize the tests and the CI. I have seen the PR builds are
>> > > > > >> > > > > > >> non-deterministic and you have to retry over and over (wasting
>> > > > > >> > > > > > >> resources and time) and hope you get lucky.
>> > > > > >> > > > > > >>
>> > > > > >> > > > > > >> Look at the dashboard for master build
>> > > > > >> > > > > > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
>> > > > > >> > > > > > >>
>> > > > > >> > > > > > >> -Naveen
>> > > > > >> > > > > > >>
>> > > > > >> > > > > > >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <pedro.larroy.lists@gmail.com>
>> > > > > >> > > > > > >> wrote:
>> > > > > >> > > > > > >>
>> > > > > >> > > > > > >> > -1  nondeterministic failures on CI master:
>> > > > > >> > > > > > >> > https://issues.apache.org/jira/browse/MXNET-396
>> > > > > >> > > > > > >> >
>> > > > > >> > > > > > >> > Was able to reproduce once in a fresh p3 instance with DLAMI, can't
>> > > > > >> > > > > > >> > reproduce consistently.
>> > > > > >> > > > > > >> >
>> > > > > >> > > > > > >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh <anirudh2290@gmail.com> wrote:
>> > > > > >> > > > > > >> >
>> > > > > >> > > > > > >> > > Hi all,
>> > > > > >> > > > > > >> > >
>> > > > > >> > > > > > >> > > As part of RC2 release, we have addressed bugs and some concerns that
>> > > > > >> > > > > > >> > > were raised.
>> > > > > >> > > > > > >> > >
>> > > > > >> > > > > > >> > > I would like to propose a vote to release Apache MXNet (incubating)
>> > > > > >> > > > > > >> > > version 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end
>> > > > > >> > > > > > >> > > at 12:50 PM PDT, Sunday, May 6th.
>> > > > > >> > > > > > >> > >
>> > > > > >> > > > > > >> > > Link to release notes:
>> > > > > >> > > > > > >> > > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
>> > > > > >> > > > > > >> > >
>> > > > > >> > > > > > >> > > Link to release candidate 1.2.0.rc2:
>> > > > > >> > > > > > >> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
>> > > > > >> > > > > > >> > >
>> > > > > >> > > > > > >> > > Voting results for 1.2.0.rc2:
>> > > > > >> > > > > > >> > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
>> > > > > >> > > > > > >> > >
>> > > > > >> > > > > > >> > > View this page, click on "Build from Source", and use the source code
>> > > > > >> > > > > > >> > > obtained from 1.2.0.rc2 tag:
>> > > > > >> > > > > > >> > > https://mxnet.incubator.apache.org/install/index.html
>> > > > > >> > > > > > >> > >
>> > > > > >> > > > > > >> > > (Note: The README.md points to the 1.2.0 tag and does not work at the
>> > > > > >> > > > > > >> > > moment.)
>> > > > > >> > > > > > >> > >
>> > > > > >> > > > > > >> > > Please remember to test first before voting accordingly:
>> > > > > >> > > > > > >> > >
>> > > > > >> > > > > > >> > > +1 = approve
>> > > > > >> > > > > > >> > > +0 = no opinion
>> > > > > >> > > > > > >> > > -1 = disapprove (provide reason)
(provide reason)
>> > > > > >> > > > > > >> > >
>> > > > > >> > > > > > >> > > Anirudh
>> > > > > >> > > > > > >> > >
>> > > > > >> > > > > > >> >
>> > > > > >> > > > > > >>
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > >
>> > > > > >> > > > >
>> > > > > >> > > >
>> > > > > >> > >
>> > > > > >> >
>> > > > > >>
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
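
Editor's note: the repeated test run quoted in the thread (fixed seeds, 10000
iterations of test_forward_reshape) and Pedro's suggestion to drop the fixed
seeds could be scripted roughly as below. This is only a sketch, not something
posted by anyone in the thread. The environment variable names, values, and the
nosetests target are copied from the quoted commands; the helper names
(build_repro_env, build_repro_command, run_repro) are hypothetical, and the
script assumes it is pointed at an already-built MXNet source checkout with
nosetests-2.7 on the PATH. How MXNet's test harness interprets these variables
is not verified here.

```python
import os
import subprocess

# Values copied from the commands quoted in the thread.
TEST_TARGET = "tests/python/unittest/test_module.py:test_forward_reshape"
FIXED_SEEDS = {"MXNET_TEST_SEED": "11", "MXNET_MODULE_SEED": "812478194"}


def build_repro_env(base_env, test_count=10000, fixed_seeds=True):
    """Return a copy of base_env with the variables used in the thread.

    With fixed_seeds=False the two seed variables are removed, which is
    one way to follow the suggestion to run without fixed seeds.
    """
    env = dict(base_env)
    env["MXNET_STORAGE_FALLBACK_LOG_VERBOSE"] = "0"
    env["MXNET_TEST_COUNT"] = str(test_count)
    for name, value in FIXED_SEEDS.items():
        if fixed_seeds:
            env[name] = value
        else:
            env.pop(name, None)
    return env


def build_repro_command(runner="nosetests-2.7"):
    """Return the test command from the thread as an argument list."""
    return [runner, "-v", TEST_TARGET]


def run_repro(mxnet_root, **kwargs):
    """Run the test inside an MXNet checkout and return its exit code."""
    env = build_repro_env(os.environ, **kwargs)
    return subprocess.call(build_repro_command(), cwd=mxnet_root, env=env)
```

For example, run_repro("/path/to/incubator-mxnet", fixed_seeds=False) would
launch the same test without the seed variables; the path is a placeholder.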
