mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@googlemail.com>
Subject Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2
Date Mon, 07 May 2018 20:16:33 GMT
Sorry everybody, but it seems our ARM64/Jetson build was just broken by the
maintainers of 'dockcross', the base Docker image we use to cross-compile
for ARM64 (Jetson specifically). They merged a PR two days ago [1] that
broke our build pipeline for Jetson devices (the OpenBLAS dependency, to be
specific). Releasing MXNet in its current state would mean releasing it
non-buildable for Jetson devices.

Our CI has not caught this yet because the base image is cached on all of
our slaves. We do this on purpose to keep the environment consistent, so
that our entire CI does not suddenly break because of third-party updates
like this one. I just discovered the problem on our test environment, which
runs without caches. To track this, I have created an issue at [2].
Unfortunately, this was unavoidable, since the project does not maintain any
tagging or versioning scheme for its Dockerfiles [3]; instead, changes are
pushed to production automatically.

-1 from my side until this has been resolved.

-Marco

[1]: https://github.com/dockcross/dockcross/pull/221
[2]: https://github.com/apache/incubator-mxnet/issues/10837
[3]: https://microbadger.com/images/dockcross/linux-arm64
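
The failure mode described above (an unversioned base image changing underneath a build) can be checked for mechanically. The following is a minimal sketch, assuming image references are available as plain strings; `classify_image_ref` and its three labels are hypothetical names for illustration and are not part of MXNet's CI scripts or of dockcross. Only a digest reference is immutable; a named tag can still be re-pushed by the image owner.

```python
import re

# Hypothetical helper for illustration; not part of MXNet's CI or dockcross.
# A digest reference ends in "@sha256:" followed by 64 lowercase hex digits.
_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def classify_image_ref(ref: str) -> str:
    """Classify a Docker image reference as 'digest', 'tag', or 'floating'.

    'digest'   -> pinned to immutable content (name@sha256:...)
    'tag'      -> pinned to a named tag other than 'latest' (still mutable)
    'floating' -> no tag, or 'latest': follows whatever was pushed last
    """
    if _DIGEST.search(ref):
        return "digest"
    # Only the last path component can carry a tag; a colon earlier in the
    # reference belongs to a registry port such as localhost:5000.
    last_component = ref.rsplit("/", 1)[-1]
    _name, sep, tag = last_component.partition(":")
    if sep and tag and tag != "latest":
        return "tag"
    return "floating"


if __name__ == "__main__":
    # The digest below is a placeholder, not a real dockcross image digest.
    for ref in (
        "dockcross/linux-arm64",
        "dockcross/linux-arm64:latest",
        "dockcross/linux-arm64@sha256:" + "0" * 64,
    ):
        print(ref, "->", classify_image_ref(ref))
```

A CI job could run such a check over its Dockerfiles' `FROM` lines and warn on anything classified as floating, which would apply to an image that publishes no versioned tags.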


On Mon, May 7, 2018 at 7:38 PM, Haibin Lin <haibin.lin.aws@gmail.com> wrote:

> +1 binding. Built from source with CUDA, ran the linear classification
> example, and it works fine.
>
> Best.
> Haibin
>
>
> On Sun, May 6, 2018 at 10:08 PM, Steffen Rochel <steffenrochel@gmail.com>
> wrote:
>
> > +1 (non-binding). Tested with selected notebooks from The Straight Dope.
> > So many important enhancements everybody contributed and our users are
> > waiting for. Hope we will see more votes.
> > Steffen
> > On Mon, May 7, 2018 at 1:07 AM Anirudh <anirudh2290@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Since we don't have enough binding votes yet, I am extending the vote
> > > till tomorrow (Monday May 7th), 12:50 PM PDT.
> > >
> > > Anirudh
> > >
> > > On Sun, May 6, 2018 at 4:05 PM, Anirudh <anirudh2290@gmail.com> wrote:
> > >
> > > > Hi Pedro,
> > > >
> > > > > Thanks for the clarification. I was able to reproduce the issue with
> > > > > USE_OPENMP=OFF. I wasn't able to reproduce the issue with Make. Since
> > > > > the issue is not reproducible with Make and the number of customers
> > > > > using USE_OPENMP=OFF with CMake should be small, I agree with you
> > > > > that this should not be a blocker. I have added the issue to the
> > > > > known issues in the release notes:
> > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > > >
> > > > Anirudh
> > > >
> > > > > On Sun, May 6, 2018 at 9:03 AM, Pedro Larroy
> > > > > <pedro.larroy.lists@gmail.com> wrote:
> > > > >
> > > > >> Agreed, I was not aware that the problems were not present in the
> > > > >> release branch.
> > > >>
> > > > >> On Fri, May 4, 2018 at 8:32 PM, Haibin Lin <haibin.lin.aws@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > I agree with Anirudh that the focus of the discussion should be
> > > > >> > limited to the release branch, not the master branch. Anything
> > > > >> > that breaks on master but works on the release branch should not
> > > > >> > block the release itself.
> > > >> >
> > > >> >
> > > >> > Best,
> > > >> >
> > > >> > Haibin
> > > >> >
> > > > >> > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy
> > > > >> > <pedro.larroy.lists@gmail.com> wrote:
> > > > >> >
> > > > >> > > I see your point.
> > > > >> > >
> > > > >> > > I checked the failures on the v1.2.0 branch and I don't see
> > > > >> > > segfaults, just minor failures due to flaky tests.
> > > >> > >
> > > >> > > I will trigger it repeatedly a few times until Sunday to
have a
> > and
> > > >> > change
> > > >> > > my vote accordingly.
> > > >> > >
> > > >> > >
> > > > >> > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
> > > > >> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/17/pipeline
> > > > >> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/15/pipeline/
> > > >> > >
> > > >> > >
> > > >> > > Pedro.
> > > >> > >
> > > > >> > > On Fri, May 4, 2018 at 7:16 PM, Anirudh <anirudh2290@gmail.com>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Hi Pedro,
> > > > >> > > >
> > > > >> > > > Thank you for the suggestions. I will try to reproduce this
> > > > >> > > > without fixed seeds and also run it for a longer duration.
> > > > >> > > > Having said that, running unit tests over and over for a
> > > > >> > > > couple of days will likely cause problems, because there are
> > > > >> > > > around 42 open issues for flaky tests:
> > > > >> > > > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> > > > >> > > > Also, the release branch diverged from master around 3 weeks
> > > > >> > > > back and it doesn't have many of the changes merged to master.
> > > > >> > > > So, my question essentially is: what will be your benchmark
> > > > >> > > > to accept the release?
> > > > >> > > > Is it that we run the test which you provided on 1.2 without
> > > > >> > > > fixed seeds and for a longer duration without failures?
> > > > >> > > > Or is it that all unit tests should pass over a period of 2
> > > > >> > > > days without issues? This may require fixing all of the flaky
> > > > >> > > > tests, which would delay the release by a considerable amount
> > > > >> > > > of time.
> > > > >> > > > Or is it something else?
> > > >> > > >
> > > >> > > > Anirudh
> > > >> > > >
> > > >> > > >
> > > > >> > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy
> > > > >> > > > <pedro.larroy.lists@gmail.com> wrote:
> > > > >> > > >
> > > > >> > > > > Could you remove the fixed seeds and run it for a couple of
> > > > >> > > > > hours with an additional loop? Also, I would suggest running
> > > > >> > > > > the unit tests over and over for a couple of days if
> > > > >> > > > > possible.
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > Pedro.
> > > >> > > > >
> > > > >> > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh <anirudh2290@gmail.com>
> > > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > Hi Pedro and Naveen,
> > > >> > > > > >
> > > > >> > > > > > I am able to reproduce this issue with MKLDNN on master,
> > > > >> > > > > > but not on the 1.2.RC2 branch.
> > > >> > > > > >
> > > >> > > > > > Did the following on 1.2.RC2 branch:
> > > >> > > > > >
> > > > >> > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas \
> > > > >> > > > > >     USE_DIST_KVSTORE=0 USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > >> > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > >> > > > > > export MXNET_TEST_SEED=11
> > > > >> > > > > > export MXNET_MODULE_SEED=812478194
> > > > >> > > > > > export MXNET_TEST_COUNT=10000
> > > > >> > > > > > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
> > > >> > > > > >
> > > >> > > > > > Was able to do the 10k runs successfully.
> > > >> > > > > >
> > > >> > > > > > Anirudh
> > > >> > > > > >
> > > > >> > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh <anirudh2290@gmail.com>
> > > > >> > > > > > wrote:
> > > > >> > > > > >
> > > > >> > > > > > > Hi Pedro and Naveen,
> > > > >> > > > > > >
> > > > >> > > > > > > Is this issue reproducible when MXNet is built with
> > > > >> > > > > > > USE_MKLDNN=0?
> > > > >> > > > > > > Also, there are a bunch of MKLDNN fixes that didn't go
> > > > >> > > > > > > into the release branch. Is this issue reproducible on
> > > > >> > > > > > > the release branch?
> > > > >> > > > > > > In my opinion, since we have marked MKLDNN as an
> > > > >> > > > > > > experimental feature for the release, if it is confirmed
> > > > >> > > > > > > to be an MKLDNN issue we don't need to block the release
> > > > >> > > > > > > on it.
> > > >> > > > > > >
> > > >> > > > > > > Anirudh
> > > >> > > > > > >
> > > > >> > > > > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy
> > > > >> > > > > > > <mnnaveen@gmail.com> wrote:
> > > > >> > > > > > >
> > > > >> > > > > > >> Thanks for raising this issue Pedro.
> > > > >> > > > > > >>
> > > > >> > > > > > >> -1(binding)
> > > > >> > > > > > >>
> > > > >> > > > > > >> We were in a similar state for a while a year ago; a lot
> > > > >> > > > > > >> of effort went into stabilizing the tests and the CI. I
> > > > >> > > > > > >> have seen that the PR builds are non-deterministic and
> > > > >> > > > > > >> you have to retry over and over (wasting resources and
> > > > >> > > > > > >> time) and hope you get lucky.
> > > > >> > > > > > >>
> > > > >> > > > > > >> Look at the dashboard for the master build:
> > > > >> > > > > > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> > > >> > > > > > >>
> > > >> > > > > > >> -Naveen
> > > >> > > > > > >>
> > > > >> > > > > > >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy
> > > > >> > > > > > >> <pedro.larroy.lists@gmail.com> wrote:
> > > > >> > > > > > >>
> > > > >> > > > > > >> > -1 nondeterministic failures on CI master:
> > > > >> > > > > > >> > https://issues.apache.org/jira/browse/MXNET-396
> > > > >> > > > > > >> >
> > > > >> > > > > > >> > Was able to reproduce once on a fresh p3 instance with
> > > > >> > > > > > >> > DLAMI; can't reproduce consistently.
> > > >> > > > > > >> >
> > > > >> > > > > > >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh
> > > > >> > > > > > >> > <anirudh2290@gmail.com> wrote:
> > > > >> > > > > > >> >
> > > > >> > > > > > >> > > Hi all,
> > > > >> > > > > > >> > >
> > > > >> > > > > > >> > > As part of RC2 release, we have addressed bugs and
> > > > >> > > > > > >> > > some concerns that were raised.
> > > > >> > > > > > >> > >
> > > > >> > > > > > >> > > I would like to propose a vote to release Apache
> > > > >> > > > > > >> > > MXNet (incubating) version 1.2.0.RC2. Voting will
> > > > >> > > > > > >> > > start now (Wednesday, May 2nd) and end at 12:50 PM
> > > > >> > > > > > >> > > PDT, Sunday, May 6th.
> > > > >> > > > > > >> > >
> > > > >> > > > > > >> > > Link to release notes:
> > > > >> > > > > > >> > > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > > > >> > > > > > >> > >
> > > > >> > > > > > >> > > Link to release candidate 1.2.0.rc2:
> > > > >> > > > > > >> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > > > >> > > > > > >> > >
> > > > >> > > > > > >> > > Voting results for 1.2.0.rc2:
> > > > >> > > > > > >> > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > > > >> > > > > > >> > >
> > > > >> > > > > > >> > > View this page, click on "Build from Source", and
> > > > >> > > > > > >> > > use the source code obtained from 1.2.0.rc2 tag:
> > > > >> > > > > > >> > > https://mxnet.incubator.apache.org/install/index.html
> > > > >> > > > > > >> > >
> > > > >> > > > > > >> > > (Note: The README.md points to the 1.2.0 tag and
> > > > >> > > > > > >> > > does not work at the moment.)
> > > > >> > > > > > >> > >
> > > > >> > > > > > >> > > Please remember to test first before voting
> > > > >> > > > > > >> > > accordingly:
> > > > >> > > > > > >> > >
> > > > >> > > > > > >> > > +1 = approve
> > > > >> > > > > > >> > > +0 = no opinion
> > > > >> > > > > > >> > > -1 = disapprove (provide reason)
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > Anirudh
> > > >> > > > > > >> > >
> > > >> > > > > > >> >
> > > >> > > > > > >>
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>
