mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@googlemail.com>
Subject Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2
Date Tue, 08 May 2018 13:37:22 GMT
Yes, sorry for the inconvenience! We fixed the root cause and everything
should be back to normal.

-Marco

Steffen Rochel <steffenrochel@gmail.com> wrote on Tue, May 8, 2018, 14:59:

> Marco - thanks for your efforts. Does this unblock the Apache MXNet v1.2
> release and change your vote?
>
> On Tue, May 8, 2018 at 3:00 AM Marco de Abreu <marco.g.abreu@googlemail.com> wrote:
>
> > Small update regarding the ARM64 builds. I have created two pull requests
> > [1][2] which changed the repository to a mirror I created. This mirror was
> > created using a cached version of the working Docker image, effectively
> > reverting the state back to a working one. At the same time, this pins the
> > container to prevent any further problems.
> >
> > I would prefer to use the public repository instead of our own mirror, but
> > for now, this is unavoidable. If anybody would like to be added to the
> > Dockerhub organization "mxnetci", feel free to let me know! To prevent
> > problems like these in the future, I created a feature request at [3] to
> > ensure future releases of that Docker image are properly tagged.
> > Additionally, the creator of the failing PR is aware and actively involved
> > in creating a permanent solution [4].
> >
> > Best regards,
> > Marco
> >
> > [1]: https://github.com/apache/incubator-mxnet/pull/10850
> > [2]: https://github.com/apache/incubator-mxnet/pull/10849
> > [3]: https://github.com/dockcross/dockcross/issues/223
> > [4]: https://github.com/dockcross/dockcross/pull/221
> >
> > On Tue, May 8, 2018 at 2:39 AM, Lai Wei <royweilai@gmail.com> wrote:
> >
> > > Hi Anirudh,
> > >
> > > Update: Did an install on a fresh instance with USE_MKLDNN=1, works fine
> > > now. Pip install with --pre is also working fine.
> > > The problem was the mkl-dnn I had installed on the old instance.
> > > Closing the issue <https://github.com/awslabs/keras-apache-mxnet/issues/75>.
> > >
> > > Thanks!
> > >
> > > Best Regards
> > >
> > > Lai Wei
> > >
> > > https://www.linkedin.com/pub/lai-wei/2b/731/52b
> > >
> > > On Mon, May 7, 2018 at 2:48 PM, Lai Wei <royweilai@gmail.com> wrote:
> > >
> > > > Hi Anirudh,
> > > >
> > > > Yes, I also tried that, but it didn't resolve the problem. I am looking
> > > > into the root cause and will update.
> > > >
> > > > Best Regards
> > > >
> > > > Lai Wei
> > > >
> > > > https://www.linkedin.com/pub/lai-wei/2b/731/52b
> > > >
> > > > On Mon, May 7, 2018 at 2:15 PM, Anirudh <anirudh2290@gmail.com> wrote:
> > > >
> > > >> Hi Lai,
> > > >>
> > > >> I see that you used USE_MKL2017_EXPERIMENTAL=1. I am not sure if this is
> > > >> the right flag. Did you try USE_MKLDNN=1?
> > > >>
> > > >> Anirudh
> > > >>
> > > >> On Mon, May 7, 2018 at 1:22 PM, Lai Wei <royweilai@gmail.com> wrote:
> > > >>
> > > >> > Hi,
> > > >> >
> > > >> > I would like to raise an issue with mxnet-mkl. The keras-mxnet package
> > > >> > was working fine with mxnet-mkl 1.1.0 for training on CPU. However,
> > > >> > weights are not updated when I use mxnet-mkl 1.2.0b20180507. I tried
> > > >> > both 'pip install mxnet-mkl --pre' and building from source from the
> > > >> > release branch (v1.2.0) with the mkl flag.
> > > >> >
> > > >> > Please refer to this issue for more details:
> > > >> > https://github.com/awslabs/keras-apache-mxnet/issues/75
> > > >> >
> > > >> > There is no code change on the keras-mxnet side, so I guess some API
> > > >> > broke when using the latest mxnet-mkl. Still working on finding the
> > > >> > root cause.
> > > >> >
> > > >> > Thanks
> > > >> >
> > > >> >
> > > >> > Best Regards
> > > >> >
> > > >> > Lai Wei
> > > >> >
> > > >> > https://www.linkedin.com/pub/lai-wei/2b/731/52b
> > > >> >
> > > >> > On Mon, May 7, 2018 at 10:38 AM, Haibin Lin <haibin.lin.aws@gmail.com> wrote:
> > > >> >
> > > >> > > +1 binding. Built from source with CUDA, ran the linear classification
> > > >> > > example, and it works fine.
> > > >> > >
> > > >> > > Best.
> > > >> > > Haibin
> > > >> > >
> > > >> > >
> > > >> > > On Sun, May 6, 2018 at 10:08 PM, Steffen Rochel <steffenrochel@gmail.com> wrote:
> > > >> > >
> > > >> > > > +1 (non-binding). Tested with selected notebooks from The Straight Dope.
> > > >> > > > So many important enhancements everybody contributed and our users are
> > > >> > > > waiting for. Hope we will see more votes.
> > > >> > > > Steffen
> > > >> > > > On Mon, May 7, 2018 at 1:07 AM Anirudh <anirudh2290@gmail.com> wrote:
> > > >> > > >
> > > >> > > > > Hi all,
> > > >> > > > >
> > > >> > > > > Since we don't have enough binding votes yet, I am extending the vote
> > > >> > > > > until tomorrow (Monday, May 7th), 12:50 PM PDT.
> > > >> > > > >
> > > >> > > > > Anirudh
> > > >> > > > >
> > > >> > > > > On Sun, May 6, 2018 at 4:05 PM, Anirudh <anirudh2290@gmail.com> wrote:
> > > >> > > > >
> > > >> > > > > > Hi Pedro,
> > > >> > > > > >
> > > >> > > > > > Thanks for the clarification. I was able to reproduce the issue with
> > > >> > > > > > USE_OPENMP=OFF. I wasn't able to reproduce the issue with Make. Since the
> > > >> > > > > > issue is not reproducible with make and the number of customers using
> > > >> > > > > > USE_OPENMP=OFF with cmake should be small, I agree with you that this
> > > >> > > > > > should not be a blocker. I have added the issue to the known issues in
> > > >> > > > > > the release notes:
> > > >> > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > > >> > > > > >
> > > >> > > > > > Anirudh
> > > >> > > > > >
> > > >> > > > > > On Sun, May 6, 2018 at 9:03 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > >> > > > > >
> > > >> > > > > >> Agreed, I was not aware that the problems were not present in the
> > > >> > > > > >> release branch.
> > > >> > > > > >>
> > > >> > > > > >> On Fri, May 4, 2018 at 8:32 PM, Haibin Lin <haibin.lin.aws@gmail.com> wrote:
> > > >> > > > > >>
> > > >> > > > > >> > I agree with Anirudh that the focus of the discussion should be limited to
> > > >> > > > > >> > the release branch, not the master branch. Anything that breaks on master
> > > >> > > > > >> > but works on the release branch should not block the release itself.
> > > >> > > > > >> >
> > > >> > > > > >> >
> > > >> > > > > >> > Best,
> > > >> > > > > >> >
> > > >> > > > > >> > Haibin
> > > >> > > > > >> >
> > > >> > > > > >> > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > >> > > > > >> >
> > > >> > > > > >> > > I see your point.
> > > >> > > > > >> > >
> > > >> > > > > >> > > I checked the failures on the v1.2.0 branch and I don't see segfaults,
> > > >> > > > > >> > > just minor failures due to flaky tests.
> > > >> > > > > >> > >
> > > >> > > > > >> > > I will trigger it repeatedly a few times until Sunday to have a and
> > > >> > > > > >> > > change my vote accordingly.
> > > >> > > > > >> > >
> > > >> > > > > >> > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
> > > >> > > > > >> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/17/pipeline
> > > >> > > > > >> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/15/pipeline/
> > > >> > > > > >> > >
> > > >> > > > > >> > >
> > > >> > > > > >> > > Pedro.
> > > >> > > > > >> > >
> > > >> > > > > >> > > On Fri, May 4, 2018 at 7:16 PM, Anirudh <anirudh2290@gmail.com> wrote:
> > > >> > > > > >> > >
> > > >> > > > > >> > > > Hi Pedro,
> > > >> > > > > >> > > >
> > > >> > > > > >> > > > Thank you for the suggestions. I will try to reproduce this without fixed
> > > >> > > > > >> > > > seeds and also run it for a longer time duration.
> > > >> > > > > >> > > > Having said that, running unit tests over and over for a couple of days
> > > >> > > > > >> > > > will likely cause problems because there are around 42 open issues for
> > > >> > > > > >> > > > flaky tests:
> > > >> > > > > >> > > > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> > > >> > > > > >> > > > Also, the release branch diverged from master around 3 weeks ago and
> > > >> > > > > >> > > > it doesn't have many of the changes merged to master.
> > > >> > > > > >> > > > So, my question essentially is, what will be your benchmark to accept the
> > > >> > > > > >> > > > release?
> > > >> > > > > >> > > > Is it that we run the test which you provided on 1.2 without fixed seeds
> > > >> > > > > >> > > > and for a longer duration without failures?
> > > >> > > > > >> > > > Or is it that all unit tests should pass over a period of 2 days without
> > > >> > > > > >> > > > issues? This may require fixing all of the flaky tests, which would delay
> > > >> > > > > >> > > > the release by a considerable amount of time.
> > > >> > > > > >> > > > Or is it something else?
> > > >> > > > > >> > > >
> > > >> > > > > >> > > > Anirudh
> > > >> > > > > >> > > >
> > > >> > > > > >> > > >
> > > >> > > > > >> > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > >> > > > > >> > > >
> > > >> > > > > >> > > > > Could you remove the fixed seeds and run it for a couple of hours with an
> > > >> > > > > >> > > > > additional loop? Also I would suggest running the unit tests over and
> > > >> > > > > >> > > > > over for a couple of days if possible.
> > > >> > > > > >> > > > >
> > > >> > > > > >> > > > >
> > > >> > > > > >> > > > > Pedro.
> > > >> > > > > >> > > > >
> > > >> > > > > >> > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh <anirudh2290@gmail.com> wrote:
> > > >> > > > > >> > > > >
> > > >> > > > > >> > > > > > Hi Pedro and Naveen,
> > > >> > > > > >> > > > > >
> > > >> > > > > >> > > > > > I am able to reproduce this issue with MKLDNN on the master but not on
> > > >> > > > > >> > > > > > the 1.2.RC2 branch.
> > > >> > > > > >> > > > > >
> > > >> > > > > >> > > > > > Did the following on 1.2.RC2 branch:
> > > >> > > > > >> > > > > >
> > > >> > > > > >> > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > >> > > > > >> > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > >> > > > > >> > > > > > export MXNET_TEST_SEED=11
> > > >> > > > > >> > > > > > export MXNET_MODULE_SEED=812478194
> > > >> > > > > >> > > > > > export MXNET_TEST_COUNT=10000
> > > >> > > > > >> > > > > > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
> > > >> > > > > >> > > > > >
> > > >> > > > > >> > > > > > Was able to do the 10k runs successfully.
> > > >> > > > > >> > > > > >
> > > >> > > > > >> > > > > > Anirudh
> > > >> > > > > >> > > > > >
> > > >> > > > > >> > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh <anirudh2290@gmail.com> wrote:
> > > >> > > > > >> > > > > >
> > > >> > > > > >> > > > > > > Hi Pedro and Naveen,
> > > >> > > > > >> > > > > > >
> > > >> > > > > >> > > > > > > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > > >> > > > > >> > > > > > > Also, there are a bunch of MKLDNN fixes that didn't go into the release
> > > >> > > > > >> > > > > > > branch. Is this issue reproducible on the release branch?
> > > >> > > > > >> > > > > > > In my opinion, since we have marked MKLDNN as an experimental feature for
> > > >> > > > > >> > > > > > > the release, if it is confirmed to be an MKLDNN issue
> > > >> > > > > >> > > > > > > we don't need to block the release on it.
> > > >> > > > > >> > > > > > >
> > > >> > > > > >> > > > > > > Anirudh
> > > >> > > > > >> > > > > > >
> > > >> > > > > >> > > > > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy <mnnaveen@gmail.com> wrote:
> > > >> > > > > >> > > > > > >
> > > >> > > > > >> > > > > > >> Thanks for raising this issue Pedro.
> > > >> > > > > >> > > > > > >>
> > > >> > > > > >> > > > > > >> -1(binding)
> > > >> > > > > >> > > > > > >>
> > > >> > > > > >> > > > > > >> We were in a similar state for a while a year ago; a lot of effort went
> > > >> > > > > >> > > > > > >> into stabilizing the tests and the CI. I have seen that the PR builds are
> > > >> > > > > >> > > > > > >> non-deterministic and you have to retry over and over (wasting resources
> > > >> > > > > >> > > > > > >> and time) and hope you get lucky.
> > > >> > > > > >> > > > > > >>
> > > >> > > > > >> > > > > > >> Look at the dashboard for master build
> > > >> > > > > >> > > > > > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> > > >> > > > > >> > > > > > >>
> > > >> > > > > >> > > > > > >> -Naveen
> > > >> > > > > >> > > > > > >>
> > > >> > > > > >> > > > > > >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > >> > > > > >> > > > > > >>
> > > >> > > > > >> > > > > > >> > -1 nondeterministic failures on CI master:
> > > >> > > > > >> > > > > > >> > https://issues.apache.org/jira/browse/MXNET-396
> > > >> > > > > >> > > > > > >> >
> > > >> > > > > >> > > > > > >> > Was able to reproduce once in a fresh p3 instance with DLAMI, but can't
> > > >> > > > > >> > > > > > >> > reproduce consistently.
> > > >> > > > > >> > > > > > >> >
> > > >> > > > > >> > > > > > >> >
> > > >> > > > > >> > > > > > >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh <anirudh2290@gmail.com> wrote:
> > > >> > > > > >> > > > > > >> >
> > > >> > > > > >> > > > > > >> > > Hi all,
> > > >> > > > > >> > > > > > >> > >
> > > >> > > > > >> > > > > > >> > > As part of RC2 release, we have addressed bugs and some concerns that
> > > >> > > > > >> > > > > > >> > > were raised.
> > > >> > > > > >> > > > > > >> > >
> > > >> > > > > >> > > > > > >> > > I would like to propose a vote to release Apache MXNet (incubating)
> > > >> > > > > >> > > > > > >> > > version 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at
> > > >> > > > > >> > > > > > >> > > 12:50 PM PDT, Sunday, May 6th.
> > > >> > > > > >> > > > > > >> > >
> > > >> > > > > >> > > > > > >> > > Link to release notes:
> > > >> > > > > >> > > > > > >> > > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > > >> > > > > >> > > > > > >> > >
> > > >> > > > > >> > > > > > >> > > Link to release candidate 1.2.0.rc2:
> > > >> > > > > >> > > > > > >> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > > >> > > > > >> > > > > > >> > >
> > > >> > > > > >> > > > > > >> > > Voting results for 1.2.0.rc2:
> > > >> > > > > >> > > > > > >> > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > > >> > > > > >> > > > > > >> > >
> > > >> > > > > >> > > > > > >> > > View this page, click on "Build from Source", and use the source code
> > > >> > > > > >> > > > > > >> > > obtained from 1.2.0.rc2 tag:
> > > >> > > > > >> > > > > > >> > > https://mxnet.incubator.apache.org/install/index.html
> > > >> > > > > >> > > > > > >> > >
> > > >> > > > > >> > > > > > >> > > (Note: The README.md points to the 1.2.0 tag and does not work at the
> > > >> > > > > >> > > > > > >> > > moment.)
> > > >> > > > > >> > > > > > >> > >
> > > >> > > > > >> > > > > > >> > > Please remember to test first before voting accordingly:
> > > >> > > > > >> > > > > > >> > >
> > > >> > > > > >> > > > > > >> > > +1 = approve
> > > >> > > > > >> > > > > > >> > > +0 = no opinion
> > > >> > > > > >> > > > > > >> > > -1 = disapprove (provide reason)
> > > >> > > > > >> > > > > > >> > >
> > > >> > > > > >> > > > > > >> > > Anirudh
> > > >> > > > > >> > > > > > >> > >
> > > >> > > > > >> > > > > > >> >
> > > >> > > > > >> > > > > > >>
> > > >> > > > > >> > > > > > >
> > > >> > > > > >> > > > > > >
> > > >> > > > > >> > > > > >
> > > >> > > > > >> > > > >
> > > >> > > > > >> > > >
> > > >> > > > > >> > >
> > > >> > > > > >> >
> > > >> > > > > >>
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>
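
For reference, the build and test commands quoted in Anirudh's May 3 message above can be collected into one script. This is only a sketch: the flag values, environment variables, and test path are copied from that message, while the DRY_RUN switch, the JOBS variable, and the run() helper are assumptions added here so the script prints the commands by default instead of executing them.

```shell
#!/bin/sh
# Sketch of the reproduction steps quoted in the thread (1.2.RC2 branch).
# DRY_RUN, JOBS and run() are illustrative additions, not part of the thread.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Number of parallel build jobs; falls back to 1 if nproc is unavailable.
JOBS="$(nproc 2>/dev/null || echo 1)"

# Environment for the test run, values as quoted in the message.
export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
export MXNET_TEST_SEED=11
export MXNET_MODULE_SEED=812478194
export MXNET_TEST_COUNT=10000

# Build with MKLDNN enabled and CUDA disabled.
run make -j "$JOBS" USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 \
    USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1

# Run the single test named in the message.
run nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
```

Run from the root of an MXNet checkout with DRY_RUN=0 to actually build and test; unsetting MXNET_TEST_SEED and MXNET_MODULE_SEED would correspond to Pedro's suggestion of running without fixed seeds.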
