mxnet-dev mailing list archives

From Anirudh <anirudh2...@gmail.com>
Subject Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2
Date Sun, 06 May 2018 00:06:08 GMT
Hi Pedro,

Thank you for raising this issue! I am not able to reproduce this on Ubuntu
16.04 with CMake 3.5.1.
Could you please provide the reproduction steps for the issue?

Anirudh

On Sat, May 5, 2018 at 3:12 AM, Pedro Larroy <pedro.larroy.lists@gmail.com>
wrote:

> Actually I have a linking problem on my Ubuntu desktop that is fixed in
> master:
>
> lc::ThreadedIter<std::vector<dmlc::data::RowBlockContainer<unsigned int>,
> std::allocator<dmlc::data::RowBlockContainer<unsigned int> > >
> >::Init(std::function<bool
> (std::vector<dmlc::data::RowBlockContainer<unsigned int>,
> std::allocator<dmlc::data::RowBlockContainer<unsigned int> > >**)>,
> std::function<void ()>)::{lambda()#1}&)':
> /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> 3rdparty/dmlc-core/libdmlc.a(data.cc.o): In function
> `std::thread::thread<dmlc::ThreadedIter<std::vector<dmlc:
> :data::RowBlockContainer<unsigned
> long>, std::allocator<dmlc::data::RowBlockContainer<unsigned long> > >
> >::Init(std::function<bool
> (std::vector<dmlc::data::RowBlockContainer<unsigned long>,
> std::allocator<dmlc::data::RowBlockContainer<unsigned long> > >**)>,
> std::function<void
> ()>)::{lambda()#1}&>(dmlc::ThreadedIter<std::vector<dmlc:
> :data::RowBlockContainer<unsigned
> long>, std::allocator<dmlc::data::RowBlockContainer<unsigned long> > >
> >::Init(std::function<bool
> (std::vector<dmlc::data::RowBlockContainer<unsigned long>,
> std::allocator<dmlc::data::RowBlockContainer<unsigned long> > >**)>,
> std::function<void ()>)::{lambda()#1}&)':
> /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> 3rdparty/dmlc-core/libdmlc.a(data.cc.o): In function
> `std::thread::thread<dmlc::ThreadedIter<dmlc::data::
> RowBlockContainer<unsigned
> int> >::Init(std::function<bool (dmlc::data::RowBlockContainer<unsigned
> int>**)>, std::function<void
> ()>)::{lambda()#1}&>(dmlc::ThreadedIter<dmlc::data::
> RowBlockContainer<unsigned
> int> >::Init(std::function<bool (dmlc::data::RowBlockContainer<unsigned
> int>**)>, std::function<void ()>)::{lambda()#1}&)':
> /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> 3rdparty/dmlc-core/libdmlc.a(data.cc.o): In function
> `std::thread::thread<dmlc::ThreadedIter<dmlc::data::
> RowBlockContainer<unsigned
> long> >::Init(std::function<bool (dmlc::data::RowBlockContainer<unsigned
> long>**)>, std::function<void
> ()>)::{lambda()#1}&>(dmlc::ThreadedIter<dmlc::data::
> RowBlockContainer<unsigned
> long> >::Init(std::function<bool (dmlc::data::RowBlockContainer<unsigned
> long>**)>, std::function<void ()>)::{lambda()#1}&)':
> /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> 3rdparty/dmlc-core/libdmlc.a(io.cc.o): In function
> `std::thread::thread<dmlc::ThreadedIter<dmlc::io::
> InputSplitBase::Chunk>::Init(std::function<bool
> (dmlc::io::InputSplitBase::Chunk**)>, std::function<void
> ()>)::{lambda()#1}&>(dmlc::ThreadedIter<dmlc::io::
> InputSplitBase::Chunk>::Init(std::function<bool
> (dmlc::io::InputSplitBase::Chunk**)>, std::function<void
> ()>)::{lambda()#1}&)':
> /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> collect2: error: ld returned 1 exit status
> ninja: build stopped: subcommand failed.
>
>
> Can we update dmlc-core on the release branch? This was recently fixed:
> https://github.com/dmlc/dmlc-core/commit/b744643f386660ddc39467a04e3a98853a7419b9
>
> On Sat, May 5, 2018 at 11:59 AM, Pedro Larroy <
> pedro.larroy.lists@gmail.com>
> wrote:
>
> > Hi
> >
> > Looks like only gluon test lambda is failing intermittently, but looks
> > like a minor numerical issue.
> >
> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/20/pipeline
> >
> > I triggered a few builds yesterday and they all passed. I think Anirudh is
> > right.
> >
> > Changing my vote to +1 (non binding).
> >
> >
> > Pedro.
> >
> >
> >
> > On Sat, May 5, 2018 at 12:10 AM, Jun Wu <wujun.nju@gmail.com> wrote:
> >
> >> +1
> >> I built from source and ran all the model quantization examples
> >> successfully.
> >>
> >> On Fri, May 4, 2018 at 3:05 PM, Anirudh <anirudh2290@gmail.com> wrote:
> >>
> >> > Hi Pedro, Haibin, Indhu,
> >> >
> >> > Thank you for your inputs on the release. I ran the test
> >> > `test_module.py:test_forward_reshape` 250k times with different seeds.
> >> > I was unable to reproduce the issue on the release branch.
> >> > If everything goes well with the CI tests Pedro is running until Sunday, I think
> >> > we should move forward with the release (given that we have enough +1s).
> >> > Is it possible to trigger the CI on the 1.2 branch repeatedly or on a fixed
> >> > schedule until Sunday?
> >> >
> >> > Anirudh
> >> >
> >> > On Fri, May 4, 2018 at 11:56 AM, Indhu <indhubharathi@gmail.com> wrote:
> >> >
> >> > > +1
> >> > >
> >> > > I've been using a CUDA build from this branch (built from source) on Ubuntu
> >> > > for a couple of days now and I haven't seen any issues.
> >> > >
> >> > > The flaky tests need to be fixed, but this release need not be blocked for
> >> > > that.
> >> > >
> >> > >
> >> > > On Fri, May 4, 2018 at 11:32 AM, Haibin Lin <haibin.lin.aws@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > I agree with Anirudh that the focus of the discussion should be limited to
> >> > > > the release branch, not the master branch. Anything that breaks on master
> >> > > > but works on release branch should not block the release itself.
> >> > > >
> >> > > >
> >> > > > Best,
> >> > > >
> >> > > > Haibin
> >> > > >
> >> > > > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <
> >> > > > pedro.larroy.lists@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > > > I see your point.
> >> > > > >
> >> > > > > I checked the failures on the v1.2.0 branch and I don't see segfaults, just
> >> > > > > minor failures due to flaky tests.
> >> > > > >
> >> > > > > I will trigger it repeatedly a few times until Sunday to have a and change
> >> > > > > my vote accordingly.
> >> > > > >
> >> > > > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
> >> > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/17/pipeline
> >> > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/15/pipeline/
> >> > > > >
> >> > > > >
> >> > > > > Pedro.
> >> > > > >
> >> > > > > On Fri, May 4, 2018 at 7:16 PM, Anirudh <anirudh2290@gmail.com> wrote:
> >> > > > >
> >> > > > > > Hi Pedro,
> >> > > > > >
> >> > > > > > Thank you for the suggestions. I will try to reproduce this without fixed
> >> > > > > > seeds and also run it for a longer time duration.
> >> > > > > > Having said that, running unit tests over and over for a couple of days
> >> > > > > > will likely cause problems because there are around 42 open issues for flaky tests:
> >> > > > > > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> >> > > > > > Also, the release branch diverged from master around 3 weeks back, and
> >> > > > > > it doesn't have many of the changes merged to master.
> >> > > > > > So, my question essentially is: what will be your benchmark to accept the
> >> > > > > > release?
> >> > > > > > Is it that we run the test which you provided on 1.2 without fixed seeds
> >> > > > > > and for a longer duration without failures?
> >> > > > > > Or is it that all unit tests should pass over a period of 2 days without
> >> > > > > > issues? This may require fixing all of the flaky tests, which would delay
> >> > > > > > the release by a considerable amount of time.
> >> > > > > > Or is it something else?
> >> > > > > >
> >> > > > > > Anirudh
> >> > > > > >
> >> > > > > >
> >> > > > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <pedro.larroy.lists@gmail.com>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Could you remove the fixed seeds and run it for a couple of hours with an
> >> > > > > > > additional loop?  Also I would suggest running the unit tests over and over
> >> > > > > > > for a couple of days if possible.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > Pedro.
> >> > > > > > >
> >> > > > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh <anirudh2290@gmail.com> wrote:
> >> > > > > > >
> >> > > > > > > > Hi Pedro and Naveen,
> >> > > > > > > >
> >> > > > > > > > I am able to reproduce this issue with MKLDNN on master, but not on
> >> > > > > > > > the 1.2.RC2 branch.
> >> > > > > > > >
> >> > > > > > > > Did the following on 1.2.RC2 branch:
> >> > > > > > > >
> >> > > > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> >> > > > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> >> > > > > > > > export MXNET_TEST_SEED=11
> >> > > > > > > > export MXNET_MODULE_SEED=812478194
> >> > > > > > > > export MXNET_TEST_COUNT=10000
> >> > > > > > > > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
> >> > > > > > > >
> >> > > > > > > > Was able to do the 10k runs successfully.
> >> > > > > > > >
> >> > > > > > > > Anirudh
> >> > > > > > > >
> >> > > > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh <anirudh2290@gmail.com> wrote:
> >> > > > > > > >
> >> > > > > > > > > Hi Pedro and Naveen,
> >> > > > > > > > >
> >> > > > > > > > > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> >> > > > > > > > > Also, there are a bunch of MKLDNN fixes that didn't go into the release
> >> > > > > > > > > branch. Is this issue reproducible on the release branch?
> >> > > > > > > > > In my opinion, since we have marked MKLDNN as an experimental feature for
> >> > > > > > > > > the release, if it is confirmed to be an MKLDNN issue
> >> > > > > > > > > we don't need to block the release on it.
> >> > > > > > > > >
> >> > > > > > > > > Anirudh
> >> > > > > > > > >
> >> > > > > > > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy <mnnaveen@gmail.com> wrote:
> >> > > > > > > > >
> >> > > > > > > > >> Thanks for raising this issue Pedro.
> >> > > > > > > > >>
> >> > > > > > > > >> -1(binding)
> >> > > > > > > > >>
> >> > > > > > > > >> We were in a similar state for a while a year ago, a lot of effort went to
> >> > > > > > > > >> stabilize the tests and the CI. I have seen the PR builds are
> >> > > > > > > > >> non-deterministic and you have to retry over and over (wasting resources
> >> > > > > > > > >> and time) and hope you get lucky.
> >> > > > > > > > >>
> >> > > > > > > > >> Look at the dashboard for master build
> >> > > > > > > > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> >> > > > > > > > >>
> >> > > > > > > > >> -Naveen
> >> > > > > > > > >>
> >> > > > > > > > >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <pedro.larroy.lists@gmail.com>
> >> > > > > > > > >> wrote:
> >> > > > > > > > >>
> >> > > > > > > > >> > -1  nondeterministic failures on CI master:
> >> > > > > > > > >> > https://issues.apache.org/jira/browse/MXNET-396
> >> > > > > > > > >> >
> >> > > > > > > > >> > Was able to reproduce once in a fresh p3 instance with DLAMI; can't
> >> > > > > > > > >> > reproduce consistently.
> >> > > > > > > > >> >
> >> > > > > > > > >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh <anirudh2290@gmail.com> wrote:
> >> > > > > > > > >> >
> >> > > > > > > > >> > > Hi all,
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > As part of RC2 release, we have addressed bugs and some concerns that were
> >> > > > > > > > >> > > raised.
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > I would like to propose a vote to release Apache MXNet (incubating) version
> >> > > > > > > > >> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at 12:50 PM
> >> > > > > > > > >> > > PDT, Sunday, May 6th.
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > Link to release notes:
> >> > > > > > > > >> > > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > Link to release candidate 1.2.0.rc2:
> >> > > > > > > > >> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > Voting results for 1.2.0.rc2:
> >> > > > > > > > >> > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > View this page, click on "Build from Source", and use the source code
> >> > > > > > > > >> > > obtained from 1.2.0.rc2 tag:
> >> > > > > > > > >> > > https://mxnet.incubator.apache.org/install/index.html
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > (Note: The README.md points to the 1.2.0 tag and does not work at the
> >> > > > > > > > >> > > moment.)
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > Please remember to test first before voting accordingly:
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > +1 = approve
> >> > > > > > > > >> > > +0 = no opinion
> >> > > > > > > > >> > > -1 = disapprove (provide reason)
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > Anirudh
> >> > > > > > > > >> > >
> >> > > > > > > > >> >
> >> > > > > > > > >>
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

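
The reproduction attempts discussed in the thread (running `test_forward_reshape` many times, first with the fixed seeds from the quoted commands and then with varying seeds) could be scripted roughly as follows. This is only a minimal sketch, assuming `nosetests-2.7` is on the PATH and that the test honors the `MXNET_MODULE_SEED` and `MXNET_TEST_SEED` environment variables as in the quoted commands. The helper names (`build_env`, `run_with_seeds`, `random_seeds`) and the count of 100 runs are hypothetical, chosen for illustration; it is not the script anyone in the thread actually ran.

```python
import os
import random
import subprocess

# Runner and test id are taken from the commands quoted in the thread.
RUNNER = "nosetests-2.7"
TEST_ID = "tests/python/unittest/test_module.py:test_forward_reshape"


def build_env(base_env, module_seed, test_seed):
    """Return a copy of base_env with the seed variables set for one run."""
    env = dict(base_env)
    env["MXNET_STORAGE_FALLBACK_LOG_VERBOSE"] = "0"
    env["MXNET_MODULE_SEED"] = str(module_seed)
    env["MXNET_TEST_SEED"] = str(test_seed)
    return env


def random_seeds(n, rng=None):
    """Generate n (module_seed, test_seed) pairs (hypothetical helper)."""
    rng = rng or random.Random()
    return [(rng.randrange(2**31), rng.randrange(2**31)) for _ in range(n)]


def run_with_seeds(seeds, runner=RUNNER, test_id=TEST_ID, run=subprocess.call):
    """Run the test once per seed pair and return the pairs that failed."""
    failures = []
    for module_seed, test_seed in seeds:
        env = build_env(os.environ, module_seed, test_seed)
        # A non-zero exit code from the runner is treated as a failure.
        if run([runner, "-v", test_id], env=env) != 0:
            failures.append((module_seed, test_seed))
    return failures


if __name__ == "__main__":
    failed = run_with_seeds(random_seeds(100))
    print("failing (module_seed, test_seed) pairs:", failed)
```

Recording the failing seed pairs means a flaky failure can be re-run later with the same `MXNET_MODULE_SEED` and `MXNET_TEST_SEED` values.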