mxnet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Roshani Nagmote <roshaninagmo...@gmail.com>
Subject Re: [VOTE] Release MXNet version 1.3.0.RC0
Date Wed, 05 Sep 2018 16:57:01 GMT
I believe everyone here is working hard to make MXNet a better framework
for users. It's completely okay to have different opinions, we can decide
together if this issue is a blocker or not after voting time is over.

As I mentioned before, voting will end at 7 pm today. So there is still
time to test the release. If there are any other issues anyone finds, I
will be happy to start the process again and work on RC1. For now, I want
to encourage everyone to utilize this time and vote. :)

Thanks,
Roshani

On Tue, Sep 4, 2018 at 10:35 PM sandeep krishnamurthy <
sandeep.krishna98@gmail.com> wrote:

>    1. As a Apache MXNet community member, I raised the concern of broken
>    functionality for the user. I explained and provided the data points on
> the
>    issue, workaround and why I think it is important. If after all this,
> you
>    think my vote is biased on my employer just because a user I quoted is
> from
>    Amazon, this is more concerning to me on my voting abilities.
>    2. My -1 no where undermines the huge amount of effort that goes behind
>    the scene for a release to happen. Great respect and recognition for
>    everyone involved in all the releases of MXNet in the past and this. I
>    voted on my judgement of what may be good for the users of MXNet.
>    3. As pointed by Naveen & Chris, -1 are NOT veto. Feel free to decide
>    and progress on the release as we already have >3 +1 in this thread.
>
>
> Best,
>
> Sandeep
>
> On Tue, Sep 4, 2018 at 8:29 PM Chris Olivier <cjolivier01@gmail.com>
> wrote:
>
> > btw, there are no vetoes on package releases:
> >
> > VOTES ON PACKAGE RELEASES
> > <https://www.apache.org/foundation/voting.html#ReleaseVotes>
> >
> > Votes on whether a package is ready to be released use majority approval
> > <https://www.apache.org/foundation/glossary.html#MajorityApproval> --
> i.e.
> > at least three PMC members must vote affirmatively for release, and there
> > must be more positive than negative votes.Releases may not be vetoed.
> > Generally
> > the community will cancel the release vote if anyone identifies serious
> > problems, but in most cases the ultimate decision, lies with the
> individual
> > serving as release manager. The specifics of the process may vary from
> > project to project, but the 'minimum quorum of three +1 votes' rule is
> > universal.
> >
> > On Tue, Sep 4, 2018 at 7:12 PM Sheng Zha <szha.pvg@gmail.com> wrote:
> >
> > > Thanks for sharing your opinions, Thomas. Your recognition and respect
> of
> > > people's efforts on preparing the release candidate are certainly
> > > appreciated.
> > >
> > > Now that the vote is set to fail thanks to the veto, there will be
> plenty
> > > of opportunities to include those bug fixes, including the one Zhi
> > > mentioned [1], which was already merged in the master and yet chose not
> > to
> > > block this release with [2]. I will be happy to work with Roshani to
> > > prepare another release candidate once ready.
> > >
> > > -sz
> > >
> > > [1]
> > >
> > >
> >
> https://lists.apache.org/thread.html/f02e952bec22c82cb00a6741390a78f55373311c97464997bb455a6c@%3Cdev.mxnet.apache.org%3E
> > > [2]
> > >
> > >
> >
> https://lists.apache.org/thread.html/85d3fcabb3437ba7f1af455cf69aa13eb3afd1ea1d1f6f891e9c339c@%3Cdev.mxnet.apache.org%3E
> > >
> > > On Tue, Sep 4, 2018 at 6:02 PM Thomas DELTEIL <
> thomas.delteil1@gmail.com
> > >
> > > wrote:
> > >
> > > > -0
> > > > (non-binding)
> > > >
> > > > If I may add some nuancing plus a personal data point as one of the
> > users
> > > > commenting in the bug report in question:
> > > >
> > > > - Performance vs. Basic functionality => I don't think high
> performance
> > > > use-cases and basic functionality are two obviously opposed concepts
> > and
> > > > see no contradiction in Hagay's and Sandeep's statements.
> > > > Float16 support is feature of MXNet that provides more than twice the
> > > > performance of Float32 on supported platforms, hence the high
> > performance
> > > > use-case. The bug is that the basic functionality of reloading a
> saved
> > > > float16 models is currently broken.
> > > >
> > > > - This bug vs Other bugs => Contrary the vast majority of the 140
> open
> > > bugs
> > > > that are mentioned above, I would put to Sandeep's credit that this
> one
> > > bug
> > > > has a PR open that provides a fix for it. This would make it a better
> > > > candidate to get included in this release than a bug that has no fix
> > > ready
> > > > for it.
> > > >
> > > > - Personal datapoint: I recently did some experimentation with
> float16
> > > [1]
> > > > and actually coincidentally just published a video on optimizing
> > > > performance for Gluon. Float16 conversion is one of the most, if not
> > the
> > > > most effective way to get performance out of MXNet [2]. I believe
> there
> > > is
> > > > a lot of value in publicizing more its use and hence making sure at
> > least
> > > > the basic support for normal use-cases is present.
> > > >
> > > > Of course this needs to be balanced with the overhead of preparing a
> > new
> > > > release candidate once the fixed is reviewed and merged, which seems
> to
> > > be
> > > > a lengthy and complex process in its own right, and the delay with
> > > > providing the other features present in 1.3 for users that are not
> > > running
> > > > off the nightly builds.
> > > >
> > > > All the best,
> > > >
> > > > Thomas
> > > >
> > > > [1] https://github.com/ThomasDelteil/PerformanceTricksMXNetGluon
> > > > [2]
> > > >
> > > >
> > >
> >
> https://www.youtube.com/watch?v=Cqo7FPftNyo&t=0s&list=PLkEvNnRk8uVk6U515Pj-jHQUxFC4eDi3m
> > > >
> > > > Le mar. 4 sept. 2018 à 17:11, Sheng Zha <szha.pvg@gmail.com> a
> écrit :
> > > >
> > > > > Sandeep,
> > > > >
> > > > > Thanks for explaining your veto. We have open bugs that impacted
a
> > lot
> > > > more
> > > > > than just 3 customers, just by referring to the number of
> commenters
> > on
> > > > the
> > > > > issue [1].
> > > > >
> > > > > You said that this is for "high performance use cases", which
> > > contradicts
> > > > > with Hagay's assement that this is "basic functionality broken".
> > Given
> > > > that
> > > > > this is for advanced use cases of using half-precision training,
> why
> > is
> > > > it
> > > > > so much more important than any other open bug reports, that for
> this
> > > > > specific bug fix, we have to delay the access of regular users to
> the
> > > new
> > > > > MXNet 1.3 release by at least another week?
> > > > >
> > > > > Honestly, I'm concerned that your vote is biased by Amazon
> > involvement,
> > > > > given that you quoted Amazon Rekognition.
> > > > >
> > > > > -sz
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/incubator-mxnet/issues?q=is%3Aissue+is%3Aopen+label%3ABug+sort%3Acomments-desc
> > > > >
> > > > > On Tue, Sep 4, 2018 at 4:51 PM sandeep krishnamurthy <
> > > > > sandeep.krishna98@gmail.com> wrote:
> > > > >
> > > > > > My initial vote of “-0” was due to lack of info from a user
who
> had
> > > > said,
> > > > > > he overcame this issue for FP16 model.
> > > > > >
> > > > > >
> > > > > > However, suggested workaround [1] for the issue is not straight
> > > forward
> > > > > and
> > > > > > generally usable for all users. Also, issue is not simple and
> > > isolated
> > > > to
> > > > > > be listed in the Release Notes as known issue with a workaround.
> > > > > >
> > > > > >
> > > > > > Changing my vote to: "-1 (binding)" owing to the user impact
[3]
> > > > > >
> > > > > >
> > > > > >
> > > > > > @Sheng:
> > > > > >
> > > > > > 1. Agreed, bug existed from long time. However, FP16 and such
> > > > > optimizations
> > > > > > were added later on. Followed by users [2] using this feature
for
> > > high
> > > > > > performance use cases. It is not ok to measure severity of the
> bug
> > > > based
> > > > > on
> > > > > > its past existence, rather we can see who is impacted now and
is
> > it a
> > > > > small
> > > > > > subset with a simple workaround or large user impacting issue.
> > > > > >
> > > > > > 2. Agreed bug was reported 7/21. However, I became aware of
this
> > > issue
> > > > on
> > > > > > 08/29 and submitted the fix on 08/30. Also, I did bring this
to
> the
> > > > > notice
> > > > > > of community, you and 1.3 release manager (Roshani) on the RC0
> > > proposal
> > > > > > thread. Also, I would focus on the issue and user impact than
who
> > > > > > identified and who is fixing the issue.
> > > > > >
> > > > > >
> > > > > > Based on my discussion with 2 users, I think it is a important
> > > feature
> > > > > for
> > > > > > them to see in Apache MXNet v1.3.0.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Sandeep
> > > > > >
> > > > > >
> > > > > > [1] Workaround used by the user.
> > > > > >
> > > > > >
> > > > > > net_fp16 =
> > mx.gluon.SymbolBlock.imports('resnet34_fp16-symbol.json',
> > > > > > ['data'])
> > > > > >
> > > > > > params_fp16 = mx.nd.load('resnet34_fp16-0000.params')
> > > > > >
> > > > > >
> > > > > > for k, v in params_fp16.items():
> > > > > >
> > > > > >     new_key = k.split(':')[1]
> > > > > >
> > > > > >     net_fp16.collect_params()[new_key].cast(v.dtype)
> > > > > >
> > > > > >
> > > > > > net_fp16.collect_params().load('resnet34_fp16-0000.params',
ctx)
> > > > > >
> > > > > >
> > > > > > [2] Amazon Rekognition
> > > > > >
> > > > > >
> > > > > > [3] User story: Train a model -> Cast it to FP16 -> Save
the
> model
> > ->
> > > > > Load
> > > > > > back the model does not work. They have to cast every parameter
> > with
> > > a
> > > > > > workaround mentioned above [1].
> > > > > >
> > > > > > On Tue, Sep 4, 2018 at 4:14 PM Hagay Lupesko <lupesko@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hi Sheng,
> > > > > > >
> > > > > > > Addressing your questions:
> > > > > > >
> > > > > > > - "why this specific bug is more important than all the
other
> > known
> > > > > bugs,
> > > > > > > that this becomes a release blocker"
> > > > > > > I do not consider it to be more or less important than
other
> > fixes.
> > > > It
> > > > > > can
> > > > > > > be fixed and included in the release alongside the rest
of the
> > > > release
> > > > > > > content, right?
> > > > > > > From the description of the issue it seems important since
it
> is
> > > > > blocking
> > > > > > > users from loading models that were previously trained
and
> saved.
> > > > There
> > > > > > is
> > > > > > > nothing stopping the community from including this fix
into
> > 1.3.0,
> > > > > > > alongside the rest of the features and fixes.
> > > > > > >
> > > > > > > - "The bug exists since SymbolBlock was introduced a year
ago
> and
> > > has
> > > > > > > survived at least three releases, so this is not a regression."
> > > > > > > I do not think I said it is a regression. However, the
fact a
> bug
> > > > > existed
> > > > > > > before, does not mean it is OK to release it rather than
fix
> it.
> > > > > > >
> > > > > > > - "Timeline-wise, this bug was reported on 7/21, but was
not
> > > reported
> > > > > as
> > > > > > > release-blocker in the release discussion thread until
8/31
> [1].
> > > > > Neither
> > > > > > > its reporting as release-blocker nor its fix made it for
the
> 8/3
> > > code
> > > > > > > freeze."
> > > > > > > You are right, would have been better to have this identified
> and
> > > > fixed
> > > > > > > earlier and included before code freeze.
> > > > > > >
> > > > > > > - "The PR is still not ready yet as it doesn't have approval."
> > > > > > > I think it is waiting for your review.
> > > > > > >
> > > > > > > - "it would be great if you could provide some additional
> > reasoning
> > > > > > besides
> > > > > > > "X mentions the issue" or "fix was done by X""
> > > > > > > I have. Repeating what I wrote in my previous email for
> clarity:
> > > > Basic
> > > > > > > functionality broken: loading a model (albeit one that
that was
> > > saved
> > > > > as
> > > > > > > non FP32)
> > > > > > >
> > > > > > > So, yes - this issue seems to have been out there for a
while,
> > > > somehow
> > > > > > went
> > > > > > > under the radar... but I think the key question is whether
this
> > > > blocks
> > > > > a
> > > > > > > basic functionality in MXNet. I believe so, hence my -1
vote.
> > > > > > >
> > > > > > > Hagay
> > > > > > >
> > > > > > > On Tue, Sep 4, 2018 at 1:19 PM Sheng Zha <szha.pvg@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > > > Hi Hagay and Sandeep,
> > > > > > > >
> > > > > > > > Could you help us understand why this specific bug
is more
> > > > important
> > > > > > than
> > > > > > > > all the other known bugs, that this becomes a release
> blocker?
> > > > > > > >
> > > > > > > > Some facts to consider:
> > > > > > > > - The bug exists since SymbolBlock was introduced
a year ago
> > and
> > > > has
> > > > > > > > survived at least three releases, so this is not a
> regression.
> > > > > > > > - Timeline-wise, this bug was reported on 7/21, but
was not
> > > > reported
> > > > > as
> > > > > > > > release-blocker in the release discussion thread until
8/31
> > [1].
> > > > > > Neither
> > > > > > > > its reporting as release-blocker nor its fix made
it for the
> > 8/3
> > > > code
> > > > > > > > freeze.
> > > > > > > > - The PR is still not ready yet as it doesn't have
approval.
> > > > > > > >
> > > > > > > > Hagay, it would be great if you could provide some
additional
> > > > > reasoning
> > > > > > > > besides "X mentions the issue" or "fix was done by
X".
> Thanks.
> > > > > > > >
> > > > > > > > -sz
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/d1ed611f98c20d5d85c294b0c07c8bdebca13a209cf66a3872c9123e@%3Cdev.mxnet.apache.org%3E
> > > > > > > >
> > > > > > > > On Tue, Sep 4, 2018 at 12:39 PM Hagay Lupesko <
> > lupesko@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Sandeep mentions the issue of an error when user
tries to
> > load
> > > > > model
> > > > > > > > params
> > > > > > > > > trained/saved as FP16.
> > > > > > > > > https://github.com/apache/incubator-mxnet/issues/11849
> > > > > > > > > The fix was done by Sandeep:
> > > > > > > > > https://github.com/apache/incubator-mxnet/pull/12412
and
> is
> > > > ready
> > > > > to
> > > > > > > be
> > > > > > > > > cherry picked into the release branch.
> > > > > > > > >
> > > > > > > > > This seems like a release blocker to me:
> > > > > > > > > - Basic functionality broken: loading a model
(albeit one
> > that
> > > > that
> > > > > > was
> > > > > > > > > saved as non FP32)
> > > > > > > > > - Reported by 3 users (wgchang@, nicklhy@ and
> ThomasDelteil@
> > )
> > > > > > > > >
> > > > > > > > > -1 (non binding)
> > > > > > > > >
> > > > > > > > > Hagay
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Sep 4, 2018 at 12:01 PM sandeep krishnamurthy
<
> > > > > > > > > sandeep.krishna98@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > "- 0"
> > > > > > > > > >
> > > > > > > > > > I believe the bug #11849
> > > > > > > > > > <https://github.com/apache/incubator-mxnet/issues/11849
> >,
> > > > unable
> > > > > > to
> > > > > > > > > import
> > > > > > > > > > non-fp32 models into Gluon, fixed in this
PR #12412
> > > > > > > > > > <https://github.com/apache/incubator-mxnet/pull/12412>
> is
> > > > > > important
> > > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > users. I would rather pick this fix in this
release than
> > > plan a
> > > > > > minor
> > > > > > > > > > release later.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Sandeep
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, Sep 3, 2018 at 2:34 PM Philip Cho
<
> > > > > > > chohyu01@cs.washington.edu>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Actually, the command "git clone --recursive
> > > > > > > > > > > https://github.com/apache/incubator-mxnet
-b
> 1.3.0.rc0"
> > > > works
> > > > > > fine
> > > > > > > > > now,
> > > > > > > > > > > never mind.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Sep 3, 2018 at 1:45 PM Philip
Cho <
> > > > > > > > chohyu01@cs.washington.edu>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Unfortunately, MXNet was depending
on a branch of TVM
> > > that
> > > > is
> > > > > > now
> > > > > > > > > > > deleted.
> > > > > > > > > > > > We will have to merge #12448
> > > > > > > > > > > > <
> https://github.com/apache/incubator-mxnet/pull/12448>
> > > > > before
> > > > > > > the
> > > > > > > > > > > release.
> > > > > > > > > > > >
> > > > > > > > > > > > Background: See dmlc/tvm#1394
<
> > > > > > > > > https://github.com/dmlc/tvm/issues/1394
> > > > > > > > > > >.
> > > > > > > > > > > >
> > > > > > > > > > > > Philip.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Sep 3, 2018 at 7:26 AM
Carin Meier <
> > > > > > carinmeier@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> Checked out the tag, built
and tested the Clojure
> > > package.
> > > > > +1
> > > > > > > > > > > >>
> > > > > > > > > > > >> On Fri, Aug 31, 2018 at 10:59
PM Roshani Nagmote <
> > > > > > > > > > > >> roshaninagmote2@gmail.com>
> > > > > > > > > > > >> wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> > Hi all,
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > I would like to propose
a vote to release Apache
> > MXNet
> > > > > > > > > (incubating)
> > > > > > > > > > > >> version
> > > > > > > > > > > >> > 1.3.0.RC0. Voting will
start now (Friday, Aug
> 31st)
> > > and
> > > > > end
> > > > > > at
> > > > > > > > > 7:00
> > > > > > > > > > PM
> > > > > > > > > > > >> > PDT, Wednesday, Sept
5th.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Link to release notes:
> > > > > > > > > > > >> >
> https://github.com/apache/incubator-mxnet/releases
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Link to release candidate
1.3.0.rc0:
> > > > > > > > > > > >> > *
> > > > > > > >
> > https://github.com/apache/incubator-mxnet/releases/tag/1.3.0.rc
> > > > > > > > > > > >> > <
> > > > > > > >
> > https://github.com/apache/incubator-mxnet/releases/tag/1.3.0.rc0
> > > > > > > > > > >0*
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > View this page, click
on "Build from Source", and
> > use
> > > > the
> > > > > > > source
> > > > > > > > > > code
> > > > > > > > > > > >> > obtained from 1.3.0.rc0
tag:
> > > > > > > > > > > >> >
> > https://mxnet.incubator.apache.org/install/index.html
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Please remember to TEST
first before voting
> > > accordingly:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > +1 = approve
> > > > > > > > > > > >> > +0 = no opinion
> > > > > > > > > > > >> > -1 = disapprove (provide
reason)
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Thanks,
> > > > > > > > > > > >> > Roshani
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Sandeep Krishnamurthy
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Sandeep Krishnamurthy
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> Sandeep Krishnamurthy
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message