mxnet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Naveen Swamy <mnnav...@gmail.com>
Subject Re: [VOTE] Release MXNet version 1.3.0.RC0
Date Wed, 05 Sep 2018 02:38:32 GMT
"Releases may not be vetoed"
http://www.apache.org/legal/release-policy.html#release-approval

I haven't tested the release yet, I'll do so tomorrow.

> On Sep 4, 2018, at 7:13 PM, Sheng Zha <szha.pvg@gmail.com> wrote:
> 
> Thanks for sharing your opinions, Thomas. Your recognition and respect of
> people's efforts on preparing the release candidate are certainly
> appreciated.
> 
> Now that the vote is set to fail thanks to the veto, there will be plenty
> of opportunities to include those bug fixes, including the one Zhi
> mentioned [1], which was already merged in the master and yet chose not to
> block this release with [2]. I will be happy to work with Roshani to
> prepare another release candidate once ready.
> 
> -sz
> 
> [1]
> https://lists.apache.org/thread.html/f02e952bec22c82cb00a6741390a78f55373311c97464997bb455a6c@%3Cdev.mxnet.apache.org%3E
> [2]
> https://lists.apache.org/thread.html/85d3fcabb3437ba7f1af455cf69aa13eb3afd1ea1d1f6f891e9c339c@%3Cdev.mxnet.apache.org%3E
> 
> On Tue, Sep 4, 2018 at 6:02 PM Thomas DELTEIL <thomas.delteil1@gmail.com>
> wrote:
> 
>> -0
>> (non-binding)
>> 
>> If I may add some nuancing plus a personal data point as one of the users
>> commenting in the bug report in question:
>> 
>> - Performance vs. Basic functionality => I don't think high performance
>> use-cases and basic functionality are two obviously opposed concepts and
>> see no contradiction in Hagay's and Sandeep's statements.
>> Float16 support is feature of MXNet that provides more than twice the
>> performance of Float32 on supported platforms, hence the high performance
>> use-case. The bug is that the basic functionality of reloading a saved
>> float16 models is currently broken.
>> 
>> - This bug vs Other bugs => Contrary the vast majority of the 140 open bugs
>> that are mentioned above, I would put to Sandeep's credit that this one bug
>> has a PR open that provides a fix for it. This would make it a better
>> candidate to get included in this release than a bug that has no fix ready
>> for it.
>> 
>> - Personal datapoint: I recently did some experimentation with float16 [1]
>> and actually coincidentally just published a video on optimizing
>> performance for Gluon. Float16 conversion is one of the most, if not the
>> most effective way to get performance out of MXNet [2]. I believe there is
>> a lot of value in publicizing more its use and hence making sure at least
>> the basic support for normal use-cases is present.
>> 
>> Of course this needs to be balanced with the overhead of preparing a new
>> release candidate once the fixed is reviewed and merged, which seems to be
>> a lengthy and complex process in its own right, and the delay with
>> providing the other features present in 1.3 for users that are not running
>> off the nightly builds.
>> 
>> All the best,
>> 
>> Thomas
>> 
>> [1] https://github.com/ThomasDelteil/PerformanceTricksMXNetGluon
>> [2]
>> 
>> https://www.youtube.com/watch?v=Cqo7FPftNyo&t=0s&list=PLkEvNnRk8uVk6U515Pj-jHQUxFC4eDi3m
>> 
>>> Le mar. 4 sept. 2018 à 17:11, Sheng Zha <szha.pvg@gmail.com> a écrit
:
>>> 
>>> Sandeep,
>>> 
>>> Thanks for explaining your veto. We have open bugs that impacted a lot
>> more
>>> than just 3 customers, just by referring to the number of commenters on
>> the
>>> issue [1].
>>> 
>>> You said that this is for "high performance use cases", which contradicts
>>> with Hagay's assement that this is "basic functionality broken". Given
>> that
>>> this is for advanced use cases of using half-precision training, why is
>> it
>>> so much more important than any other open bug reports, that for this
>>> specific bug fix, we have to delay the access of regular users to the new
>>> MXNet 1.3 release by at least another week?
>>> 
>>> Honestly, I'm concerned that your vote is biased by Amazon involvement,
>>> given that you quoted Amazon Rekognition.
>>> 
>>> -sz
>>> 
>>> [1]
>>> 
>>> 
>> https://github.com/apache/incubator-mxnet/issues?q=is%3Aissue+is%3Aopen+label%3ABug+sort%3Acomments-desc
>>> 
>>> On Tue, Sep 4, 2018 at 4:51 PM sandeep krishnamurthy <
>>> sandeep.krishna98@gmail.com> wrote:
>>> 
>>>> My initial vote of “-0” was due to lack of info from a user who had
>> said,
>>>> he overcame this issue for FP16 model.
>>>> 
>>>> 
>>>> However, suggested workaround [1] for the issue is not straight forward
>>> and
>>>> generally usable for all users. Also, issue is not simple and isolated
>> to
>>>> be listed in the Release Notes as known issue with a workaround.
>>>> 
>>>> 
>>>> Changing my vote to: "-1 (binding)" owing to the user impact [3]
>>>> 
>>>> 
>>>> 
>>>> @Sheng:
>>>> 
>>>> 1. Agreed, bug existed from long time. However, FP16 and such
>>> optimizations
>>>> were added later on. Followed by users [2] using this feature for high
>>>> performance use cases. It is not ok to measure severity of the bug
>> based
>>> on
>>>> its past existence, rather we can see who is impacted now and is it a
>>> small
>>>> subset with a simple workaround or large user impacting issue.
>>>> 
>>>> 2. Agreed bug was reported 7/21. However, I became aware of this issue
>> on
>>>> 08/29 and submitted the fix on 08/30. Also, I did bring this to the
>>> notice
>>>> of community, you and 1.3 release manager (Roshani) on the RC0 proposal
>>>> thread. Also, I would focus on the issue and user impact than who
>>>> identified and who is fixing the issue.
>>>> 
>>>> 
>>>> Based on my discussion with 2 users, I think it is a important feature
>>> for
>>>> them to see in Apache MXNet v1.3.0.
>>>> 
>>>> 
>>>> 
>>>> Best,
>>>> 
>>>> Sandeep
>>>> 
>>>> 
>>>> [1] Workaround used by the user.
>>>> 
>>>> 
>>>> net_fp16 = mx.gluon.SymbolBlock.imports('resnet34_fp16-symbol.json',
>>>> ['data'])
>>>> 
>>>> params_fp16 = mx.nd.load('resnet34_fp16-0000.params')
>>>> 
>>>> 
>>>> for k, v in params_fp16.items():
>>>> 
>>>>    new_key = k.split(':')[1]
>>>> 
>>>>    net_fp16.collect_params()[new_key].cast(v.dtype)
>>>> 
>>>> 
>>>> net_fp16.collect_params().load('resnet34_fp16-0000.params', ctx)
>>>> 
>>>> 
>>>> [2] Amazon Rekognition
>>>> 
>>>> 
>>>> [3] User story: Train a model -> Cast it to FP16 -> Save the model
->
>>> Load
>>>> back the model does not work. They have to cast every parameter with a
>>>> workaround mentioned above [1].
>>>> 
>>>> On Tue, Sep 4, 2018 at 4:14 PM Hagay Lupesko <lupesko@gmail.com>
>> wrote:
>>>> 
>>>>> Hi Sheng,
>>>>> 
>>>>> Addressing your questions:
>>>>> 
>>>>> - "why this specific bug is more important than all the other known
>>> bugs,
>>>>> that this becomes a release blocker"
>>>>> I do not consider it to be more or less important than other fixes.
>> It
>>>> can
>>>>> be fixed and included in the release alongside the rest of the
>> release
>>>>> content, right?
>>>>> From the description of the issue it seems important since it is
>>> blocking
>>>>> users from loading models that were previously trained and saved.
>> There
>>>> is
>>>>> nothing stopping the community from including this fix into 1.3.0,
>>>>> alongside the rest of the features and fixes.
>>>>> 
>>>>> - "The bug exists since SymbolBlock was introduced a year ago and has
>>>>> survived at least three releases, so this is not a regression."
>>>>> I do not think I said it is a regression. However, the fact a bug
>>> existed
>>>>> before, does not mean it is OK to release it rather than fix it.
>>>>> 
>>>>> - "Timeline-wise, this bug was reported on 7/21, but was not reported
>>> as
>>>>> release-blocker in the release discussion thread until 8/31 [1].
>>> Neither
>>>>> its reporting as release-blocker nor its fix made it for the 8/3 code
>>>>> freeze."
>>>>> You are right, would have been better to have this identified and
>> fixed
>>>>> earlier and included before code freeze.
>>>>> 
>>>>> - "The PR is still not ready yet as it doesn't have approval."
>>>>> I think it is waiting for your review.
>>>>> 
>>>>> - "it would be great if you could provide some additional reasoning
>>>> besides
>>>>> "X mentions the issue" or "fix was done by X""
>>>>> I have. Repeating what I wrote in my previous email for clarity:
>> Basic
>>>>> functionality broken: loading a model (albeit one that that was saved
>>> as
>>>>> non FP32)
>>>>> 
>>>>> So, yes - this issue seems to have been out there for a while,
>> somehow
>>>> went
>>>>> under the radar... but I think the key question is whether this
>> blocks
>>> a
>>>>> basic functionality in MXNet. I believe so, hence my -1 vote.
>>>>> 
>>>>> Hagay
>>>>> 
>>>>>> On Tue, Sep 4, 2018 at 1:19 PM Sheng Zha <szha.pvg@gmail.com>
wrote:
>>>>>> 
>>>>>> Hi Hagay and Sandeep,
>>>>>> 
>>>>>> Could you help us understand why this specific bug is more
>> important
>>>> than
>>>>>> all the other known bugs, that this becomes a release blocker?
>>>>>> 
>>>>>> Some facts to consider:
>>>>>> - The bug exists since SymbolBlock was introduced a year ago and
>> has
>>>>>> survived at least three releases, so this is not a regression.
>>>>>> - Timeline-wise, this bug was reported on 7/21, but was not
>> reported
>>> as
>>>>>> release-blocker in the release discussion thread until 8/31 [1].
>>>> Neither
>>>>>> its reporting as release-blocker nor its fix made it for the 8/3
>> code
>>>>>> freeze.
>>>>>> - The PR is still not ready yet as it doesn't have approval.
>>>>>> 
>>>>>> Hagay, it would be great if you could provide some additional
>>> reasoning
>>>>>> besides "X mentions the issue" or "fix was done by X". Thanks.
>>>>>> 
>>>>>> -sz
>>>>>> 
>>>>>> [1]
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://lists.apache.org/thread.html/d1ed611f98c20d5d85c294b0c07c8bdebca13a209cf66a3872c9123e@%3Cdev.mxnet.apache.org%3E
>>>>>> 
>>>>>> On Tue, Sep 4, 2018 at 12:39 PM Hagay Lupesko <lupesko@gmail.com>
>>>> wrote:
>>>>>> 
>>>>>>> Sandeep mentions the issue of an error when user tries to load
>>> model
>>>>>> params
>>>>>>> trained/saved as FP16.
>>>>>>> https://github.com/apache/incubator-mxnet/issues/11849
>>>>>>> The fix was done by Sandeep:
>>>>>>> https://github.com/apache/incubator-mxnet/pull/12412 and is
>> ready
>>> to
>>>>> be
>>>>>>> cherry picked into the release branch.
>>>>>>> 
>>>>>>> This seems like a release blocker to me:
>>>>>>> - Basic functionality broken: loading a model (albeit one that
>> that
>>>> was
>>>>>>> saved as non FP32)
>>>>>>> - Reported by 3 users (wgchang@, nicklhy@ and ThomasDelteil@)
>>>>>>> 
>>>>>>> -1 (non binding)
>>>>>>> 
>>>>>>> Hagay
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Sep 4, 2018 at 12:01 PM sandeep krishnamurthy <
>>>>>>> sandeep.krishna98@gmail.com> wrote:
>>>>>>> 
>>>>>>>> "- 0"
>>>>>>>> 
>>>>>>>> I believe the bug #11849
>>>>>>>> <https://github.com/apache/incubator-mxnet/issues/11849>,
>> unable
>>>> to
>>>>>>> import
>>>>>>>> non-fp32 models into Gluon, fixed in this PR #12412
>>>>>>>> <https://github.com/apache/incubator-mxnet/pull/12412>
is
>>>> important
>>>>>> for
>>>>>>>> the
>>>>>>>> users. I would rather pick this fix in this release than
plan a
>>>> minor
>>>>>>>> release later.
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Sandeep
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Sep 3, 2018 at 2:34 PM Philip Cho <
>>>>> chohyu01@cs.washington.edu>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Actually, the command "git clone --recursive
>>>>>>>>> https://github.com/apache/incubator-mxnet -b 1.3.0.rc0"
>> works
>>>> fine
>>>>>>> now,
>>>>>>>>> never mind.
>>>>>>>>> 
>>>>>>>>> On Mon, Sep 3, 2018 at 1:45 PM Philip Cho <
>>>>>> chohyu01@cs.washington.edu>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Unfortunately, MXNet was depending on a branch of
TVM that
>> is
>>>> now
>>>>>>>>> deleted.
>>>>>>>>>> We will have to merge #12448
>>>>>>>>>> <https://github.com/apache/incubator-mxnet/pull/12448>
>>> before
>>>>> the
>>>>>>>>> release.
>>>>>>>>>> 
>>>>>>>>>> Background: See dmlc/tvm#1394 <
>>>>>>> https://github.com/dmlc/tvm/issues/1394
>>>>>>>>> .
>>>>>>>>>> 
>>>>>>>>>> Philip.
>>>>>>>>>> 
>>>>>>>>>> On Mon, Sep 3, 2018 at 7:26 AM Carin Meier <
>>>> carinmeier@gmail.com
>>>>>> 
>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Checked out the tag, built and tested the Clojure
package.
>>> +1
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Aug 31, 2018 at 10:59 PM Roshani Nagmote
<
>>>>>>>>>>> roshaninagmote2@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>> 
>>>>>>>>>>>> I would like to propose a vote to release
Apache MXNet
>>>>>>> (incubating)
>>>>>>>>>>> version
>>>>>>>>>>>> 1.3.0.RC0. Voting will start now (Friday,
Aug 31st) and
>>> end
>>>> at
>>>>>>> 7:00
>>>>>>>> PM
>>>>>>>>>>>> PDT, Wednesday, Sept 5th.
>>>>>>>>>>>> 
>>>>>>>>>>>> Link to release notes:
>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/releases
>>>>>>>>>>>> 
>>>>>>>>>>>> Link to release candidate 1.3.0.rc0:
>>>>>>>>>>>> *
>>>>>> https://github.com/apache/incubator-mxnet/releases/tag/1.3.0.rc
>>>>>>>>>>>> <
>>>>>> https://github.com/apache/incubator-mxnet/releases/tag/1.3.0.rc0
>>>>>>>>> 0*
>>>>>>>>>>>> 
>>>>>>>>>>>> View this page, click on "Build from Source",
and use
>> the
>>>>> source
>>>>>>>> code
>>>>>>>>>>>> obtained from 1.3.0.rc0 tag:
>>>>>>>>>>>> https://mxnet.incubator.apache.org/install/index.html
>>>>>>>>>>>> 
>>>>>>>>>>>> Please remember to TEST first before voting
accordingly:
>>>>>>>>>>>> 
>>>>>>>>>>>> +1 = approve
>>>>>>>>>>>> +0 = no opinion
>>>>>>>>>>>> -1 = disapprove (provide reason)
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Roshani
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Sandeep Krishnamurthy
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sandeep Krishnamurthy
>>>> 
>>> 
>> 

Mime
  • Unnamed multipart/alternative (inline, 7-Bit, 0 bytes)
View raw message