mxnet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marco de Abreu <marco.g.ab...@gmail.com>
Subject Re: Making new operators and AMP lists
Date Tue, 28 May 2019 23:49:52 GMT
I'm having trouble in how far adding the name of an operator in a single
file is too much to expect from somebody and how this is driving people
away.

If somebody adds a tutorial, they also have to add the tutorial to a
specific file. As far as I can tell, this has not resulted in people not
wanting to write tutorials anymore or it being considered as such a big
burden.

So far, I'm really not following why adding a single line to a single file
is considered such a big deal. Considering how long this guide [1] already
is, what's the harm in adding this as an additional instruction? (AMP is
not mentioned there yet, it would be great if that could be follow up)

[1]
https://mxnet.incubator.apache.org/versions/master/faq/add_op_in_backend.html

-Marco

On Wed, May 29, 2019 at 1:42 AM Sheng Zha <zhasheng@apache.org> wrote:

> AMP is in contrib so there's no guarantee that the API is final. Adopting
> the test as-is is harmful because operator authors should not be required
> to invest in an experimental feature that they are not aware of.
>
> I'm all for openness and welcoming, but think about whether you'd like to
> turn away developers who just want to write a CPU-only operator. The more
> you impose on the developers the less likely they will make the
> contribution through.
>
> Having an unfamiliar operator in AMP as a warning could let everyone know
> what the support state is whenever that feature is used. For those who care
> about this, they would see the warning and add the support to get the speed
> benefit of not casting to fp32. In this case, rather than imposing it to
> developers who don't know about AMP, the one who actually uses AMP and
> cares about this feature would drive the work forward.
>
> -sz
>
> On 2019/05/28 23:25:43, Marco de Abreu <marco.g.abreu@gmail.com> wrote:
> > While AMP might be an experimental feature, I rather would like to put
> the
> > focus on the maturity of its interfaces. If the interfaces and the
> actions
> > developers have to do aren't finalized yet, I'd agree with disabling the
> > test. But if the API is final and easy to use, I don't see why adopting
> > early on would be harmful. But from what I can see, the output of the
> test
> > is very meaningful and explicit, easily understandable and offers the
> > developer a clear list of action items that they can follow.
> >
> > If people actually start commenting "CI test failure seems unrelated to
> my
> > change, please proceed and merge", we should advise them to please open
> the
> > result tab, which will directly show the clear list of action items.
> > Committers should support these contributors who are not that familiar
> with
> > our CI system and explain to them how they can retrieve the reasons for
> the
> > failure. Simply ignoring the fact that the change is not compatible with
> > AMP doesn't seem to be the right way.
> >
> > I think nobody is opposed to having AMP in MXNet. As part of accepting
> the
> > feature, we also added AMP to the established methods including the
> coding
> > constraints and other checks that come with it. Lets be open and welcome
> to
> > this new feature and come back if the turnaround time is actually too
> high.
> >
> > -Marco
> >
> >
> > On Wed, May 29, 2019 at 1:13 AM Sheng Zha <zhasheng@apache.org> wrote:
> >
> > > The support for AMP should not be a burden of authors of new operators.
> > > The lint analogy doesn't apply because lint is for established and
> accepted
> > > coding standard at MXNet and AMP is not. AMP is an experimental feature
> > > right now and it doesn't make sense to require contributors to invest
> in
> > > it. Keeping this as error will inevitably cause comments like "CI test
> > > failure seems unrelated to my change, please proceed and merge".
> > >
> > > -sz
> > >
> > > On 2019/05/28 22:51:30, Marco de Abreu <marco.g.abreu@gmail.com>
> wrote:
> > > > Hi,
> > > >
> > > > I'm generally in favour of these kind of tests since they make
> developers
> > > > aware of changes they have to make which they would usually not be
> aware
> > > > of. We have a similar test for tutorials, for example. Whenever
> somebody
> > > > adds a tutorial, there's a validation that assures that all
> contraints in
> > > > our testing environment are met and that they are properly tied into
> the
> > > > system. This AMP test fits into the same category in my opinion and
> we
> > > > never heard bad feedback about these kind of checks.
> > > >
> > > > What seems to be bothering people is the fact that the feedback time
> is
> > > too
> > > > high. Thus, I'd like to propose to move the test into the sanity-test
> > > stage
> > > > instead of doing it as part of the unit tests which take quite a bit
> of
> > > > time until they're actually executed. The sanity checks run
> immediately
> > > and
> > > > give a response within about 1 minute.
> > > >
> > > > While I understand that this might increase the amount of work a
> > > developer
> > > > has to do if they develop a new operator, I think that this is the
> right
> > > > thing to do. Developers won't know of every single feature other
> people
> > > > worked on and thus might simply miss adding the support for it. This
> kind
> > > > of test on the other hand makes them aware of it. If they'd like to
> opt
> > > > out, it's one single line they would have to change and then they're
> > > > totally fine. On the other hand, this might motivate them to add the
> > > > support since the kernel would be the last piece and everything else
> > > would
> > > > already be implemented.
> > > >
> > > > Considering how often a PR gets declined because of linting errors,
> I'd
> > > say
> > > > that these kind of errors are WAY more frequent that AMP telling
> somebody
> > > > to add their operator to a list. Considering that this would only
> have to
> > > > be done once per operator, that's work of about one minute. Add that
> to
> > > the
> > > > waiting time of the sanity check and you're left with about five
> "wasted"
> > > > minutes.
> > > >
> > > > I'm opposed towards adding a warning or treating them as float32 by
> > > default
> > > > since the operator author wouldn't notice. What will happen is that
> > > people
> > > > won't know about AMP and simply forget about low precision in general
> > > until
> > > > they're actively reminded. This check will remind them actively and
> thus
> > > > bring more attention to the feature. I know that the feature is still
> > > > experimental, but we have just started with the 1.6 branch and thus
> > > there's
> > > > enough time to make the experimental features production ready.
> Adding
> > > this
> > > > test early on will allow others to add the support for AMP during the
> > > early
> > > > stage of the 1.6 branch instead of asking them in the last few weeks
> > > before
> > > > a release. The result would only be that stuff is rushed or
> forgotten.
> > > >
> > > > To sum it up: I think this test is good and it should be kept as
> error,
> > > but
> > > > it should be moved to sanity checks.
> > > >
> > > > -Marco
> > > >
> > > > On Wed, May 29, 2019 at 12:21 AM Sheng Zha <zhasheng@apache.org>
> wrote:
> > > >
> > > > > Thanks for initiating the discussion.
> > > > >
> > > > > The premise for adding the test was to make sure that AMP feature
> is
> > > "not
> > > > > broken", but that's IMO not the right view. AMP is not supposed to
> > > support
> > > > > a new operator it hasn't seen before in the first place. There's
no
> > > way for
> > > > > it to know whether the fp32 cast should happen or not. So AMP
> feature
> > > > > cannot provide the guarantee that it works for all future
> operators.
> > > Thus,
> > > > > adding new operators to AMP list should be considered new feature
> > > instead
> > > > > of fixing existing feature.
> > > > >
> > > > > The AMP test that breaks upon the addition of new operator is thus
> > > > > equivalent to forcing developers of the new operator to add the new
> > > support
> > > > > for AMP. This feels wrong. Especially given that AMP is an
> experimental
> > > > > feature in contrib namespace (i.e. no semver guarantee), this
> practice
> > > > > should be stopped immediately. We cannot force new developers to
> invest
> > > > > into experimental feature this way.
> > > > >
> > > > > I'd suggest the following changes:
> > > > > - for new operators that aren't registered in AMP, cast to float32
> by
> > > > > default and print one-time warning. People using AMP who want to
> avoid
> > > > > casting can register it in the AMP's list.
> > > > > - change the test to print warning about the operators that are not
> > > listed
> > > > > so that it's easy to track the problem.
> > > > >
> > > > > -sz
> > > > >
> > > > > On 2019/05/28 21:32:42, Przemys��aw Tr��dak <ptrendx@apache.org>
> > > wrote:
> > > > > > Dear Community,
> > > > > >
> > > > > > One of the recently merged features of the 1.5 release, AMP
> > > (Automatic
> > > > > Mixed Precision) support (PR [1], design doc [5]), introduced a
> > > requirement
> > > > > that every new operator added to MXNet would need to be present in
> 1
> > > of the
> > > > > lists (in [2]). To make sure that this requirement is not broken
> when
> > > > > somebody adds a new operator and does not know about AMP's
> existence, a
> > > > > test was added to CI ([3]).
> > > > > >
> > > > > > A few people reached out to me (the original author of the
> feature)
> > > > > saying this test increases a burden on a developer of new
> operators and
> > > > > should not be an actual error, but just warning (PR for that change
> > > [4]).
> > > > > That is why I would like to present a motivation for it and discuss
> > > with
> > > > > the wider audience why I feel it was necessary.
> > > > > >
> > > > > > First, for people who do not know the details of what AMP is
-
> it is
> > > a
> > > > > solution that tries to automatically apply best practices of
> training
> > > in
> > > > > lower precision (FP16) to user's FP32 model in order to fully
> utilize
> > > > > capabilities of modern GPUs (and potentially other hardware in the
> > > future).
> > > > > It does so by casting to lower precision inputs to operators
> > > benefitting
> > > > > from it, while casting to full precision inputs of operators that
> are
> > > > > unsafe to run in lower precision or just do not support it.
> > > > > >
> > > > > > The first iteration of AMP kept 2 main lists of operators -
> operators
> > > > > that are beneficial and safe to do in fp16 and operators that need
> to
> > > be
> > > > > cast to FP32. The problem (raised in review of the PR [6], [8]) is
> how
> > > to
> > > > > make sure that the feature works as intended and is not
> inadvertently
> > > > > broken by somebody adding a new operator. The failure scenario
> here is
> > > > > adding a new operator that does not support FP16 and so should be
> cast
> > > to
> > > > > FP32, but AMP does not know about its existence and so does not do
> the
> > > > > casting. The solution proposed in the review was to implicitly
> treat
> > > all of
> > > > > the unknown operators as FP32-only and keep the list of operators
> that
> > > work
> > > > > fine in both FP16 and FP32. This solution however does not really
> work,
> > > > > because there are multiple operators (most notably optimizers)
> where
> > > > > introducing additional casting of the input to FP32 would break the
> > > > > operator.
> > > > > >
> > > > > > That is why after discussion with a few members of the
> community, I
> > > > > decided to proceed with all lists being explicit and introducing
> the
> > > test
> > > > > that would fail when somebody added an operator without
> classifying it
> > > into
> > > > > 1 of the categories, and explain clearly how to do it [7]. It is
> not
> > > ideal
> > > > > solution, as it introduces some burden on the developers who are
> not
> > > aware
> > > > > about AMP, however in the typical case of adding at most a few
> > > operators to
> > > > > MXNet the inconvenience is I think pretty minor while important
> for the
> > > > > feature correctness going forward.
> > > > > >
> > > > > > I would like to gather Community feedback and ideas how to handle
> > > this
> > > > > situation.
> > > > > >
> > > > > > [1] https://github.com/apache/incubator-mxnet/pull/14173
> > > > > > [2]
> > > > >
> > >
> https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/amp/lists/symbol.py
> > > > > > [3]
> > > > >
> > >
> https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py
> > > > > > [4] https://github.com/apache/incubator-mxnet/pull/15085
> > > > > > [5]
> > > > >
> > >
> https://docs.google.com/document/d/1sQzMoPEwux0WXSWirY07us1POD_6Y8pLYq--b9Fvd1o/edit?usp=sharing
> > > > > > [6]
> > > > >
> > >
> https://github.com/apache/incubator-mxnet/pull/14173#discussion_r270728019
> > > > > > [7]
> > > > >
> > >
> https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py#L62-L80
> > > > > > [8]
> > > > >
> > >
> https://github.com/apache/incubator-mxnet/pull/14173#pullrequestreview-235846341
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message