mxnet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anirudh Subramanian <anirudh2...@gmail.com>
Subject Re: Making new operators and AMP lists
Date Wed, 29 May 2019 00:08:27 GMT
The assumption is the AMP requirement is something that has a steep
learning curve. Developers may get confused by the name, but the question
the developer has to essentially answer is (and this can be added in the
error):
1. If the operator can run in FP16 and FP32 modes, put it in
FP16_FP32_FUNCS.
2. If you want to always run the op in FP16 because of the benefits, put it
in FP16.
3. Not sure, Don't know or dont care: FP32 list.

Any operator developer who wrote InferType logic for their operator won't
take much time to choose one of these options.
Now it may be possible that it took a long time before developer realizes
that he has to do this,
thats why I suggested we need to reduce the time it takes for him to
realize that something was missed.

Anirudh

On Tue, May 28, 2019 at 4:57 PM Sheng Zha <zhasheng@apache.org> wrote:

> This is driving people away exactly because they don't know this is what's
> asked of them, and why they are asked of this AMP requirement in the first
> place. To someone who's already familiar with the context, this is little
> to be worried about. It's now the process that requires everyone to become
> familiarized with the AMP requirement that becomes costly. More
> importantly, this is not a question of whether it's too much, but whether
> it should be there in the first place. If it's merely a question of cost I
> imagine you'd have no trouble stepping up and support all of the future
> operators for AMP :)
>
> -sz
>
> On 2019/05/28 23:49:52, Marco de Abreu <marco.g.abreu@gmail.com> wrote:
> > I'm having trouble in how far adding the name of an operator in a single
> > file is too much to expect from somebody and how this is driving people
> > away.
> >
> > If somebody adds a tutorial, they also have to add the tutorial to a
> > specific file. As far as I can tell, this has not resulted in people not
> > wanting to write tutorials anymore or it being considered as such a big
> > burden.
> >
> > So far, I'm really not following why adding a single line to a single
> file
> > is considered such a big deal. Considering how long this guide [1]
> already
> > is, what's the harm in adding this as an additional instruction? (AMP is
> > not mentioned there yet, it would be great if that could be follow up)
> >
> > [1]
> >
> https://mxnet.incubator.apache.org/versions/master/faq/add_op_in_backend.html
> >
> > -Marco
> >
> > On Wed, May 29, 2019 at 1:42 AM Sheng Zha <zhasheng@apache.org> wrote:
> >
> > > AMP is in contrib so there's no guarantee that the API is final.
> Adopting
> > > the test as-is is harmful because operator authors should not be
> required
> > > to invest in an experimental feature that they are not aware of.
> > >
> > > I'm all for openness and welcoming, but think about whether you'd like
> to
> > > turn away developers who just want to write a CPU-only operator. The
> more
> > > you impose on the developers the less likely they will make the
> > > contribution through.
> > >
> > > Having an unfamiliar operator in AMP as a warning could let everyone
> know
> > > what the support state is whenever that feature is used. For those who
> care
> > > about this, they would see the warning and add the support to get the
> speed
> > > benefit of not casting to fp32. In this case, rather than imposing it
> to
> > > developers who don't know about AMP, the one who actually uses AMP and
> > > cares about this feature would drive the work forward.
> > >
> > > -sz
> > >
> > > On 2019/05/28 23:25:43, Marco de Abreu <marco.g.abreu@gmail.com>
> wrote:
> > > > While AMP might be an experimental feature, I rather would like to
> put
> > > the
> > > > focus on the maturity of its interfaces. If the interfaces and the
> > > actions
> > > > developers have to do aren't finalized yet, I'd agree with disabling
> the
> > > > test. But if the API is final and easy to use, I don't see why
> adopting
> > > > early on would be harmful. But from what I can see, the output of the
> > > test
> > > > is very meaningful and explicit, easily understandable and offers the
> > > > developer a clear list of action items that they can follow.
> > > >
> > > > If people actually start commenting "CI test failure seems unrelated
> to
> > > my
> > > > change, please proceed and merge", we should advise them to please
> open
> > > the
> > > > result tab, which will directly show the clear list of action items.
> > > > Committers should support these contributors who are not that
> familiar
> > > with
> > > > our CI system and explain to them how they can retrieve the reasons
> for
> > > the
> > > > failure. Simply ignoring the fact that the change is not compatible
> with
> > > > AMP doesn't seem to be the right way.
> > > >
> > > > I think nobody is opposed to having AMP in MXNet. As part of
> accepting
> > > the
> > > > feature, we also added AMP to the established methods including the
> > > coding
> > > > constraints and other checks that come with it. Lets be open and
> welcome
> > > to
> > > > this new feature and come back if the turnaround time is actually too
> > > high.
> > > >
> > > > -Marco
> > > >
> > > >
> > > > On Wed, May 29, 2019 at 1:13 AM Sheng Zha <zhasheng@apache.org>
> wrote:
> > > >
> > > > > The support for AMP should not be a burden of authors of new
> operators.
> > > > > The lint analogy doesn't apply because lint is for established and
> > > accepted
> > > > > coding standard at MXNet and AMP is not. AMP is an experimental
> feature
> > > > > right now and it doesn't make sense to require contributors to
> invest
> > > in
> > > > > it. Keeping this as error will inevitably cause comments like "CI
> test
> > > > > failure seems unrelated to my change, please proceed and merge".
> > > > >
> > > > > -sz
> > > > >
> > > > > On 2019/05/28 22:51:30, Marco de Abreu <marco.g.abreu@gmail.com>
> > > wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I'm generally in favour of these kind of tests since they make
> > > developers
> > > > > > aware of changes they have to make which they would usually
not
> be
> > > aware
> > > > > > of. We have a similar test for tutorials, for example. Whenever
> > > somebody
> > > > > > adds a tutorial, there's a validation that assures that all
> > > contraints in
> > > > > > our testing environment are met and that they are properly tied
> into
> > > the
> > > > > > system. This AMP test fits into the same category in my opinion
> and
> > > we
> > > > > > never heard bad feedback about these kind of checks.
> > > > > >
> > > > > > What seems to be bothering people is the fact that the feedback
> time
> > > is
> > > > > too
> > > > > > high. Thus, I'd like to propose to move the test into the
> sanity-test
> > > > > stage
> > > > > > instead of doing it as part of the unit tests which take quite
a
> bit
> > > of
> > > > > > time until they're actually executed. The sanity checks run
> > > immediately
> > > > > and
> > > > > > give a response within about 1 minute.
> > > > > >
> > > > > > While I understand that this might increase the amount of work
a
> > > > > developer
> > > > > > has to do if they develop a new operator, I think that this
is
> the
> > > right
> > > > > > thing to do. Developers won't know of every single feature other
> > > people
> > > > > > worked on and thus might simply miss adding the support for
it.
> This
> > > kind
> > > > > > of test on the other hand makes them aware of it. If they'd
like
> to
> > > opt
> > > > > > out, it's one single line they would have to change and then
> they're
> > > > > > totally fine. On the other hand, this might motivate them to
add
> the
> > > > > > support since the kernel would be the last piece and everything
> else
> > > > > would
> > > > > > already be implemented.
> > > > > >
> > > > > > Considering how often a PR gets declined because of linting
> errors,
> > > I'd
> > > > > say
> > > > > > that these kind of errors are WAY more frequent that AMP telling
> > > somebody
> > > > > > to add their operator to a list. Considering that this would
only
> > > have to
> > > > > > be done once per operator, that's work of about one minute.
Add
> that
> > > to
> > > > > the
> > > > > > waiting time of the sanity check and you're left with about
five
> > > "wasted"
> > > > > > minutes.
> > > > > >
> > > > > > I'm opposed towards adding a warning or treating them as float32
> by
> > > > > default
> > > > > > since the operator author wouldn't notice. What will happen
is
> that
> > > > > people
> > > > > > won't know about AMP and simply forget about low precision in
> general
> > > > > until
> > > > > > they're actively reminded. This check will remind them actively
> and
> > > thus
> > > > > > bring more attention to the feature. I know that the feature
is
> still
> > > > > > experimental, but we have just started with the 1.6 branch and
> thus
> > > > > there's
> > > > > > enough time to make the experimental features production ready.
> > > Adding
> > > > > this
> > > > > > test early on will allow others to add the support for AMP
> during the
> > > > > early
> > > > > > stage of the 1.6 branch instead of asking them in the last few
> weeks
> > > > > before
> > > > > > a release. The result would only be that stuff is rushed or
> > > forgotten.
> > > > > >
> > > > > > To sum it up: I think this test is good and it should be kept
as
> > > error,
> > > > > but
> > > > > > it should be moved to sanity checks.
> > > > > >
> > > > > > -Marco
> > > > > >
> > > > > > On Wed, May 29, 2019 at 12:21 AM Sheng Zha <zhasheng@apache.org>
> > > wrote:
> > > > > >
> > > > > > > Thanks for initiating the discussion.
> > > > > > >
> > > > > > > The premise for adding the test was to make sure that AMP
> feature
> > > is
> > > > > "not
> > > > > > > broken", but that's IMO not the right view. AMP is not
> supposed to
> > > > > support
> > > > > > > a new operator it hasn't seen before in the first place.
> There's no
> > > > > way for
> > > > > > > it to know whether the fp32 cast should happen or not.
So AMP
> > > feature
> > > > > > > cannot provide the guarantee that it works for all future
> > > operators.
> > > > > Thus,
> > > > > > > adding new operators to AMP list should be considered new
> feature
> > > > > instead
> > > > > > > of fixing existing feature.
> > > > > > >
> > > > > > > The AMP test that breaks upon the addition of new operator
is
> thus
> > > > > > > equivalent to forcing developers of the new operator to
add
> the new
> > > > > support
> > > > > > > for AMP. This feels wrong. Especially given that AMP is
an
> > > experimental
> > > > > > > feature in contrib namespace (i.e. no semver guarantee),
this
> > > practice
> > > > > > > should be stopped immediately. We cannot force new developers
> to
> > > invest
> > > > > > > into experimental feature this way.
> > > > > > >
> > > > > > > I'd suggest the following changes:
> > > > > > > - for new operators that aren't registered in AMP, cast
to
> float32
> > > by
> > > > > > > default and print one-time warning. People using AMP who
want
> to
> > > avoid
> > > > > > > casting can register it in the AMP's list.
> > > > > > > - change the test to print warning about the operators
that
> are not
> > > > > listed
> > > > > > > so that it's easy to track the problem.
> > > > > > >
> > > > > > > -sz
> > > > > > >
> > > > > > > On 2019/05/28 21:32:42, Przemys��aw Tr��dak <
> ptrendx@apache.org>
> > > > > wrote:
> > > > > > > > Dear Community,
> > > > > > > >
> > > > > > > > One of the recently merged features of the 1.5 release,
AMP
> > > > > (Automatic
> > > > > > > Mixed Precision) support (PR [1], design doc [5]), introduced
a
> > > > > requirement
> > > > > > > that every new operator added to MXNet would need to be
> present in
> > > 1
> > > > > of the
> > > > > > > lists (in [2]). To make sure that this requirement is not
> broken
> > > when
> > > > > > > somebody adds a new operator and does not know about AMP's
> > > existence, a
> > > > > > > test was added to CI ([3]).
> > > > > > > >
> > > > > > > > A few people reached out to me (the original author
of the
> > > feature)
> > > > > > > saying this test increases a burden on a developer of new
> > > operators and
> > > > > > > should not be an actual error, but just warning (PR for
that
> change
> > > > > [4]).
> > > > > > > That is why I would like to present a motivation for it
and
> discuss
> > > > > with
> > > > > > > the wider audience why I feel it was necessary.
> > > > > > > >
> > > > > > > > First, for people who do not know the details of what
AMP is
> -
> > > it is
> > > > > a
> > > > > > > solution that tries to automatically apply best practices
of
> > > training
> > > > > in
> > > > > > > lower precision (FP16) to user's FP32 model in order to
fully
> > > utilize
> > > > > > > capabilities of modern GPUs (and potentially other hardware
in
> the
> > > > > future).
> > > > > > > It does so by casting to lower precision inputs to operators
> > > > > benefitting
> > > > > > > from it, while casting to full precision inputs of operators
> that
> > > are
> > > > > > > unsafe to run in lower precision or just do not support
it.
> > > > > > > >
> > > > > > > > The first iteration of AMP kept 2 main lists of operators
-
> > > operators
> > > > > > > that are beneficial and safe to do in fp16 and operators
that
> need
> > > to
> > > > > be
> > > > > > > cast to FP32. The problem (raised in review of the PR [6],
> [8]) is
> > > how
> > > > > to
> > > > > > > make sure that the feature works as intended and is not
> > > inadvertently
> > > > > > > broken by somebody adding a new operator. The failure scenario
> > > here is
> > > > > > > adding a new operator that does not support FP16 and so
should
> be
> > > cast
> > > > > to
> > > > > > > FP32, but AMP does not know about its existence and so
does
> not do
> > > the
> > > > > > > casting. The solution proposed in the review was to implicitly
> > > treat
> > > > > all of
> > > > > > > the unknown operators as FP32-only and keep the list of
> operators
> > > that
> > > > > work
> > > > > > > fine in both FP16 and FP32. This solution however does
not
> really
> > > work,
> > > > > > > because there are multiple operators (most notably optimizers)
> > > where
> > > > > > > introducing additional casting of the input to FP32 would
> break the
> > > > > > > operator.
> > > > > > > >
> > > > > > > > That is why after discussion with a few members of
the
> > > community, I
> > > > > > > decided to proceed with all lists being explicit and
> introducing
> > > the
> > > > > test
> > > > > > > that would fail when somebody added an operator without
> > > classifying it
> > > > > into
> > > > > > > 1 of the categories, and explain clearly how to do it [7].
It
> is
> > > not
> > > > > ideal
> > > > > > > solution, as it introduces some burden on the developers
who
> are
> > > not
> > > > > aware
> > > > > > > about AMP, however in the typical case of adding at most
a few
> > > > > operators to
> > > > > > > MXNet the inconvenience is I think pretty minor while important
> > > for the
> > > > > > > feature correctness going forward.
> > > > > > > >
> > > > > > > > I would like to gather Community feedback and ideas
how to
> handle
> > > > > this
> > > > > > > situation.
> > > > > > > >
> > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/14173
> > > > > > > > [2]
> > > > > > >
> > > > >
> > >
> https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/amp/lists/symbol.py
> > > > > > > > [3]
> > > > > > >
> > > > >
> > >
> https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py
> > > > > > > > [4] https://github.com/apache/incubator-mxnet/pull/15085
> > > > > > > > [5]
> > > > > > >
> > > > >
> > >
> https://docs.google.com/document/d/1sQzMoPEwux0WXSWirY07us1POD_6Y8pLYq--b9Fvd1o/edit?usp=sharing
> > > > > > > > [6]
> > > > > > >
> > > > >
> > >
> https://github.com/apache/incubator-mxnet/pull/14173#discussion_r270728019
> > > > > > > > [7]
> > > > > > >
> > > > >
> > >
> https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py#L62-L80
> > > > > > > > [8]
> > > > > > >
> > > > >
> > >
> https://github.com/apache/incubator-mxnet/pull/14173#pullrequestreview-235846341
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message