mxnet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marco de Abreu <marco.g.ab...@gmail.com>
Subject Re: Making new operators and AMP lists
Date Tue, 28 May 2019 23:27:26 GMT
Hey Jun,

could we please quantify the amount of time and effort that is required to
follow the actions to add an operator to the FP32_FUNCS? To me it sounds
like we are making this a bigger deal than it actually is.

-Marco

On Wed, May 29, 2019 at 1:20 AM Jun Wu <wujun.nju@gmail.com> wrote:

> Thanks for initiating the discussion on dev.
>
> I understand the dilemma from designing AMP for making the feature usable
> and functional as well as for not breaking other developer experience.
> However, IMO, this is not about WHEN we should let other developers know
> they have made a mistake by not adding newly developed ops to the AMP list,
> but rather about whether we should do it in this way and thinking about its
> long-term impact on development.
>
> Asking other developers to delve into the feature they are not necessarily
> aware of has imposed a too strong burden and will hinder others'
> development work. FP16 is a feature nice-to-have in model training, but has
> not become a standard so that every developer should be aware of. As we are
> growing the community by encouraging more junior developers to contribute,
> such error not caused by their work would actually hold them back from
> moving forward.
>
> IMO, AMP is not supposed cover the operators it has not seen and should
> treat them as those in FP32_FUNCS to be safe. We should come up with a
> maintenance plan of promoting newly added operators that can benefit from
> FP16 computing into the appropriate lists, instead of shifting the burden
> away to many other developers at the moment.
>
> On Tue, May 28, 2019 at 4:13 PM Sheng Zha <zhasheng@apache.org> wrote:
>
> > The support for AMP should not be a burden of authors of new operators.
> > The lint analogy doesn't apply because lint is for established and
> accepted
> > coding standard at MXNet and AMP is not. AMP is an experimental feature
> > right now and it doesn't make sense to require contributors to invest in
> > it. Keeping this as error will inevitably cause comments like "CI test
> > failure seems unrelated to my change, please proceed and merge".
> >
> > -sz
> >
> > On 2019/05/28 22:51:30, Marco de Abreu <marco.g.abreu@gmail.com> wrote:
> > > Hi,
> > >
> > > I'm generally in favour of these kind of tests since they make
> developers
> > > aware of changes they have to make which they would usually not be
> aware
> > > of. We have a similar test for tutorials, for example. Whenever
> somebody
> > > adds a tutorial, there's a validation that assures that all contraints
> in
> > > our testing environment are met and that they are properly tied into
> the
> > > system. This AMP test fits into the same category in my opinion and we
> > > never heard bad feedback about these kind of checks.
> > >
> > > What seems to be bothering people is the fact that the feedback time is
> > too
> > > high. Thus, I'd like to propose to move the test into the sanity-test
> > stage
> > > instead of doing it as part of the unit tests which take quite a bit of
> > > time until they're actually executed. The sanity checks run immediately
> > and
> > > give a response within about 1 minute.
> > >
> > > While I understand that this might increase the amount of work a
> > developer
> > > has to do if they develop a new operator, I think that this is the
> right
> > > thing to do. Developers won't know of every single feature other people
> > > worked on and thus might simply miss adding the support for it. This
> kind
> > > of test on the other hand makes them aware of it. If they'd like to opt
> > > out, it's one single line they would have to change and then they're
> > > totally fine. On the other hand, this might motivate them to add the
> > > support since the kernel would be the last piece and everything else
> > would
> > > already be implemented.
> > >
> > > Considering how often a PR gets declined because of linting errors, I'd
> > say
> > > that these kind of errors are WAY more frequent that AMP telling
> somebody
> > > to add their operator to a list. Considering that this would only have
> to
> > > be done once per operator, that's work of about one minute. Add that to
> > the
> > > waiting time of the sanity check and you're left with about five
> "wasted"
> > > minutes.
> > >
> > > I'm opposed towards adding a warning or treating them as float32 by
> > default
> > > since the operator author wouldn't notice. What will happen is that
> > people
> > > won't know about AMP and simply forget about low precision in general
> > until
> > > they're actively reminded. This check will remind them actively and
> thus
> > > bring more attention to the feature. I know that the feature is still
> > > experimental, but we have just started with the 1.6 branch and thus
> > there's
> > > enough time to make the experimental features production ready. Adding
> > this
> > > test early on will allow others to add the support for AMP during the
> > early
> > > stage of the 1.6 branch instead of asking them in the last few weeks
> > before
> > > a release. The result would only be that stuff is rushed or forgotten.
> > >
> > > To sum it up: I think this test is good and it should be kept as error,
> > but
> > > it should be moved to sanity checks.
> > >
> > > -Marco
> > >
> > > On Wed, May 29, 2019 at 12:21 AM Sheng Zha <zhasheng@apache.org>
> wrote:
> > >
> > > > Thanks for initiating the discussion.
> > > >
> > > > The premise for adding the test was to make sure that AMP feature is
> > "not
> > > > broken", but that's IMO not the right view. AMP is not supposed to
> > support
> > > > a new operator it hasn't seen before in the first place. There's no
> > way for
> > > > it to know whether the fp32 cast should happen or not. So AMP feature
> > > > cannot provide the guarantee that it works for all future operators.
> > Thus,
> > > > adding new operators to AMP list should be considered new feature
> > instead
> > > > of fixing existing feature.
> > > >
> > > > The AMP test that breaks upon the addition of new operator is thus
> > > > equivalent to forcing developers of the new operator to add the new
> > support
> > > > for AMP. This feels wrong. Especially given that AMP is an
> experimental
> > > > feature in contrib namespace (i.e. no semver guarantee), this
> practice
> > > > should be stopped immediately. We cannot force new developers to
> invest
> > > > into experimental feature this way.
> > > >
> > > > I'd suggest the following changes:
> > > > - for new operators that aren't registered in AMP, cast to float32 by
> > > > default and print one-time warning. People using AMP who want to
> avoid
> > > > casting can register it in the AMP's list.
> > > > - change the test to print warning about the operators that are not
> > listed
> > > > so that it's easy to track the problem.
> > > >
> > > > -sz
> > > >
> > > > On 2019/05/28 21:32:42, Przemys��aw Tr��dak <ptrendx@apache.org>
> > wrote:
> > > > > Dear Community,
> > > > >
> > > > > One of the recently merged features of the 1.5 release, AMP
> > (Automatic
> > > > Mixed Precision) support (PR [1], design doc [5]), introduced a
> > requirement
> > > > that every new operator added to MXNet would need to be present in 1
> > of the
> > > > lists (in [2]). To make sure that this requirement is not broken when
> > > > somebody adds a new operator and does not know about AMP's
> existence, a
> > > > test was added to CI ([3]).
> > > > >
> > > > > A few people reached out to me (the original author of the feature)
> > > > saying this test increases a burden on a developer of new operators
> and
> > > > should not be an actual error, but just warning (PR for that change
> > [4]).
> > > > That is why I would like to present a motivation for it and discuss
> > with
> > > > the wider audience why I feel it was necessary.
> > > > >
> > > > > First, for people who do not know the details of what AMP is - it
> is
> > a
> > > > solution that tries to automatically apply best practices of training
> > in
> > > > lower precision (FP16) to user's FP32 model in order to fully utilize
> > > > capabilities of modern GPUs (and potentially other hardware in the
> > future).
> > > > It does so by casting to lower precision inputs to operators
> > benefitting
> > > > from it, while casting to full precision inputs of operators that are
> > > > unsafe to run in lower precision or just do not support it.
> > > > >
> > > > > The first iteration of AMP kept 2 main lists of operators -
> operators
> > > > that are beneficial and safe to do in fp16 and operators that need to
> > be
> > > > cast to FP32. The problem (raised in review of the PR [6], [8]) is
> how
> > to
> > > > make sure that the feature works as intended and is not inadvertently
> > > > broken by somebody adding a new operator. The failure scenario here
> is
> > > > adding a new operator that does not support FP16 and so should be
> cast
> > to
> > > > FP32, but AMP does not know about its existence and so does not do
> the
> > > > casting. The solution proposed in the review was to implicitly treat
> > all of
> > > > the unknown operators as FP32-only and keep the list of operators
> that
> > work
> > > > fine in both FP16 and FP32. This solution however does not really
> work,
> > > > because there are multiple operators (most notably optimizers) where
> > > > introducing additional casting of the input to FP32 would break the
> > > > operator.
> > > > >
> > > > > That is why after discussion with a few members of the community,
I
> > > > decided to proceed with all lists being explicit and introducing the
> > test
> > > > that would fail when somebody added an operator without classifying
> it
> > into
> > > > 1 of the categories, and explain clearly how to do it [7]. It is not
> > ideal
> > > > solution, as it introduces some burden on the developers who are not
> > aware
> > > > about AMP, however in the typical case of adding at most a few
> > operators to
> > > > MXNet the inconvenience is I think pretty minor while important for
> the
> > > > feature correctness going forward.
> > > > >
> > > > > I would like to gather Community feedback and ideas how to handle
> > this
> > > > situation.
> > > > >
> > > > > [1] https://github.com/apache/incubator-mxnet/pull/14173
> > > > > [2]
> > > >
> >
> https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/amp/lists/symbol.py
> > > > > [3]
> > > >
> >
> https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py
> > > > > [4] https://github.com/apache/incubator-mxnet/pull/15085
> > > > > [5]
> > > >
> >
> https://docs.google.com/document/d/1sQzMoPEwux0WXSWirY07us1POD_6Y8pLYq--b9Fvd1o/edit?usp=sharing
> > > > > [6]
> > > >
> >
> https://github.com/apache/incubator-mxnet/pull/14173#discussion_r270728019
> > > > > [7]
> > > >
> >
> https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py#L62-L80
> > > > > [8]
> > > >
> >
> https://github.com/apache/incubator-mxnet/pull/14173#pullrequestreview-235846341
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message