mxnet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sheng Zha <zhash...@apache.org>
Subject Re: Making new operators and AMP lists
Date Tue, 28 May 2019 23:48:49 GMT
Please don't be dismissive. For a new contributor, every time we impose a requirement to the
developer, a new task would pop up out of nowhere. The more we do that, the more costly it
becomes for new contributors to have their contributions accepted, and hence the less likely
they will stick around. Let's avoid these unnecessary requirements from hindering the development
process that could have been a lot more smooth.

-sz

On 2019/05/28 23:27:26, Marco de Abreu <marco.g.abreu@gmail.com> wrote: 
> Hey Jun,
> 
> could we please quantify the amount of time and effort that is required to
> follow the actions to add an operator to the FP32_FUNCS? To me it sounds
> like we are making this a bigger deal than it actually is.
> 
> -Marco
> 
> On Wed, May 29, 2019 at 1:20 AM Jun Wu <wujun.nju@gmail.com> wrote:
> 
> > Thanks for initiating the discussion on dev.
> >
> > I understand the dilemma from designing AMP for making the feature usable
> > and functional as well as for not breaking other developer experience.
> > However, IMO, this is not about WHEN we should let other developers know
> > they have made a mistake by not adding newly developed ops to the AMP list,
> > but rather about whether we should do it in this way and thinking about its
> > long-term impact on development.
> >
> > Asking other developers to delve into the feature they are not necessarily
> > aware of has imposed a too strong burden and will hinder others'
> > development work. FP16 is a feature nice-to-have in model training, but has
> > not become a standard so that every developer should be aware of. As we are
> > growing the community by encouraging more junior developers to contribute,
> > such error not caused by their work would actually hold them back from
> > moving forward.
> >
> > IMO, AMP is not supposed cover the operators it has not seen and should
> > treat them as those in FP32_FUNCS to be safe. We should come up with a
> > maintenance plan of promoting newly added operators that can benefit from
> > FP16 computing into the appropriate lists, instead of shifting the burden
> > away to many other developers at the moment.
> >
> > On Tue, May 28, 2019 at 4:13 PM Sheng Zha <zhasheng@apache.org> wrote:
> >
> > > The support for AMP should not be a burden of authors of new operators.
> > > The lint analogy doesn't apply because lint is for established and
> > accepted
> > > coding standard at MXNet and AMP is not. AMP is an experimental feature
> > > right now and it doesn't make sense to require contributors to invest in
> > > it. Keeping this as error will inevitably cause comments like "CI test
> > > failure seems unrelated to my change, please proceed and merge".
> > >
> > > -sz
> > >
> > > On 2019/05/28 22:51:30, Marco de Abreu <marco.g.abreu@gmail.com> wrote:
> > > > Hi,
> > > >
> > > > I'm generally in favour of these kind of tests since they make
> > developers
> > > > aware of changes they have to make which they would usually not be
> > aware
> > > > of. We have a similar test for tutorials, for example. Whenever
> > somebody
> > > > adds a tutorial, there's a validation that assures that all contraints
> > in
> > > > our testing environment are met and that they are properly tied into
> > the
> > > > system. This AMP test fits into the same category in my opinion and we
> > > > never heard bad feedback about these kind of checks.
> > > >
> > > > What seems to be bothering people is the fact that the feedback time is
> > > too
> > > > high. Thus, I'd like to propose to move the test into the sanity-test
> > > stage
> > > > instead of doing it as part of the unit tests which take quite a bit of
> > > > time until they're actually executed. The sanity checks run immediately
> > > and
> > > > give a response within about 1 minute.
> > > >
> > > > While I understand that this might increase the amount of work a
> > > developer
> > > > has to do if they develop a new operator, I think that this is the
> > right
> > > > thing to do. Developers won't know of every single feature other people
> > > > worked on and thus might simply miss adding the support for it. This
> > kind
> > > > of test on the other hand makes them aware of it. If they'd like to opt
> > > > out, it's one single line they would have to change and then they're
> > > > totally fine. On the other hand, this might motivate them to add the
> > > > support since the kernel would be the last piece and everything else
> > > would
> > > > already be implemented.
> > > >
> > > > Considering how often a PR gets declined because of linting errors, I'd
> > > say
> > > > that these kind of errors are WAY more frequent that AMP telling
> > somebody
> > > > to add their operator to a list. Considering that this would only have
> > to
> > > > be done once per operator, that's work of about one minute. Add that to
> > > the
> > > > waiting time of the sanity check and you're left with about five
> > "wasted"
> > > > minutes.
> > > >
> > > > I'm opposed towards adding a warning or treating them as float32 by
> > > default
> > > > since the operator author wouldn't notice. What will happen is that
> > > people
> > > > won't know about AMP and simply forget about low precision in general
> > > until
> > > > they're actively reminded. This check will remind them actively and
> > thus
> > > > bring more attention to the feature. I know that the feature is still
> > > > experimental, but we have just started with the 1.6 branch and thus
> > > there's
> > > > enough time to make the experimental features production ready. Adding
> > > this
> > > > test early on will allow others to add the support for AMP during the
> > > early
> > > > stage of the 1.6 branch instead of asking them in the last few weeks
> > > before
> > > > a release. The result would only be that stuff is rushed or forgotten.
> > > >
> > > > To sum it up: I think this test is good and it should be kept as error,
> > > but
> > > > it should be moved to sanity checks.
> > > >
> > > > -Marco
> > > >
> > > > On Wed, May 29, 2019 at 12:21 AM Sheng Zha <zhasheng@apache.org>
> > wrote:
> > > >
> > > > > Thanks for initiating the discussion.
> > > > >
> > > > > The premise for adding the test was to make sure that AMP feature
is
> > > "not
> > > > > broken", but that's IMO not the right view. AMP is not supposed to
> > > support
> > > > > a new operator it hasn't seen before in the first place. There's
no
> > > way for
> > > > > it to know whether the fp32 cast should happen or not. So AMP feature
> > > > > cannot provide the guarantee that it works for all future operators.
> > > Thus,
> > > > > adding new operators to AMP list should be considered new feature
> > > instead
> > > > > of fixing existing feature.
> > > > >
> > > > > The AMP test that breaks upon the addition of new operator is thus
> > > > > equivalent to forcing developers of the new operator to add the new
> > > support
> > > > > for AMP. This feels wrong. Especially given that AMP is an
> > experimental
> > > > > feature in contrib namespace (i.e. no semver guarantee), this
> > practice
> > > > > should be stopped immediately. We cannot force new developers to
> > invest
> > > > > into experimental feature this way.
> > > > >
> > > > > I'd suggest the following changes:
> > > > > - for new operators that aren't registered in AMP, cast to float32
by
> > > > > default and print one-time warning. People using AMP who want to
> > avoid
> > > > > casting can register it in the AMP's list.
> > > > > - change the test to print warning about the operators that are not
> > > listed
> > > > > so that it's easy to track the problem.
> > > > >
> > > > > -sz
> > > > >
> > > > > On 2019/05/28 21:32:42, Przemys��aw Tr��dak <ptrendx@apache.org>
> > > wrote:
> > > > > > Dear Community,
> > > > > >
> > > > > > One of the recently merged features of the 1.5 release, AMP
> > > (Automatic
> > > > > Mixed Precision) support (PR [1], design doc [5]), introduced a
> > > requirement
> > > > > that every new operator added to MXNet would need to be present in
1
> > > of the
> > > > > lists (in [2]). To make sure that this requirement is not broken
when
> > > > > somebody adds a new operator and does not know about AMP's
> > existence, a
> > > > > test was added to CI ([3]).
> > > > > >
> > > > > > A few people reached out to me (the original author of the feature)
> > > > > saying this test increases a burden on a developer of new operators
> > and
> > > > > should not be an actual error, but just warning (PR for that change
> > > [4]).
> > > > > That is why I would like to present a motivation for it and discuss
> > > with
> > > > > the wider audience why I feel it was necessary.
> > > > > >
> > > > > > First, for people who do not know the details of what AMP is
- it
> > is
> > > a
> > > > > solution that tries to automatically apply best practices of training
> > > in
> > > > > lower precision (FP16) to user's FP32 model in order to fully utilize
> > > > > capabilities of modern GPUs (and potentially other hardware in the
> > > future).
> > > > > It does so by casting to lower precision inputs to operators
> > > benefitting
> > > > > from it, while casting to full precision inputs of operators that
are
> > > > > unsafe to run in lower precision or just do not support it.
> > > > > >
> > > > > > The first iteration of AMP kept 2 main lists of operators -
> > operators
> > > > > that are beneficial and safe to do in fp16 and operators that need
to
> > > be
> > > > > cast to FP32. The problem (raised in review of the PR [6], [8]) is
> > how
> > > to
> > > > > make sure that the feature works as intended and is not inadvertently
> > > > > broken by somebody adding a new operator. The failure scenario here
> > is
> > > > > adding a new operator that does not support FP16 and so should be
> > cast
> > > to
> > > > > FP32, but AMP does not know about its existence and so does not do
> > the
> > > > > casting. The solution proposed in the review was to implicitly treat
> > > all of
> > > > > the unknown operators as FP32-only and keep the list of operators
> > that
> > > work
> > > > > fine in both FP16 and FP32. This solution however does not really
> > work,
> > > > > because there are multiple operators (most notably optimizers) where
> > > > > introducing additional casting of the input to FP32 would break the
> > > > > operator.
> > > > > >
> > > > > > That is why after discussion with a few members of the community,
I
> > > > > decided to proceed with all lists being explicit and introducing
the
> > > test
> > > > > that would fail when somebody added an operator without classifying
> > it
> > > into
> > > > > 1 of the categories, and explain clearly how to do it [7]. It is
not
> > > ideal
> > > > > solution, as it introduces some burden on the developers who are
not
> > > aware
> > > > > about AMP, however in the typical case of adding at most a few
> > > operators to
> > > > > MXNet the inconvenience is I think pretty minor while important for
> > the
> > > > > feature correctness going forward.
> > > > > >
> > > > > > I would like to gather Community feedback and ideas how to handle
> > > this
> > > > > situation.
> > > > > >
> > > > > > [1] https://github.com/apache/incubator-mxnet/pull/14173
> > > > > > [2]
> > > > >
> > >
> > https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/amp/lists/symbol.py
> > > > > > [3]
> > > > >
> > >
> > https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py
> > > > > > [4] https://github.com/apache/incubator-mxnet/pull/15085
> > > > > > [5]
> > > > >
> > >
> > https://docs.google.com/document/d/1sQzMoPEwux0WXSWirY07us1POD_6Y8pLYq--b9Fvd1o/edit?usp=sharing
> > > > > > [6]
> > > > >
> > >
> > https://github.com/apache/incubator-mxnet/pull/14173#discussion_r270728019
> > > > > > [7]
> > > > >
> > >
> > https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py#L62-L80
> > > > > > [8]
> > > > >
> > >
> > https://github.com/apache/incubator-mxnet/pull/14173#pullrequestreview-235846341
> > > > > >
> > > > >
> > > >
> > >
> >
> 

Mime
View raw message