mxnet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sheng Zha <>
Subject Re: Making new operators and AMP lists
Date Tue, 28 May 2019 22:21:57 GMT
Thanks for initiating the discussion. 

The premise for adding the test was to make sure that AMP feature is "not broken", but that's
IMO not the right view. AMP is not supposed to support a new operator it hasn't seen before
in the first place. There's no way for it to know whether the fp32 cast should happen or not.
So AMP feature cannot provide the guarantee that it works for all future operators. Thus,
adding new operators to AMP list should be considered new feature instead of fixing existing

The AMP test that breaks upon the addition of new operator is thus equivalent to forcing developers
of the new operator to add the new support for AMP. This feels wrong. Especially given that
AMP is an experimental feature in contrib namespace (i.e. no semver guarantee), this practice
should be stopped immediately. We cannot force new developers to invest into experimental
feature this way.

I'd suggest the following changes:
- for new operators that aren't registered in AMP, cast to float32 by default and print one-time
warning. People using AMP who want to avoid casting can register it in the AMP's list.
- change the test to print warning about the operators that are not listed so that it's easy
to track the problem.


On 2019/05/28 21:32:42, Przemys��aw Tr��dak <> wrote: 
> Dear Community,
> One of the recently merged features of the 1.5 release, AMP (Automatic Mixed Precision)
support (PR [1], design doc [5]), introduced a requirement that every new operator added to
MXNet would need to be present in 1 of the lists (in [2]). To make sure that this requirement
is not broken when somebody adds a new operator and does not know about AMP's existence, a
test was added to CI ([3]).
> A few people reached out to me (the original author of the feature) saying this test
increases a burden on a developer of new operators and should not be an actual error, but
just warning (PR for that change [4]). That is why I would like to present a motivation for
it and discuss with the wider audience why I feel it was necessary.
> First, for people who do not know the details of what AMP is - it is a solution that
tries to automatically apply best practices of training in lower precision (FP16) to user's
FP32 model in order to fully utilize capabilities of modern GPUs (and potentially other hardware
in the future). It does so by casting to lower precision inputs to operators benefitting from
it, while casting to full precision inputs of operators that are unsafe to run in lower precision
or just do not support it.
> The first iteration of AMP kept 2 main lists of operators - operators that are beneficial
and safe to do in fp16 and operators that need to be cast to FP32. The problem (raised in
review of the PR [6], [8]) is how to make sure that the feature works as intended and is not
inadvertently broken by somebody adding a new operator. The failure scenario here is adding
a new operator that does not support FP16 and so should be cast to FP32, but AMP does not
know about its existence and so does not do the casting. The solution proposed in the review
was to implicitly treat all of the unknown operators as FP32-only and keep the list of operators
that work fine in both FP16 and FP32. This solution however does not really work, because
there are multiple operators (most notably optimizers) where introducing additional casting
of the input to FP32 would break the operator.
> That is why after discussion with a few members of the community, I decided to proceed
with all lists being explicit and introducing the test that would fail when somebody added
an operator without classifying it into 1 of the categories, and explain clearly how to do
it [7]. It is not ideal solution, as it introduces some burden on the developers who are not
aware about AMP, however in the typical case of adding at most a few operators to MXNet the
inconvenience is I think pretty minor while important for the feature correctness going forward.
> I would like to gather Community feedback and ideas how to handle this situation.
> [1]
> [2]
> [3]
> [4]
> [5]
> [6]
> [7]
> [8]

View raw message