mxnet-dev mailing list archives

From Anirudh Subramanian <anirudh2...@gmail.com>
Subject Re: Making new operators and AMP lists
Date Tue, 28 May 2019 22:20:40 GMT
Hi all,

I had a discussion with Przemyslaw about this offline. There are two options
we can pursue to improve the developer experience (since currently developers
have to wait for CI to complete):

1. Obtain the current lists and check that the length of the combined lists
matches MXListAllOpNames, which gets triggered during "import mxnet". This
gives the developer much earlier feedback, instead of compiling, testing,
pushing code, and waiting for it to fail in CI.
2. Option 1 still has the problem that developers have to add the op name
to the lists when adding a new alias. This pain point can be reduced by
adding an additional attr for an operator (also suggested by Haibin earlier)
which will tell whether to force a cast to FP32, FP16, or no cast. This way
adding a new alias won't place an additional burden on the developer, and an
attr that is not set can still be caught early at the statement "import mxnet".

Thoughts?

Anirudh

On Tue, May 28, 2019 at 2:32 PM Przemysław Trędak <ptrendx@apache.org>
wrote:

> Dear Community,
>
> One of the recently merged features of the 1.5 release, AMP (Automatic
> Mixed Precision) support (PR [1], design doc [5]), introduced a requirement
> that every new operator added to MXNet would need to be present in 1 of the
> lists (in [2]). To make sure that this requirement is not broken when
> somebody adds a new operator and does not know about AMP's existence, a
> test was added to CI ([3]).
>
> A few people reached out to me (the original author of the feature) saying
> this test increases the burden on developers of new operators and should
> not be an actual error, but just a warning (PR for that change [4]). That
> is why I would like to present the motivation for it and discuss with the
> wider audience why I feel it was necessary.
>
> First, for people who do not know the details of what AMP is - it is a
> solution that tries to automatically apply best practices of training in
> lower precision (FP16) to a user's FP32 model in order to fully utilize
> the capabilities of modern GPUs (and potentially other hardware in the
> future). It does so by casting to lower precision the inputs of operators
> that benefit from it, while casting to full precision the inputs of
> operators that are unsafe to run in lower precision or simply do not
> support it.
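As a rough illustration of that casting behaviour (the list names and the
helper below are hypothetical stand-ins, not the actual MXNet
implementation):

```python
# Hypothetical sketch of how AMP decides which cast to insert in
# front of an operator's inputs.
FP16_FUNCS = {"Convolution", "FullyConnected"}  # benefit from FP16
FP32_FUNCS = {"softmax", "BatchNorm"}           # unsafe/unsupported in FP16

def amp_cast_dtype(op_name, input_dtype):
    """Return the dtype an input should be cast to before op_name runs."""
    if op_name in FP16_FUNCS:
        return "float16"   # cast down: fast and known-safe in FP16
    if op_name in FP32_FUNCS:
        return "float32"   # cast up: precision-sensitive operator
    return input_dtype     # otherwise leave the input untouched

print(amp_cast_dtype("Convolution", "float32"))   # float16
print(amp_cast_dtype("softmax", "float16"))       # float32
print(amp_cast_dtype("elemwise_add", "float16"))  # float16
```

The failure mode discussed below is exactly the last branch: an operator
unknown to both lists is passed through unchanged, even if it actually
needed the FP32 cast.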
>
> The first iteration of AMP kept 2 main lists of operators - operators that
> are beneficial and safe to run in FP16 and operators that need to be cast
> to FP32. The problem (raised in review of the PR [6], [8]) is how to make sure
> that the feature works as intended and is not inadvertently broken by
> somebody adding a new operator. The failure scenario here is adding a new
> operator that does not support FP16 and so should be cast to FP32, but AMP
> does not know about its existence and so does not do the casting. The
> solution proposed in the review was to implicitly treat all of the unknown
> operators as FP32-only and keep the list of operators that work fine in
> both FP16 and FP32. This solution however does not really work, because
> there are multiple operators (most notably optimizers) where introducing
> additional casting of the input to FP32 would break the operator.
>
> That is why, after discussion with a few members of the community, I
> decided to proceed with all lists being explicit and introducing the test
> that fails when somebody adds an operator without classifying it into
> 1 of the categories, and explains clearly how to do so [7]. It is not an
> ideal solution, as it introduces some burden on developers who are not
> aware of AMP; however, in the typical case of adding at most a few
> operators to MXNet, the inconvenience is I think pretty minor, while being
> important for the feature's correctness going forward.
>
> I would like to gather Community feedback and ideas how to handle this
> situation.
>
> [1] https://github.com/apache/incubator-mxnet/pull/14173
> [2]
> https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/amp/lists/symbol.py
> [3]
> https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py
> [4] https://github.com/apache/incubator-mxnet/pull/15085
> [5]
> https://docs.google.com/document/d/1sQzMoPEwux0WXSWirY07us1POD_6Y8pLYq--b9Fvd1o/edit?usp=sharing
> [6]
> https://github.com/apache/incubator-mxnet/pull/14173#discussion_r270728019
> [7]
> https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_amp.py#L62-L80
> [8]
> https://github.com/apache/incubator-mxnet/pull/14173#pullrequestreview-235846341
>
