From dev-return-6154-archive-asf-public=cust-asf.ponee.io@mxnet.incubator.apache.org Tue May 28 22:58:50 2019 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with SMTP id 8393318060F for ; Wed, 29 May 2019 00:58:50 +0200 (CEST) Received: (qmail 44049 invoked by uid 500); 28 May 2019 22:58:49 -0000 Mailing-List: contact dev-help@mxnet.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@mxnet.incubator.apache.org Delivered-To: mailing list dev@mxnet.incubator.apache.org Received: (qmail 44037 invoked by uid 99); 28 May 2019 22:58:49 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 28 May 2019 22:58:49 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id D1B5F180AB9 for ; Tue, 28 May 2019 22:58:48 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.051 X-Spam-Level: ** X-Spam-Status: No, score=2.051 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_ENVFROM_END_DIGIT=0.25, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd3-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id Jr4ZxEBIAe2D for ; Tue, 28 May 2019 22:58:45 +0000 (UTC) Received: from mail-it1-f182.google.com (mail-it1-f182.google.com [209.85.166.182]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id F1C2960D94 for ; Tue, 28 May 2019 22:20:53 +0000 (UTC) Received: by mail-it1-f182.google.com with SMTP id s16so400121ita.2 for ; Tue, 28 May 2019 15:20:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=V89CiFFOmongvxqcHYP2PWqa16RLLBikgB911eIyE+Q=; b=OrHzhS0JrYr9FupJQGwxTQ3y+sJA1gooIFAl7NeNrzy/IyQacNmoo3x9QEzXP70Kse HVb7MWhgds2IbLBv5jV438QnV1c8OQI1SHhsI2sLycGd9WKRyAVLCqWGjVR5z7JW/3+u 5cWnaletC6XS/yhHIPfR500bBa4+8aoyYeZIrPKQ2tnqzfi64c3OO0A6uBQ0WSLAFpB/ xpodH0p40bkFVjYtVDJUhyPPskTtBD8n0oFhq0YtZ1wzpCQnFaT9Bfm4Z2EO58HB1kV5 R/0XS5XlrCkZC2iTRIAUMQasGaoV4qEZrMqKtw9v+XVoiR4Q2DsBrp4sa9uWWfcBJQdQ FpaQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=V89CiFFOmongvxqcHYP2PWqa16RLLBikgB911eIyE+Q=; b=BLwYuDOBS3B93vpeG2OpBazgnPJkTndNLpfxpx8hkWOFGKsXcGeQxC28AAFXeq9t8G SCFFcL9yRHS7/DPR/IhBfaXhA9fOGSvTpuMosbK30PiurUNyXw6xN+z1lbQT3KRdK0QF Bt98hDRZxjyTw3gT7HjJpzaO1WVxYzYjagSjzSsVIKaESBQKO1TPF23htsUl9eF+BwjG NbE43Ngw0HaApVuWy7ESwMHs91reo6OI+qzw005FkcFexdf8MIPulSOK5GT+nBe+TVMI xW6qPPH5rxHgD1ufZOnLcc41HsxrfzBShW0sjmVo55TtUAnY91vjja47bKcOZ5Y7u0ue zXcg== X-Gm-Message-State: APjAAAU9Efh1aX1jB2clLLVtq/JAPotzkFmLTE1FWnFAODcITy6yYcck aZDjWoKgOFPM2cweDCgqPJ6glQBKZFGknrv8Jqo= X-Google-Smtp-Source: APXvYqySHl022hZupXdK37ppcIdgypVLSPGrkw4sv8VHxHo1g+tfyYu+mCXNsjfK2btyo2Z3/A8N2GxYmNmekNrXxDQ= X-Received: by 2002:a24:7610:: with SMTP id z16mr4988586itb.3.1559082052382; Tue, 28 May 2019 15:20:52 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Anirudh Subramanian Date: Tue, 28 May 2019 15:20:40 -0700 Message-ID: Subject: Re: Making new operators and AMP lists To: dev@mxnet.incubator.apache.org Cc: dev@mxnet.apache.org Content-Type: multipart/alternative; boundary="000000000000e878b60589fa13f1" --000000000000e878b60589fa13f1 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi all, I had discussion with Przemyslaw about this offline. There are two options we can pursue to make developer experience better ( Since currently they have to wait for CI to complete): 1. Obtain the current lists and check if the length of the combined lists is same as MXListAllOpNames which gets trigger during "import mxnet". This provides much early feedback to the developer instead of compiling, testing and pushing code and waiting to be failed in the CI. 2. Option 1 still has the problem, that developer have to still add op name to the lists, when adding a new alias. This pain point can be reduced by adding additional attr for an operator (also suggested by Haibin earlier) which will tell whether to force cast to FP32, FP16 or no casts. This way adding a new alias wont cause additional burden on the developer and the attr if not set can still be caught early at the statement "import mxnet". Thoughts? Anirudh On Tue, May 28, 2019 at 2:32 PM Przemys=C5=82aw Tr=C4=99dak wrote: > Dear Community, > > One of the recently merged features of the 1.5 release, AMP (Automatic > Mixed Precision) support (PR [1], design doc [5]), introduced a requireme= nt > that every new operator added to MXNet would need to be present in 1 of t= he > lists (in [2]). To make sure that this requirement is not broken when > somebody adds a new operator and does not know about AMP's existence, a > test was added to CI ([3]). > > A few people reached out to me (the original author of the feature) sayin= g > this test increases a burden on a developer of new operators and should n= ot > be an actual error, but just warning (PR for that change [4]). That is wh= y > I would like to present a motivation for it and discuss with the wider > audience why I feel it was necessary. > > First, for people who do not know the details of what AMP is - it is a > solution that tries to automatically apply best practices of training in > lower precision (FP16) to user's FP32 model in order to fully utilize > capabilities of modern GPUs (and potentially other hardware in the future= ). > It does so by casting to lower precision inputs to operators benefitting > from it, while casting to full precision inputs of operators that are > unsafe to run in lower precision or just do not support it. > > The first iteration of AMP kept 2 main lists of operators - operators tha= t > are beneficial and safe to do in fp16 and operators that need to be cast = to > FP32. The problem (raised in review of the PR [6], [8]) is how to make su= re > that the feature works as intended and is not inadvertently broken by > somebody adding a new operator. The failure scenario here is adding a new > operator that does not support FP16 and so should be cast to FP32, but AM= P > does not know about its existence and so does not do the casting. The > solution proposed in the review was to implicitly treat all of the unknow= n > operators as FP32-only and keep the list of operators that work fine in > both FP16 and FP32. This solution however does not really work, because > there are multiple operators (most notably optimizers) where introducing > additional casting of the input to FP32 would break the operator. > > That is why after discussion with a few members of the community, I > decided to proceed with all lists being explicit and introducing the test > that would fail when somebody added an operator without classifying it in= to > 1 of the categories, and explain clearly how to do it [7]. It is not idea= l > solution, as it introduces some burden on the developers who are not awar= e > about AMP, however in the typical case of adding at most a few operators = to > MXNet the inconvenience is I think pretty minor while important for the > feature correctness going forward. > > I would like to gather Community feedback and ideas how to handle this > situation. > > [1] https://github.com/apache/incubator-mxnet/pull/14173 > [2] > https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contri= b/amp/lists/symbol.py > [3] > https://github.com/apache/incubator-mxnet/blob/master/tests/python/unitte= st/test_amp.py > [4] https://github.com/apache/incubator-mxnet/pull/15085 > [5] > https://docs.google.com/document/d/1sQzMoPEwux0WXSWirY07us1POD_6Y8pLYq--b= 9Fvd1o/edit?usp=3Dsharing > [6] > https://github.com/apache/incubator-mxnet/pull/14173#discussion_r27072801= 9 > [7] > https://github.com/apache/incubator-mxnet/blob/master/tests/python/unitte= st/test_amp.py#L62-L80 > [8] > https://github.com/apache/incubator-mxnet/pull/14173#pullrequestreview-23= 5846341 > --000000000000e878b60589fa13f1--