From dev-return-3737-archive-asf-public=cust-asf.ponee.io@mxnet.incubator.apache.org  Tue Jul 24 09:00:27 2018
Return-Path: <dev-return-3737-archive-asf-public=cust-asf.ponee.io@mxnet.incubator.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by mx-eu-01.ponee.io (Postfix) with SMTP id 4CE15180626
	for <archive-asf-public@cust-asf.ponee.io>; Tue, 24 Jul 2018 09:00:26 +0200 (CEST)
Received: (qmail 68694 invoked by uid 500); 24 Jul 2018 07:00:25 -0000
Mailing-List: contact dev-help@mxnet.incubator.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:dev-help@mxnet.incubator.apache.org>
List-Unsubscribe: <mailto:dev-unsubscribe@mxnet.incubator.apache.org>
List-Post: <mailto:dev@mxnet.incubator.apache.org>
List-Id: <dev.mxnet.incubator.apache.org>
Reply-To: dev@mxnet.incubator.apache.org
Delivered-To: mailing list dev@mxnet.incubator.apache.org
Received: (qmail 68682 invoked by uid 99); 24 Jul 2018 07:00:24 -0000
Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142)
    by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 24 Jul 2018 07:00:24 +0000
Received: from localhost (localhost [127.0.0.1])
	by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 14372C0367
	for <dev@mxnet.apache.org>; Tue, 24 Jul 2018 07:00:24 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org
X-Spam-Flag: NO
X-Spam-Score: 1.888
X-Spam-Level: *
X-Spam-Status: No, score=1.888 tagged_above=-999 required=6.31
	tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
	HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001,
	SPF_PASS=-0.001, T_DKIMWL_WL_MED=-0.01] autolearn=disabled
Authentication-Results: spamd4-us-west.apache.org (amavisd-new);
	dkim=pass (2048-bit key) header.d=gmail.com
Received: from mx1-lw-eu.apache.org ([10.40.0.8])
	by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024)
	with ESMTP id SkZZYCJNoKMY for <dev@mxnet.apache.org>;
	Tue, 24 Jul 2018 07:00:22 +0000 (UTC)
Received: from mail-oi0-f48.google.com (mail-oi0-f48.google.com [209.85.218.48])
	by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 938BA5F386
	for <dev@mxnet.incubator.apache.org>; Tue, 24 Jul 2018 07:00:21 +0000 (UTC)
Received: by mail-oi0-f48.google.com with SMTP id 13-v6so5706921ois.1
        for <dev@mxnet.incubator.apache.org>; Tue, 24 Jul 2018 00:00:21 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=mime-version:in-reply-to:references:from:date:message-id:subject:to;
        bh=I1VXRorTvRR1zjQr2LNmy7Dx5kYIUHtTxl0nTXZBCzA=;
        b=CjnbSL/0RX6rDIIGbQPG6VeLJ276qQ6VbIj/qev0YlsbAo3unsqDp+Xm1Grt0gH10p
         eObcRY6TPPnal3NrAet2Z2NCBBEp/MD/Zahu3+QOIIaoFwFLfoWpfQ06dTs08YurhFz8
         CADMkhLcLI7EOgWG2eBusY2B77syeQ4nm/pVP2YtCkC5y2wocvzjNdZRtwiTaUUDJObq
         Dbof9WK1D/CYYvR30kb1Cs0VRvNQ2tPUfTVXIKbCenDaTOq9jat/LtHC3hSC1eG6SE4C
         PGiwHF+fD+yYkCP2lJcrX2LvuvneXzhOb/RQkuubYPyloY9GD44Xs7XDPIgwEVpJCNwG
         uY4A==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:in-reply-to:references:from:date
         :message-id:subject:to;
        bh=I1VXRorTvRR1zjQr2LNmy7Dx5kYIUHtTxl0nTXZBCzA=;
        b=Bz0IyzpfgUAtACVwquWJ8TjKyKgUo5apQ571ZYjNGw9ceE+mroUnAzXo1Ry140T+gl
         2VaqpKO+A6r4FVn2O/VYJbv7kRrSC08SwA4LV1Zgnnpnf4QJ6QkAxq4XJjKT12xiGuLT
         9FxRrCnl8xtmRQ64opi00bOsbhZEq0AzS1xNvogA4G6Ur5yQQMuU3MbloeeJ7j9pSMog
         GCE8DT/fZWLcfJFFyQTZFBZsmLxN4IWhLuiP7GV034UdQDI/uhPE75Co6nLlr33xpCge
         +AMX+e/ewm/bjYdrYOhWxcIzN35J9GkKQF7YUGXko+FZwrhLjkcE9krWfbcgXPinemcl
         Pjag==
X-Gm-Message-State: AOUpUlEkWW/UFFrQ1lkRxrmTSqRIFV4bpdhABDzesEfs276BTd2M7/Av
	6qO8ApPuimF09jeMiBMU/u9FgUN6IohGbTmmYR3iwYh5
X-Google-Smtp-Source: AAOMgpdhec6bhe8+cbdvqZSg3Uvb1WDXFfV4TbyvmK7JgW9cVlfcVFbAtiz+emd6SY/8zKyIhBsukjt7m+0ngUfxumU=
X-Received: by 2002:aca:5cd4:: with SMTP id q203-v6mr2061315oib.217.1532415620082;
 Tue, 24 Jul 2018 00:00:20 -0700 (PDT)
MIME-Version: 1.0
Received: by 2002:a9d:3070:0:0:0:0:0 with HTTP; Tue, 24 Jul 2018 00:00:19
 -0700 (PDT)
In-Reply-To: <CAN_BLaTQzTHL3PZAh_QTHfK5cc9PUg0_5esC-XjEH_Xk_MyNDA@mail.gmail.com>
References: <87k1plbe1w.fsf@lausen.nl> <CAAy-JxA0bZLBN3L_xb5WqJAS1q-5vr_akC=uNvixwHT8e=ZuKQ@mail.gmail.com>
 <CAN_BLaTQzTHL3PZAh_QTHfK5cc9PUg0_5esC-XjEH_Xk_MyNDA@mail.gmail.com>
From: Haibin Lin <haibin.lin.aws@gmail.com>
Date: Tue, 24 Jul 2018 00:00:19 -0700
Message-ID: <CACeqGKY+J6v7hmQN1QCcF3U7z-e00JmWeEMfQj7LSiOJynCSfw@mail.gmail.com>
Subject: Re: Should MXNet 1.3 contain a buggy version of nn.Embedding backward
 by default?
To: dev@mxnet.incubator.apache.org
Content-Type: multipart/alternative; boundary="000000000000aeef160571b950c2"

--000000000000aeef160571b950c2
Content-Type: text/plain; charset="UTF-8"

Hi Hao,

Did you look at the AddTakeGrad kernel for sparse gradients?
https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/indexing_op.cu#L77
If I'm not mistaken, Leonard doesn't see NaN values generated by the
sparse gradient kernel. The sparse kernel shares a similar parallelization
strategy with the dense AddTakeGradLargeBatch, and it can easily be adapted
to replace the dense kernel by removing the kernel's "lookup_table"
argument.
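
For readers unfamiliar with the operation, here is a minimal sketch in
plain Python of what an embedding backward pass has to compute. It is not
MXNet's CUDA code. The function names and the sort-then-reduce layout are
assumptions chosen for illustration; this message does not spell out the
actual kernels' strategy. The point is that repeated indices must be
summed into the same gradient row, which is where a parallel kernel can go
wrong if two threads write to one row without coordination.

```python
def add_take_grad_reference(num_rows, indices, out_grad):
    """Sequential reference: grad_weight[indices[i]] += out_grad[i].

    Hypothetical helper for illustration, not an MXNet API.
    """
    dim = len(out_grad[0]) if out_grad else 0
    grad_weight = [[0.0] * dim for _ in range(num_rows)]
    for i, idx in enumerate(indices):
        for d in range(dim):
            grad_weight[idx][d] += out_grad[i][d]
    return grad_weight


def add_take_grad_sorted(num_rows, indices, out_grad):
    """Sort-then-reduce variant (an assumed strategy, for illustration).

    Lookups are grouped by index so that each weight row is written by
    exactly one group. On a GPU, each group could go to one worker, so no
    two workers would ever write to the same row.
    """
    dim = len(out_grad[0]) if out_grad else 0
    grad_weight = [[0.0] * dim for _ in range(num_rows)]
    # Positions of the lookups, ordered by the row they refer to.
    order = sorted(range(len(indices)), key=lambda i: indices[i])
    start = 0
    while start < len(order):
        row = indices[order[start]]
        end = start
        # Find the end of the run of lookups that hit the same row.
        while end < len(order) and indices[order[end]] == row:
            end += 1
        # Reduce the whole run into its single destination row.
        for pos in range(start, end):
            src = out_grad[order[pos]]
            for d in range(dim):
                grad_weight[row][d] += src[d]
        start = end
    return grad_weight


if __name__ == "__main__":
    idx = [2, 0, 2, 1, 2]  # row 2 is looked up three times
    og = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(add_take_grad_reference(4, idx, og))
    print(add_take_grad_sorted(4, idx, og))
```

Both functions return the same gradient. The second one only changes the
order of work so that writes to a given row never come from two places.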

Best,
Haibin


On Mon, Jul 23, 2018 at 11:45 PM, Hao Jin <hjjn.amzn@gmail.com> wrote:

> Hi all,
> Some preliminary benchmark results have been shared on the related PR. On
> a sample benchmark whose input is supposed to favor the LargeBatch
> version, we found no significant performance gain over either the new
> general backward kernel or the AddTakeGrad function. The LargeBatch
> version is also deemed buggy based on Leo's reproduction example given in
> the original issue. I propose that we delete the LargeBatch version and
> use the AddTakeGrad version by default. If there are no objections, we'll
> go ahead in that direction.
> Hao
>
> On Mon, Jul 23, 2018 at 9:12 PM, Naveen Swamy <mnnaveen@gmail.com> wrote:
>
> > If it is buggy, how does it matter if it is performant or not? I am not
> > seeing the rationale to make the correct version only opt-in.
> >
> >
> > On Mon, Jul 23, 2018 at 6:47 PM, Leonard Lausen <leonard-software@lausen.nl> wrote:
> >
> > > > Currently the default kernel of nn.Embedding backward is known to be
> > > > buggy on P3 instances or when using CUDA 9.2 (the issue also occurs on
> > > > other instances with earlier versions of CUDA, but less often).
> > >
> > > https://github.com/apache/incubator-mxnet/issues/11314
> > >
> > > > There is currently an opt-in for using a bug-free kernel, but it is not
> > > > the default. However, the bug-free kernel is used by default for shapes
> > > > smaller than 16384.
> > >
> > > > Should MXNet ship a more efficient but buggy kernel in v1.3 or use a
> > > > correct but less efficient kernel by default? As MXNet v1.3 is likely
> > > > to be used a lot with CUDA 9.2, I believe the default behavior should
> > > > be changed to use the bug-free but less efficient kernel. Correctness
> > > > and providing a good user experience should be No. 1 here (?). Then
> > > > users that want a faster but buggy backward kernel can still select
> > > > it. Note this only affects the backward pass.
> > >
> > > Hao did related work on improving the take operator
> > > https://github.com/apache/incubator-mxnet/pull/11326
> > > https://github.com/apache/incubator-mxnet/pull/11795 which also fixes
> > > the issue, but he found it to be only "slightly faster" compared to the
> > > bug-free kernel that is currently under opt-in while leading to CI
> > > failures on Windows.
> > >
> > > > In my experience, there is no speed difference between the current
> > > > buggy kernel and the opt-in bug-free kernel, but the GPU utilization
> > > > of the latter is 100% compared to 60% for the former (benchmark script:
> > > > https://github.com/apache/incubator-mxnet/pull/11795#issuecomment-405808567 )
> > >
> >
>

--000000000000aeef160571b950c2--