mxnet-dev mailing list archives

From sebastianb <sebasti...@wolfram.com.INVALID>
Subject Re: Should MXNet 1.3 contain a buggy version of nn.Embedding backward by default?
Date Tue, 24 Jul 2018 14:02:02 GMT
> As MXNet v1.3 is likely to be used a lot with Cuda 9.2 I believe the default behavior
> should be changed to use the bug-free but less efficient Kernel.


It would be crazy to do anything else, to be honest. It's a terrible philosophy to say to users
'you can't rely on MXNet to have correct behaviour on the fastest GPU; rather, you need to
follow the forums and issue lists in order to know that you need to opt in to a bug-free implementation'.

> On Jul 24, 2018, at 3:47 AM, Leonard Lausen <leonard-software@lausen.nl> wrote:
> 
> Currently the default kernel of nn.Embedding backward is known to be
> buggy on P3 instances or when using Cuda 9.2 (the issue also occurs on
> other instances with earlier versions of Cuda, but less often).
> 
> https://github.com/apache/incubator-mxnet/issues/11314
> 
> There is currently an opt-in for using a bug-free kernel, but it is not
> the default. However, the bug-free kernel is used by default for shapes
> smaller than 16384.
> 
> Should MXNet ship a more efficient but buggy kernel in v1.3 or use a
> correct but less efficient kernel by default? As MXNet v1.3 is likely to
> be used a lot with Cuda 9.2 I believe the default behavior should be
> changed to use the bug-free but less efficient Kernel. Correctness and
> providing a good user experience should be No. 1 here (?). Then users
> that want a faster but buggy backward kernel can still select to do so.
> Note this only affects the backward pass.
> 
> Hao did related work on improving the take operator
> https://github.com/apache/incubator-mxnet/pull/11326
> https://github.com/apache/incubator-mxnet/pull/11795 which also fixes
> the issue, but he found it to be only "slightly faster" compared to the
> bug-free kernel that is currently under opt-in while leading to CI
> failures on Windows.
> 
> In my experience, there is no speed difference between the current buggy and
> opt-in bug-free kernel, but the GPU utilization of the latter is 100% compared
> to 60% for the former (benchmark script:
> https://github.com/apache/incubator-mxnet/pull/11795#issuecomment-405808567 )
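
For readers unfamiliar with what the backward pass discussed above has to compute: the gradient with respect to the embedding weight is, in general, a scatter-add of the output-gradient rows into the rows selected by the input indices. Below is a minimal pure-Python sketch of that reference computation. It is not MXNet's kernel, and the function name embedding_backward_reference is hypothetical; something like it could serve as a slow baseline to compare a GPU kernel's result against.

```python
def embedding_backward_reference(indices, grad_output, vocab_size):
    """Reference weight gradient for an embedding lookup (illustrative only).

    indices:     list of n integer token ids
    grad_output: list of n rows, each a list of `dim` floats
    vocab_size:  number of rows in the embedding weight
    Returns a vocab_size x dim list of lists.
    """
    dim = len(grad_output[0]) if grad_output else 0
    grad_weight = [[0.0] * dim for _ in range(vocab_size)]
    # Each looked-up row receives the sum of the gradients of every
    # position that selected it, so repeated indices must accumulate.
    for idx, row in zip(indices, grad_output):
        for j, g in enumerate(row):
            grad_weight[idx][j] += g
    return grad_weight


# Token id 2 appears twice, so its two gradient rows are summed.
indices = [2, 0, 2]
grad_output = [[1.0, 2.0], [0.5, 0.5], [3.0, 4.0]]
grad_weight = embedding_backward_reference(indices, grad_output, vocab_size=4)
```

Rows never selected by any index keep a zero gradient; a sequential loop like this trades speed for an unambiguous result.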

