mxnet-dev mailing list archives

From Asmus Hetzel <asmushet...@yahoo.de.INVALID>
Subject Re: [DISCUSS] Seeding and determinism on multi-gpu systems.
Date Tue, 09 Jan 2018 09:36:44 GMT
 The issue is tricky. Number generators should return deterministic sets of numbers, as Chris
said, but that usually only applies to non-distributed systems. And to some extent, we already
have a distributed system as soon as one CPU and one GPU are involved.
For the usual setup, like distributed training, using different seeds on different devices
is a must. You distribute a process that involves random number generation, and that means
you absolutely have to ensure that the sequences on the devices do not correlate. So
this behaviour is intended and correct. We also cannot guarantee that random number generation
is deterministic when running on CPU versus running on GPU.
So what we are dealing with here is generating repeatable results when the application/code section
runs on a single GPU out of a bigger set of available GPUs, but we do not have control
over which one. The crucial line in MXNet is this one (resource.cc):

const uint32_t seed = ctx.dev_id + i * kMaxNumGPUs + global_seed * kRandMagic; 
Here I think it would make sense to add a switch that optionally makes this setting independent
of ctx.dev_id. But we would have to document really well that this is solely meant for specific
types of debugging/unit testing.
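
For illustration, a minimal standalone sketch of what such a switch could look like. The
constant values and the MXNET_DETERMINISTIC_SEED environment variable are assumptions of
mine for this sketch, not existing MXNet settings:

    // Sketch of the seed derivation plus a hypothetical switch that makes it
    // device-independent. kMaxNumGPUs/kRandMagic values are placeholders here;
    // see resource.cc for the real constants.
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    namespace {
    constexpr uint32_t kMaxNumGPUs = 16;   // assumed value for illustration
    constexpr uint32_t kRandMagic  = 127;  // assumed value for illustration

    uint32_t DeriveSeed(uint32_t dev_id, uint32_t i, uint32_t global_seed) {
      // Hypothetical switch: when set, drop dev_id so every device derives the
      // same seed from the global seed; otherwise keep per-device behaviour.
      const bool ignore_dev_id = std::getenv("MXNET_DETERMINISTIC_SEED") != nullptr;
      const uint32_t base = i * kMaxNumGPUs + global_seed * kRandMagic;
      return ignore_dev_id ? base : dev_id + base;
    }
    }  // namespace

    int main() {
      // With the switch off, GPU 0 and GPU 1 get different seeds; with it on, they match.
      std::printf("gpu0: %u\n", DeriveSeed(0, 0, 42));
      std::printf("gpu1: %u\n", DeriveSeed(1, 0, 42));
      return 0;
    }

An opt-in switch like this would leave the default per-device seeding untouched and confine
the deterministic mode to debugging/unit-test runs.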

    On Monday, 8 January 2018 at 19:30:02 CET, Chris Olivier <cjolivier01@gmail.com>
wrote:
 
 Is it explicitly defined somewhere that random number generators should
always return a deterministic set of numbers given the same seed, or is
that just a side-effect of some hardware not having a better way to
generate random numbers so they use a user-defined seed to kick off the
randomization starting point?

On Mon, Jan 8, 2018 at 9:27 AM, kellen sunderland <
kellen.sunderland@gmail.com> wrote:

> Hello MXNet devs,
>
> I wanted to see what people thought about the following section of code, which
> I think has some subtle pros/cons:
> https://github.com/apache/incubator-mxnet/blob/d2a856a3a2abb4e72edc301b8b821f0b75f30722/src/resource.cc#L188
>
> Tobi (tdomhan) from sockeye pointed it out to me after he spent some time
> debugging non-determinism in his model training.
>
> This functionality is well documented here:
> https://mxnet.incubator.apache.org/api/python/ndarray.html#mxnet.random.seed
> but I don't think the current api meets all use cases due to this section:
>
> "Random number generators in MXNet are device specific. Therefore, random
> numbers generated from two devices can be different even if they are seeded
> using the same seed."
>
> I'm guessing this is a feature that makes distributed training easier in
> MXNet; you wouldn't want to train the same model on each GPU.  However, the
> downside of this is that if you run unit tests on a multi-GPU system, or in
> a training environment where you don't have control over which GPU you use,
> you can't count on deterministic behaviour that you can assert results
> against.  I have a feeling there are non-unit-test use cases where you'd
> also want deterministic behaviour independent of which GPU you happen to
> have your code scheduled to run on.
>
> How do others feel about this?  Would it make sense to have some optional
> args in the seed call to have the seed-per-device functionality turned off?
>
> -Kellen
>
  