MIME-Version: 1.0
In-Reply-To: <1639971225.5146069.1515490604049@mail.yahoo.com>
References: <CAHb99TyeNU4mGBo+8RxGu2rMJ2=Jnu_BQ_O+0s62nC+P-Gchrw@mail.gmail.com>
 <CABgAAfcNKczpjjXbG12yxwtaAvN6YOzUNX+fzGaAc98q2gNYzA@mail.gmail.com> <1639971225.5146069.1515490604049@mail.yahoo.com>
From: kellen sunderland <kellen.sunderland@gmail.com>
Date: Tue, 9 Jan 2018 10:48:51 +0100
Message-ID: <CAHb99TxG5t5s+L6w9LUztEUKmZwBHW5RGkmPFU74UrzR5=TO7Q@mail.gmail.com>
Subject: Re: [DISCUSS] Seeding and determinism on multi-gpu systems.
To: dev@mxnet.incubator.apache.org
Content-Type: multipart/alternative; boundary="001a114265d88084b6056254d2f2"

--001a114265d88084b6056254d2f2
Content-Type: text/plain; charset="UTF-8"

Thanks Asmus, yes, this is also the approach I would be in favour of.  I
think we should optionally allow the user to request deterministic
behaviour that is independent of the GPU they run on.  If MXNet is going to
support more arbitrary linear algebra operations, I could see a lot of use
cases for this.  For example, I want deterministic noise fed into a deep-RL
simulation so that I can compare a few different algorithms without
variance in the noise, and do it in parallel on my machine (which happens
to have two GPUs).
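
As a rough sketch of that optional switch, the seed expression from
resource.cc quoted further down can be modelled in plain Python. This is
not MXNet code: the constant values are placeholders (the thread does not
give them), `i` is assumed to be a per-device resource index, and the
`device_independent` flag is a hypothetical name for the proposed option.

```python
# Placeholder constants: the real values live in MXNet's resource.cc and are
# not stated in this thread.
K_MAX_NUM_GPUS = 16
K_RAND_MAGIC = 127
UINT32_MASK = 0xFFFFFFFF


def derive_seed(global_seed, dev_id, i, device_independent=False):
    """Model of the quoted seed expression with a hypothetical opt-out flag."""
    # With the flag set, the device id no longer contributes to the seed.
    dev_term = 0 if device_independent else dev_id
    # Mask to 32 bits to mimic the uint32_t result of the C++ expression.
    return (dev_term + i * K_MAX_NUM_GPUS + global_seed * K_RAND_MAGIC) & UINT32_MASK


# Same user seed on two devices: distinct by default, identical when pinned.
default_seeds = [derive_seed(42, dev_id, 0) for dev_id in (0, 1)]
pinned_seeds = [derive_seed(42, dev_id, 0, device_independent=True)
                for dev_id in (0, 1)]
```

With the flag off, the two devices get different seeds, which is what
distributed training needs; with it on, both get the same seed, so the run
no longer depends on which GPU the code lands on.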

On Tue, Jan 9, 2018 at 10:36 AM, Asmus Hetzel <asmushetzel@yahoo.de.invalid>
wrote:

>  The issue is tricky. Number generators should return deterministic sets
> of numbers as Chris said, but that usually only applies to non-distributed
> systems. And to some extent, we already have a distributed system as soon
> as one CPU and one GPU are involved.
> For the usual setup like distributed training, using different seeds on
> different devices is a must. You distribute a process that involves random
> number generation and that means that you absolutely have to ensure that
> the sequences on the devices do not correlate. So this behaviour is
> intended and correct. We also cannot guarantee that random number
> generation is deterministic when running on CPU vs. running on GPU.
> So what we are dealing with here is generating repeatable results when the
> application/code section is running on a single GPU out of a bigger set of
> available GPUs, but we do not have control over which one. The crucial line
> in MXNet is this one (resource.cc):
>
> const uint32_t seed = ctx.dev_id + i * kMaxNumGPUs + global_seed * kRandMagic;
> Here I think it would make sense to add a switch that optionally makes
> this setting independent of ctx.dev_id. But we would have to document
> really well that this is solely meant for specific types of debugging/unit
> testing.
>
>     On Monday, 8 January 2018, 19:30:02 CET, Chris Olivier <
> cjolivier01@gmail.com> wrote:
>
>  Is it explicitly defined somewhere that random number generators should
> always return a deterministic set of numbers given the same seed, or is
> that just a side-effect of some hardware not having a better way to
> generate random numbers so they use a user-defined seed to kick off the
> randomization starting point?
>
> On Mon, Jan 8, 2018 at 9:27 AM, kellen sunderland <
> kellen.sunderland@gmail.com> wrote:
>
> > Hello MXNet devs,
> >
> > I wanted to see what people thought about the following section of code,
> > which I think has some subtle pros/cons:
> > https://github.com/apache/incubator-mxnet/blob/d2a856a3a2abb4e72edc301b8b821f0b75f30722/src/resource.cc#L188
> >
> > Tobi (tdomhan) from sockeye pointed it out to me after he spent some time
> > debugging non-determinism in his model training.
> >
> > This functionality is well documented here:
> > https://mxnet.incubator.apache.org/api/python/ndarray.html#mxnet.random.seed
> > but I don't think the current API meets all use cases due to this
> > section:
> >
> > "Random number generators in MXNet are device specific. Therefore, random
> > numbers generated from two devices can be different even if they are
> > seeded using the same seed."
> >
> > I'm guessing this is a feature that makes distributed training easier in
> > MXNet; you wouldn't want to train the same model on each GPU.  However,
> > the downside of this is that if you run unit tests on a multi-GPU system,
> > or in a training environment where you don't have control over which GPU
> > you use, you can't count on deterministic behaviour which you can assert
> > results against.  I have a feeling there are non-unit-test use cases where
> > you'd also want deterministic behaviour independent of which GPU you
> > happen to have your code scheduled to run on.
> >
> > How do others feel about this?  Would it make sense to have some optional
> > args in the seed call to have the seed-per-device functionality turned
> > off?
> >
> > -Kellen
> >
>
>
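
The unit-test concern from the original message can be illustrated without
MXNet at all. In the sketch below, Python's standard `random.Random` stands
in for a per-device generator, and `device_seed` with its `per_device`
argument is a hypothetical seeding policy, not an existing MXNet API.

```python
import random


def device_seed(user_seed, dev_id, per_device=True):
    """Hypothetical policy: per_device=False ignores the device id."""
    return (user_seed + dev_id) if per_device else user_seed


def noise(seed, n=4):
    """Draw n Gaussian samples from a generator seeded with `seed`."""
    rng = random.Random(seed)  # stand-in for a device-local generator
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Per-device seeding: repeatable on one device, different across devices.
reference = noise(device_seed(7, dev_id=0))
same_device = noise(device_seed(7, dev_id=0))
other_device = noise(device_seed(7, dev_id=1))

# Device-independent seeding: the same stream wherever the test is scheduled.
pinned_ref = noise(device_seed(7, dev_id=0, per_device=False))
pinned_other = noise(device_seed(7, dev_id=1, per_device=False))
```

A test that asserts against `reference` passes only when it happens to run
on device 0 under the per-device policy, whereas the pinned variant yields
the same values on either device.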

--001a114265d88084b6056254d2f2--