From: Taliesin Beynon
To: dev@mxnet.incubator.apache.org
Cc: sebastianb
Subject: trouble with foreach operator in conjunction with multiple GPUs
Date: Wed, 28 Nov 2018 16:11:39 +0200

Hello fellow MXNetters,

We've seen that the subgraph execution mechanism that is used to run things like the foreach operator causes MXExecutorForward to block, instead of just issuing the ops in the normal asynchronous way (https://github.com/apache/incubator-mxnet/blob/212364b0cba28aeda989378f6e630f7a61749bf3/src/executor/graph_executor.cc#L1352). On its own this is a surprising fact that can lead to some issues if you're not expecting it, like your time being spent in MXExecutorForward instead of WaitAll / WaitRead. Is there a reason that this process isn't just automatically done on a separate thread for you? Is it to ensure that subsequent ops on the original thread are correctly serialized wrt the ops produced by the foreach?
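For concreteness, this is roughly how we are seeing it from the Python API (a rough, untested sketch; the loop body, shapes and step count are arbitrary placeholders, and it assumes one GPU is available): with the foreach subgraph, most of the wall-clock time ends up inside forward() rather than in waitall().

    import time
    import mxnet as mx

    def step(data, states):
        # trivial loop body: one elementwise op per time step
        out = data + states[0]
        return out, [out]

    data = mx.sym.var('data')   # (num_steps, batch, hidden)
    init = mx.sym.var('init')   # (batch, hidden)
    outs, _ = mx.sym.contrib.foreach(step, data, [init])

    exe = outs.simple_bind(ctx=mx.gpu(0),
                           data=(100, 128, 512), init=(128, 512))

    t0 = time.time()
    exe.forward(is_train=False,
                data=mx.nd.ones((100, 128, 512), ctx=mx.gpu(0)),
                init=mx.nd.ones((128, 512), ctx=mx.gpu(0)))
    t1 = time.time()   # with foreach, most of the time shows up here ...
    mx.nd.waitall()
    t2 = time.time()   # ... rather than here
    print('forward %.3fs  waitall %.3fs' % (t1 - t0, t2 - t1))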
More importantly, this has the unfortunate implication that if you are using multi-device parallelism with foreach, by just looping over your executors and calling Forward on them, you will inadvertently serialize much of the computation: you can't call Forward on the second executor until Forward on the first executor has returned, and the foreach causes that first Forward call to block until the forward pass is (mostly) done! So it kills multi-device parallelism unless one starts making thread pools so that one can 'unblock' Forward (and probably the subsequent Backward) and have each device's Forward run in a separate thread (see the rough sketch in the P.S. below).

Is this intended? Are we missing something about how you are supposed to use subgraphs in conjunction with multi-device parallelism? It seems like a weakness in the current design of subgraph execution. It also appears that the Python API doesn't have any strategy to deal with this issue; as you can see at https://github.com/apache/incubator-mxnet/blob/2276bb0e30b1fe601eb288cb4f1b673484892d4b/python/mxnet/executor_manager.py#L281, it's not making separate threads or anything there.

Thanks!
Tali + Sebastian
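P.S. For reference, below is roughly the kind of thread-pool workaround we mean (an untested sketch; `executors` stands for one already-bound Executor per GPU, each with its own copy of the arguments, as executor_manager would normally set up). The idea is just to let the blocking Forward calls issued by the per-device foreach subgraphs overlap across GPUs instead of running back to back.

    from concurrent.futures import ThreadPoolExecutor

    def parallel_forward(executors, is_train=True):
        # run each device's (blocking) Forward in its own thread so the
        # per-device foreach subgraphs execute concurrently
        with ThreadPoolExecutor(max_workers=len(executors)) as pool:
            futures = [pool.submit(exe.forward, is_train=is_train)
                       for exe in executors]
            for f in futures:
                f.result()   # propagate any exception from the worker threads
        # Backward (also dispatched per device) could be handled the same way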