mxnet-dev mailing list archives

From "Ye, Zhouhai" <zhouhai...@intel.com>
Subject RE: Extend MXNET distributed training with MPI AllReduce
Date Tue, 27 Mar 2018 08:30:13 GMT
For our current POC:
b. Add mpi.kvstore in the Python layer. It depends on a new MXNet submodule, mpi_collectives (a C++ library that in turn depends on MXNet). This adds a new type of kvstore at the Python layer.

mpi_collectives doesn't need to be a separate C++ library; its source code can be compiled directly into libmxnet.so.


From: Ye, Zhouhai
Sent: Tuesday, March 27, 2018 11:21 AM
To: Nan Zhu <zhunanmcgill@gmail.com>; dev@mxnet.incubator.apache.org
Cc: Li, Mu <mli@amazon.com>; Lv, Tao A <tao.a.lv@intel.com>; Ma, Guokai <guokai.ma@intel.com>;
Rahul Huilgol <rahulhuilgol@gmail.com>; Ye, Jason Y <jason.y.ye@intel.com>; Zhang,
Rong A <rong.a.zhang@intel.com>; Zhao, Patric <patric.zhao@intel.com>
Subject: RE: Extend MXNET distributed training with MPI AllReduce

You can check the mpi.kvstore API spec in our design doc.

E.g., we add pushpull and broadcast interfaces and disable the original push and pull in the new kvstore.
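The semantics of the fused pushpull call can be illustrated with a toy, single-process sketch (the real mpi.kvstore would use MPI_Allreduce/MPI_Bcast under the hood; the function signatures here are illustrative, not the actual API):

```python
# Toy illustration of the pushpull semantics: gradients from all workers
# are summed (the "push") and the reduced result is returned to every
# worker (the "pull") in one fused step, as an allreduce does.

def pushpull(worker_grads):
    """Sum gradients element-wise across workers; all workers get the result."""
    reduced = [sum(vals) for vals in zip(*worker_grads)]
    return [list(reduced) for _ in worker_grads]

def broadcast(root_value, num_workers):
    """Send the root worker's initial parameters to every worker."""
    return [list(root_value) for _ in range(num_workers)]

grads = [[1.0, 2.0], [3.0, 4.0]]   # two workers, one gradient tensor each
print(pushpull(grads))             # every worker sees [4.0, 6.0]
```

With push and pull fused into a single collective, there is no separate server-side aggregation step, which is why the original push/pull calls are disabled in the new kvstore.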

From: Ye, Zhouhai
Sent: Tuesday, March 27, 2018 11:18 AM
To: 'Nan Zhu' <zhunanmcgill@gmail.com>; dev@mxnet.incubator.apache.org
Cc: Li, Mu <mli@amazon.com>; Lv, Tao A <tao.a.lv@intel.com>; Ma, Guokai <guokai.ma@intel.com>; Rahul Huilgol <rahulhuilgol@gmail.com>; Ye, Jason Y <jason.y.ye@intel.com>; Zhang, Rong A <rong.a.zhang@intel.com>; Zhao, Patric <patric.zhao@intel.com>
Subject: RE: Extend MXNET distributed training with MPI AllReduce

Hi Nan Zhu,

As we described in our design doc, there are two possible code structures (implementations); our current POC implements the second:


a.      Implement mpi.kvstore at the same level as the current kvstores (C++, src/kvstore), adhering to the original kvstore factory pattern.

b.      Add mpi.kvstore in the Python layer. It depends on a new MXNet submodule, mpi_collectives (a C++ library that in turn depends on MXNet). This adds a new type of kvstore at the Python layer.


For your second question, I think making a single communication submodule is OK (as in option a.), but a unified abstraction covering both PS and AllReduce would be very hard.


From: Nan Zhu [mailto:zhunanmcgill@gmail.com]
Sent: Tuesday, March 27, 2018 10:39 AM
To: dev@mxnet.incubator.apache.org
Cc: Li, Mu <mli@amazon.com>; Lv, Tao A <tao.a.lv@intel.com>; Ma, Guokai <guokai.ma@intel.com>; Rahul Huilgol <rahulhuilgol@gmail.com>; Ye, Jason Y <jason.y.ye@intel.com>; Ye, Zhouhai <zhouhai.ye@intel.com>; Zhang, Rong A <rong.a.zhang@intel.com>; Zhao, Patric <patric.zhao@intel.com>
Subject: Re: Extend MXNET distributed training with MPI AllReduce

Hi, Patric

It's pretty nice work!

A question:

What would the future code structure look like with this AllReduce module added as a submodule? Would we then have two communication submodules?

Is there any plan for a unified communication abstraction, so that a single communication submodule becomes possible?

Best,

Nan


On Mon, Mar 26, 2018 at 7:20 PM, Chris Olivier <cjolivier01@gmail.com> wrote:
great! nice work!

On Mon, Mar 26, 2018 at 6:31 PM Zhao, Patric <patric.zhao@intel.com> wrote:

> Hi MXNET owners/developers,
>
> As you know, AllReduce and Parameter Server are two very popular
> distributed training modes in DL.
>
> Currently, MXNET only supports Parameter Server mode and lacks an
> AllReduce mode. Other frameworks, like TensorFlow, PyTorch, Caffe, etc.,
> can work with AllReduce.
> Based on our analysis and experiments, AllReduce mode achieves better
> scalability and efficiency.
>
> So, we propose to extend MXNET distributed training with an MPI AllReduce
> mode.
> We have implemented an AllReduce prototype in MXNET and the results are
> very positive.
> AllReduce mode gets 94.7% scale efficiency on 8 compute nodes for VGG16,
> while Parameter Server requires 16 nodes in total (8 compute nodes + 8
> parameter servers) to reach 93.2%.
>
> The whole proposal is available in the MXNET wiki. Any feedback is highly
> appreciated.
>
> https://cwiki.apache.org/confluence/display/MXNET/Extend+MXNet+Distributed+Training+by+MPI+AllReduce
>
> Thanks in advance.
>
> BR,
>
> --Patric
>
>
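The efficiency figures quoted above can be put in per-machine terms with simple arithmetic (a sketch using only the node counts and efficiencies reported in the email; no other measurements are assumed):

```python
# Scale efficiency is speedup relative to the ideal: efficiency = speedup / N.
# AllReduce reaches 94.7% on 8 machines total; Parameter Server reaches
# 93.2% on 8 compute nodes but needs 8 extra server nodes (16 machines total).

def speedup(n_compute, efficiency):
    """Speedup over one node implied by a given scale efficiency."""
    return n_compute * efficiency

allreduce = speedup(8, 0.947)      # uses 8 machines in total
ps        = speedup(8, 0.932)      # uses 16 machines in total

print(round(allreduce, 3))         # 7.576x on 8 machines
print(round(ps, 3))                # 7.456x on 16 machines
print(round(ps / 16, 3))           # 0.466 speedup per machine for PS
```

So although the two efficiencies look close (94.7% vs 93.2%), the Parameter Server setup delivers its speedup with roughly half the per-machine utilization, which is the point of the comparison in the quoted message.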
