mxnet-dev mailing list archives

From "Ye, Zhouhai" <>
Subject RE: Extend MXNET distributed training with MPI AllReduce
Date Tue, 27 Mar 2018 03:18:19 GMT
Nan Zhu,

As described in our design doc, there are two possible code structures (implementations); we currently implement the second in our POC:

a.      Implement mpi.kvstore at the same level as the current kvstores (C++, src/kvstore), adhering to the original kvstore factory pattern.

b.      Add mpi.kvstore in Python, depending on a new mxnet submodule, mpi_collectives (a C++ library that itself depends on mxnet). This adds a new type of kvstore in Python.
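To make option (a) concrete, here is a minimal pure-Python sketch of the kvstore factory pattern it would adhere to: a registry maps type-name strings to constructors, and the MPI-backed kvstore is just one more entry. All names here ("dist_sync_mpi", MPIKVStore, LocalKVStore) are illustrative assumptions, not the actual MXNET API.

```python
class KVStore:
    """Minimal stand-in for the base kvstore interface."""
    def push(self, key, value):
        raise NotImplementedError
    def pull(self, key):
        raise NotImplementedError

class LocalKVStore(KVStore):
    """Single-process kvstore: push accumulates, pull reads back."""
    def __init__(self):
        self._store = {}
    def push(self, key, value):
        self._store[key] = self._store.get(key, 0) + value
    def pull(self, key):
        return self._store[key]

class MPIKVStore(KVStore):
    """Placeholder for an MPI-backed kvstore; a real one would call
    MPI_Allreduce in push/pull rather than touch a local dict."""
    def __init__(self):
        self._store = {}
    def push(self, key, value):
        self._store[key] = value
    def pull(self, key):
        return self._store[key]

# The factory registry: option (a) amounts to adding one entry here.
_KVSTORE_TYPES = {
    "local": LocalKVStore,
    "dist_sync_mpi": MPIKVStore,  # hypothetical new type name
}

def create(kv_type="local"):
    """Factory entry point, analogous in spirit to mx.kvstore.create(type)."""
    try:
        return _KVSTORE_TYPES[kv_type]()
    except KeyError:
        raise ValueError(f"unknown kvstore type: {kv_type}")
```

The point of the pattern is that callers only ever go through create(), so a new transport is purely additive.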

For your second question, I think making a single communication submodule is OK (just like a.). But a unified abstraction covering both PS and AllReduce is very hard.
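A rough way to see why the unified abstraction is hard: the PS pattern is an asymmetric key/value push-then-pull against server-held state, while allreduce is a symmetric collective with no server, where every rank ends up holding the same reduced value. The toy simulation below (not MXNET code; sum() stands in for the server-side reduce and for MPI_Allreduce respectively) shows the two shapes side by side.

```python
def ps_round(server_state, worker_grads, key):
    """Parameter-server style: workers push gradients for `key`,
    the server reduces them into its state, then each worker pulls."""
    server_state[key] = sum(worker_grads)            # push + server reduce
    return [server_state[key]] * len(worker_grads)   # each worker pulls

def allreduce_round(worker_grads):
    """AllReduce style: one collective, no server state;
    every rank receives the sum of all ranks' gradients."""
    total = sum(worker_grads)                        # stand-in for MPI_Allreduce(SUM)
    return [total] * len(worker_grads)
```

Both rounds deliver the same numbers to the workers, but one is stateful and key-addressed while the other is a stateless collective, which is what makes a single abstraction over them awkward.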

From: Nan Zhu []
Sent: Tuesday, March 27, 2018 10:39 AM
Cc: Li, Mu <>; Lv, Tao A <>; Ma, Guokai <>;
Rahul Huilgol <>; Ye, Jason Y <>; Ye,
Zhouhai <>; Zhang, Rong A <>; Zhao,
Patric <>
Subject: Re: Extend MXNET distributed training with MPI AllReduce

Hi, Patric

It's pretty nice work!

A question:

What would the future code structure look like when this allreduce module is added as a submodule? Will we have two communication submodules?

Is there any plan to provide a unified abstraction for communication so that a single communication
submodule is possible?



On Mon, Mar 26, 2018 at 7:20 PM, Chris Olivier <> wrote:
great! nice work!

On Mon, Mar 26, 2018 at 6:31 PM Zhao, Patric <> wrote:

> Hi MXNET owners/developers,
> As you know, AllReduce and Parameter Server are two very popular
> distributed training modes in DL.
> Currently, MXNET only supports parameter server mode and lacks an
> AllReduce mode. Other frameworks, like TensorFlow, PyTorch, Caffe, etc., can
> work with AllReduce.
> Based on our analysis and experiments, AllReduce mode achieves
> better scalability and efficiency.
> So, we propose to extend MXNET distributed training with an MPI AllReduce
> mode.
> We have implemented an AllReduce prototype in MXNET and the results are
> very positive.
> AllReduce mode achieves 94.7% scaling efficiency with 8 compute nodes for VGG16,
> while the Parameter Server mode requires 16 nodes in total (8 compute nodes + 8
> parameter servers) to reach 93.2%.
> The whole proposal is available in the MXNET wiki. Any feedback is highly
> appreciated.
> Thanks in advance.
> BR,
> --Patric
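For readers unfamiliar with how scaling-efficiency figures like the 94.7% above are usually defined for throughput benchmarks: efficiency(N) = throughput_N / (N * throughput_1). The sketch below uses made-up throughput numbers purely for illustration; they are not measurements from the proposal.

```python
def scaling_efficiency(throughput_n, n_nodes, throughput_1):
    """Fraction of ideal linear speedup achieved by n_nodes."""
    return throughput_n / (n_nodes * throughput_1)

# Illustrative (invented) numbers: if one node processes 100 img/s and
# 8 nodes together process 757.6 img/s, efficiency is 757.6 / (8 * 100).
```

With these invented inputs the function returns 0.947, i.e. a 94.7% scaling efficiency.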
