mxnet-dev mailing list archives

From "Zhao, Patric" <patric.z...@intel.com>
Subject Extend MXNET distributed training with MPI AllReduce
Date Tue, 27 Mar 2018 01:30:48 GMT
Hi MXNET owners/developers,

As you know, AllReduce and Parameter Server are two very popular distributed training
modes in DL.

Currently, MXNET only supports the parameter server mode and lacks an AllReduce mode. Other
frameworks, such as TensorFlow, PyTorch, and Caffe, can work with AllReduce.
Based on our analysis and experiments, the AllReduce mode achieves better scalability
and higher efficiency.
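
For illustration, here is a minimal sketch of how AllReduce-based gradient averaging
works, using mpi4py (this is not the prototype's code; the library choice and variable
names are assumptions made only to show the pattern):

    # Minimal illustration of AllReduce-based gradient aggregation with mpi4py.
    # Not the prototype implementation; names and library choice are illustrative.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    # Each worker computes local gradients on its own data shard.
    local_grad = np.random.rand(4).astype(np.float32)  # stand-in for a real gradient

    # AllReduce sums the gradients across all workers; every worker receives
    # the same aggregated result, with no parameter server involved.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    global_grad /= comm.Get_size()  # average over the number of workers

    # Each worker then applies the identical averaged gradient to its local
    # model replica, keeping all replicas in sync.

In contrast, the parameter server mode has workers push gradients to and pull updated
weights from dedicated server nodes, which is where the extra nodes in the comparison
below come from.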

So, we propose to extend MXNET distributed training with an MPI AllReduce mode.
We have implemented an AllReduce prototype in MXNET and the results are very positive:
the AllReduce mode reaches 94.7% scaling efficiency with 8 compute nodes for VGG16, while the Parameter
Server mode requires 16 nodes in total (8 compute nodes + 8 parameter servers) to reach 93.2%.

The full proposal is available on the MXNET wiki. Any feedback is highly appreciated:
https://cwiki.apache.org/confluence/display/MXNET/Extend+MXNet+Distributed+Training+by+MPI+AllReduce

Thanks in advance.

BR,

--Patric

