mxnet-dev mailing list archives

From Chris Olivier <cjolivie...@gmail.com>
Subject Re: Extend MXNET distributed training with MPI AllReduce
Date Tue, 27 Mar 2018 02:20:13 GMT
great! nice work!

On Mon, Mar 26, 2018 at 6:31 PM Zhao, Patric <patric.zhao@intel.com> wrote:

> Hi MXNET owners/developers,
>
> As you know, AllReduce and Parameter Server are two very popular
> distributed training modes in DL.
>
> Currently, MXNET only supports Parameter Server mode and lacks an
> AllReduce mode. Other frameworks, such as TensorFlow, PyTorch, and Caffe,
> already support AllReduce.
> Based on our analysis and experiments, AllReduce mode achieves better
> scalability and efficiency.
>
> So, we propose to extend MXNET distributed training with an MPI AllReduce
> mode.
> We have implemented an AllReduce prototype in MXNET, and the results are
> very positive.
> AllReduce mode reaches 94.7% scaling efficiency with 8 compute nodes for
> VGG16, while Parameter Server mode requires 16 nodes in total (8 compute
> nodes + 8 parameter servers) to reach 93.2%.
>
> The full proposal is available on the MXNET wiki. Any feedback is highly
> appreciated.
>
> https://cwiki.apache.org/confluence/display/MXNET/Extend+MXNet+Distributed+Training+by+MPI+AllReduce
>
> Thanks in advance.
>
> BR,
>
> --Patric
>
>
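
[Editor's note: a minimal sketch of the AllReduce pattern described in the
quoted message, assuming mpi4py and NumPy. It is illustrative only and is
not the prototype code from the linked proposal. Each worker computes a
local gradient, Allreduce sums it across all ranks, and every worker
averages it, so no dedicated parameter-server nodes are needed.]

    # Run with e.g.: mpirun -np 4 python allreduce_sketch.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Stand-in for the gradient this worker computed on its data shard.
    local_grad = np.random.rand(4).astype(np.float32)

    # Sum gradients from all workers; every rank receives the same result.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    global_grad /= size  # average across workers

    if rank == 0:
        print("averaged gradient:", global_grad)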
