mxnet-dev mailing list archives

From Pedro Larroy <pedro.larroy.li...@gmail.com>
Subject Re: Single-Machine Topology-aware Communication
Date Mon, 25 Jun 2018 07:16:16 GMT
Nice design document. Where does the default value of 10M for
MXNET_KVSTORE_GPUARRAY_BOUND come from?
Do you generate a tree for each GPU?

Pedro.


On Mon, Jun 18, 2018 at 2:30 PM Carl Yang <carl14706@gmail.com> wrote:

> Hi,
>
> Currently, we have two methods for single-machine communication:
> parameter server and NCCL ring reduction. Both of these methods have
> some downsides. Parameter server does not differentiate between NVLink
> connections and PCI-E, so it ends up using the higher latency and
> slower PCI-E connections as frequently as it does NVLink. NCCL uses
> the ring reduce algorithm, which has higher theoretical latency than
> other algorithms. I am working on a topology-aware approach that can
> address these limitations. The design proposal is on cwiki:
>
> https://cwiki.apache.org/confluence/display/MXNET/Single+machine+All+Reduce+Topology-aware+Communication
>
> Please feel free to let me know if you have any suggestions.
>
> Regards,
> Carl
>
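
For reference, the latency claim in the quoted message can be illustrated with a rough step count. This is only a sketch under stated assumptions, not the scheme from the design proposal: it assumes the textbook ring all-reduce (reduce-scatter followed by all-gather) and, for comparison, a binomial-tree reduce followed by a broadcast. The function names are hypothetical, and link bandwidth and the NVLink/PCI-E topology are ignored.

```python
def ring_allreduce_steps(num_gpus):
    """Sequential steps in a ring all-reduce.

    Assumption: reduce-scatter takes (N - 1) steps and all-gather
    takes another (N - 1) steps.
    """
    return 2 * (num_gpus - 1)


def tree_allreduce_steps(num_gpus):
    """Sequential steps for a binomial-tree reduce plus a broadcast.

    Assumption: each phase takes ceil(log2(N)) rounds.
    (N - 1).bit_length() equals ceil(log2(N)) for N >= 1.
    """
    return 2 * (num_gpus - 1).bit_length()


if __name__ == "__main__":
    # Compare how the step counts grow with the number of GPUs.
    for n in (2, 4, 8, 16):
        print(n, ring_allreduce_steps(n), tree_allreduce_steps(n))
```

Under these assumptions the ring's step count grows linearly with the number of GPUs and the tree's grows logarithmically, so the gap matters mostly for small messages where per-step latency dominates.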
