mxnet-dev mailing list archives

From Carl Yang <carl14...@gmail.com>
Subject Re: Single-Machine Topology-aware Communication
Date Mon, 25 Jun 2018 17:46:23 GMT
I added a few more figures showing how I arrived at the
MXNET_KVSTORE_GPUARRAY_BOUND value [Figures 7(b) and 7(c)]. I ran a
microbenchmark measuring runtime in seconds against the message size
sent through MXNet's KVStore. Figure 7(b) shows a crossover point
around 1M: beyond it, multi-tree shows higher bandwidth; below it,
single tree does.
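
For reference, the benchmark boils down to timing a push-pull round
trip through KVStore. A minimal sketch, assuming an 8-GPU machine and
an illustrative ~1M-element message (the message size and GPU count
are the swept parameters, not fixed values):

    import time
    import mxnet as mx

    kv = mx.kv.create('device')      # single-machine, multi-GPU kvstore
    ctxs = [mx.gpu(i) for i in range(8)]
    shape = (1024 * 1024,)           # ~1M float32 elements

    # one array per GPU, as in data-parallel training
    arrs = [mx.nd.ones(shape, ctx=c) for c in ctxs]
    kv.init(0, mx.nd.zeros(shape))

    start = time.time()
    kv.push(0, arrs)                 # reduce across GPUs
    kv.pull(0, out=arrs)             # broadcast the result back
    mx.nd.waitall()                  # push/pull are async; block here
    print('latency: %.6f s' % (time.time() - start))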

However, the microbenchmark that issues 150 push-pulls before waiting
[Figure 7(c)] puts the crossover point around 10M if we extrapolate
its behaviour to the right. The largest sizes could not be plotted
because memory consumption grew too high: I use 150 push-pulls of
fairly large arrays as a proxy for neural network parameters. Combined
with the parameter sweep over MXNET_KVSTORE_GPUARRAY_BOUND on VGG
shown in Figure 7(a), this suggests that 10M is preferable to 1M.
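
The 150 push-pull variant differs only in that synchronization happens
once at the end, so the engine can overlap the reductions. A minimal
sketch (again with illustrative sizes; the buffer footprint of 150
keys x 8 GPUs is what capped the largest message sizes):

    import mxnet as mx

    kv = mx.kv.create('device')
    ctxs = [mx.gpu(i) for i in range(8)]
    shape = (1024 * 1024,)
    num_keys = 150                   # proxy for a network's parameter arrays

    bufs = [[mx.nd.ones(shape, ctx=c) for c in ctxs]
            for _ in range(num_keys)]
    for k in range(num_keys):
        kv.init(k, mx.nd.zeros(shape))

    for k in range(num_keys):
        kv.push(k, bufs[k])
        kv.pull(k, out=bufs[k])
    mx.nd.waitall()                  # wait only after all 150 are issued

The sweep in Figure 7(a) then just reruns VGG training with the bound
exported, e.g. MXNET_KVSTORE_GPUARRAY_BOUND=10000000.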

For the multiple-root case, I currently generate 8 trees, one rooted
at each GPU. For single-tree Reduce and Broadcast I use only the first
tree; this performed better than varying the root in the single-tree
case.
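
To make the multi-root structure concrete: the actual trees are
derived from the NVLink/PCIe link topology as described in the cwiki
proposal, but purely as an illustration, a hypothetical
topology-agnostic version would look like this:

    # Hypothetical sketch: one binary reduction tree per root GPU.
    # The real implementation picks edges based on link topology.
    def binary_tree(gpus):
        # heap-style binary tree over `gpus`, rooted at gpus[0];
        # returns parent and children maps keyed by GPU id
        children = {g: [] for g in gpus}
        parent = {gpus[0]: None}
        for i, g in enumerate(gpus[1:], start=1):
            p = gpus[(i - 1) // 2]
            parent[g] = p
            children[p].append(g)
        return parent, children

    n_gpus = 8
    trees = []
    for root in range(n_gpus):
        order = [(root + i) % n_gpus for i in range(n_gpus)]
        trees.append(binary_tree(order))   # tree rooted at `root`

    # multi-tree: key k is reduced on trees[k % n_gpus];
    # single-tree: every key uses trees[0]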

Regards,
Carl

On 6/25/18, Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> Nice design document. Where does the default
> MXNET_KVSTORE_GPUARRAY_BOUND value of 10M come from?
> Do you generate a tree for each GPU?
>
> Pedro.
>
>
> On Mon, Jun 18, 2018 at 2:30 PM Carl Yang <carl14706@gmail.com> wrote:
>
>> Hi,
>>
>> Currently, we have two methods for single-machine communication:
>> parameter server and NCCL ring reduction. Both have downsides. The
>> parameter server does not differentiate between NVLink and PCIe
>> connections, so it uses the higher-latency, lower-bandwidth PCIe
>> links as often as NVLink. NCCL uses the ring-reduce algorithm, which
>> has higher theoretical latency than tree-based alternatives. I am
>> working on a topology-aware approach that addresses these
>> limitations. The design proposal is on cwiki:
>>
>> https://cwiki.apache.org/confluence/display/MXNET/Single+machine+All+Reduce+Topology-aware+Communication
>>
>> Please feel free to let me know if you have any suggestions.
>>
>> Regards,
>> Carl
>>
>
