From Haibin Lin <notificati...@github.com>
Subject [apache/incubator-mxnet] [RFC] Unified API for Distributed Data Parallel Training (#16795)
Date Tue, 12 Nov 2019 19:42:53 GMT
## Background  
Data parallel training is the most common distributed training technique when it comes to
multiple GPUs or multiple hosts. Currently, several communication backends provide functionalities
for communicating tensors across devices/hosts for data parallel training. For MXNet users,
there are a few options: 
1. native kvstore
2. [p3 kvstore](https://www.sysml.cc/doc/2019/75.pdf)
3. [horovod](https://github.com/horovod/horovod/) 
4. [BytePS](https://github.com/bytedance/byteps/)

These different implementations provide different APIs:
- native kvstore
  - high level APIs: `mx.gluon.Trainer`
  - low level APIs: `kv.push`, `kv.pull`, `kv.init`
- horovod
  - high level APIs: `hvd.init()`, `hvd.DistributedTrainer`
  - low level APIs: `hvd.broadcast`, `hvd.allreduce`
- BytePS
  - high level APIs: `bps.init()`, `bps.DistributedTrainer`
  - low level APIs: `byteps_declare_tensor`, `byteps_push_pull`

Here, the high level APIs are what a typical user calls to run a distributed training
job. To communicate tensors that are not managed by a `Trainer` or `DistributedTrainer`, users
fall back to the low level APIs to send/receive custom tensors, as the sketch below illustrates.
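
For example, summing the same custom tensor across workers already looks quite different under
the native kvstore and Horovod today. The following is only a rough sketch for comparison: the
tensor name, shape, and the `dist_sync` kvstore type are made up for illustration, and each
alternative would normally live in its own script.
```
import mxnet as mx

custom_ndarray = mx.nd.ones((2, 2))

# alternative A - native kvstore: register a key, then push and pull explicitly
kv = mx.kv.create('dist_sync')
kv.init('my_tensor', custom_ndarray)
kv.push('my_tensor', custom_ndarray)
kv.pull('my_tensor', out=custom_ndarray)

# alternative B - horovod: a single allreduce call, no key registration step
import horovod.mxnet as hvd
hvd.init()
custom_ndarray = hvd.allreduce(custom_ndarray, average=False)
```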

## Problem Statement
Sometimes we want to easily switch between these different distributed communication backends
and compare which one performs the best for a particular distributed training environment.
Because these implementations expose different APIs, trying each of them requires substantial
changes to user code. It typically involves custom logic to (a sketch of such per-backend branching follows the list):
1. launch Python processes for the distributed training job ([BytePS launch](https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#mxnet)
vs. [Horovod](https://github.com/horovod/horovod#running-horovod))
2. initialize communication backends ([example code](https://github.com/eric-haibin-lin/gluon-nlp/blob/benchmark/scripts/bert/run_pretraining.py#L187-L228))
3. create (Distributed)Trainers ([example code](https://github.com/eric-haibin-lin/gluon-nlp/blob/benchmark/scripts/bert/run_pretraining.py#L297-L303))
4. send custom tensors ([example code](https://github.com/eric-haibin-lin/gluon-nlp/blob/benchmark/scripts/bert/run_pretraining.py#L582-L586))
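
Concretely, a script that wants to support more than one backend today ends up branching along
these lines. This is only a sketch: `args` (parsed command line flags) and `net` (a Gluon model)
are assumed to come from the surrounding training script.
```
import mxnet as mx

if args.comm_backend == 'horovod':
    import horovod.mxnet as hvd
    hvd.init()
    rank, num_workers = hvd.rank(), hvd.size()
    trainer = hvd.DistributedTrainer(net.collect_params(), 'adam',
                                     {'learning_rate': args.lr})
elif args.comm_backend == 'byteps':
    import byteps.mxnet as bps
    bps.init()
    rank, num_workers = bps.rank(), bps.size()
    trainer = bps.DistributedTrainer(net.collect_params(), 'adam',
                                     {'learning_rate': args.lr})
else:  # native parameter server kvstore
    kv = mx.kv.create('dist_sync')
    rank, num_workers = kv.rank, kv.num_workers
    trainer = mx.gluon.Trainer(net.collect_params(), 'adam',
                               {'learning_rate': args.lr}, kvstore=kv)
```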

## Proposal 

My proposal is to provide a unified API that allows custom communication backends to act as plugins
for MXNet, so that switching between backends requires no changes to user code.

Specifically, a communication backend provider implements the following Python APIs.

class `AbstractKVStore`:
- def __init__(): initialization
- def broadcast(name, tensor, root_rank): broadcast the `tensor` at `root_rank` to all ranks
  - name: tensor name. int or str
  - tensor: ndarray 
- def push_pull(name, tensor, output): push `tensor` and pull the result into `output`. When an optimizer
is not set, it performs a summation of `tensor` across all ranks; the result of the summation
is then pulled back into the `output` tensor.
  - name: tensor name. int or str
  - tensor: ndarray to push
  - output: ndarray to store the output of pull 
- def set_optimizer(optimizer): set the optimizer at the parameter servers. Optional interface;
only used by parameter-server based backends.
  - optimizer: mx.optimizer.Optimizer
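
As an illustration of the provider side, an existing backend could be wrapped behind this interface
roughly as follows. This is only a sketch: `HorovodKVStore` is a hypothetical class name,
`AbstractKVStore` and `mx.kv.register()` are the proposed (not yet existing) MXNet hooks, and
Horovod is used as an example backend.
```
import horovod.mxnet as hvd

class HorovodKVStore(AbstractKVStore):  # AbstractKVStore is the proposed base class
    def __init__(self):
        hvd.init()

    def broadcast(self, name, tensor, root_rank):
        # broadcast the tensor held at root_rank to all ranks, in place
        hvd.broadcast_(tensor, root_rank=root_rank, name=str(name))

    def push_pull(self, name, tensor, output):
        # no parameter-server optimizer here: push_pull is a plain allreduce (sum)
        hvd.allreduce(tensor, average=False, name=str(name)).copyto(output)

    def set_optimizer(self, optimizer):
        # Horovod has no parameter server, so a server-side optimizer is not supported
        raise NotImplementedError

# proposed registration hook (not in MXNet today):
# mx.kv.register('horovod', HorovodKVStore)
```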

A communication backend provider can implement these APIs and register a new KVStore in MXNet
via `mx.kv.register()`. MXNet users then only need to interact with the following MXNet
APIs:
- using high level APIs for a typical data parallel model training
```
backend = mx.kv.create('horovod')
trainer = mx.gluon.Trainer(kv=backend)
# forward: loss = net(data)
# backward: loss.backward()
# update: trainer.step()
```
- using low level APIs to reduce a custom tensor
```
backend.broadcast("name", custom_ndarray, root_rank=0)
backend.push_pull("name", custom_ndarray, output=custom_ndarray)
```
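
Putting the pieces together, a backend-agnostic training script under this proposal might look
roughly like the following sketch. Note that `mx.kv.create('horovod')` and the `kv=` argument to
`Trainer` are proposed extensions rather than existing API, and the model and data here are
placeholders.
```
import mxnet as mx
from mxnet import autograd, gluon

backend = mx.kv.create('horovod')  # proposed: any registered backend name works here
net = gluon.nn.Dense(1, in_units=8)
net.initialize()

# broadcast initial parameters so every rank starts from the same values
for i, param in enumerate(net.collect_params().values()):
    backend.broadcast(i, param.data(), root_rank=0)

trainer = mx.gluon.Trainer(net.collect_params(), 'sgd',
                           {'learning_rate': 0.1}, kv=backend)  # proposed kv= argument
loss_fn = gluon.loss.L2Loss()

data = mx.nd.random.uniform(shape=(32, 8))
label = mx.nd.random.uniform(shape=(32, 1))
with autograd.record():
    loss = loss_fn(net(data), label)
loss.backward()
trainer.step(batch_size=32)
```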

## Limitation 

The unified interface does not cover advanced features such as sparse ndarrays or gradient compression,
which are less mature and not provided by all communication backends.

The above proposal targets use cases 2, 3, and 4 in the problem statement. It can be extended to
tackle 1 as well if feedback is positive.

@ymjiang @apeforest @anandj91 @rich-junwang 
