mxnet-dev mailing list archives

From Roshani Nagmote <>
Subject Re: Changes to MPI-operator
Date Tue, 16 Apr 2019 18:21:43 GMT
Sounds good. We (Pinar, Vandana and I) are currently prototyping and we are
planning to start a discussion on the dev list once we have some logical
We will share more details soon and seek feedback from the community.


On Mon, Apr 15, 2019 at 5:30 PM Yuan Tang <> wrote:

> I am cc’ing MXNet dev mailing list here.
> Thanks for the note Roshani. Look forward to seeing your contribution!
> Though let’s also discuss this on the MXNet dev mailing list to avoid
> duplicate work, since other people (e.g. Carl and Lin) might be working on
> this as well.
> Best,
> Yuan
> On Mon, Apr 15, 2019 at 5:51 PM Rong Ou <> wrote:
>> Sounds great! Yes it would be nice to have some examples for MXNet.
>> On Mon, Apr 15, 2019 at 3:36 PM Roshani Nagmote <> wrote:
>>> Hi,
>>> I work on Apache MXNet and recently I used MPI-Operator to run
>>> distributed training with MXNet and Horovod on Kubernetes.
>>> A few other folks and I tried to adjust the capacity for a training job
>>> based on the available workers and restart the training job from where it
>>> left off if any worker goes away in between.
>>> To do this, we had to make a few modifications to MPI-operator. For
>>> example, updating workerReplicas and launcherRole. Currently, changes are
>>> in my repo and I will be making a PR on MPI-operator with these changes.
>>> Also, I am planning to contribute a few examples. I wanted to reach out to you
>>> first before creating a PR.
>>> Please let me know what your thoughts are on this.
>>> Thanks,
>>> Roshani
