mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@googlemail.com>
Subject Re: Auto scaling for MXNet CI
Date Wed, 16 May 2018 15:22:38 GMT
Thanks a lot!

The following numbers are based on our experience in the test environment.
Best case: ~1:50h (unchanged) (0:01 + 0:38 + 0:39 + 0:33 + 0:03) -
conditions: No instances have to be provisioned and caches are primed
Average case: 2:10h (1:50h + 0:10 for instance startup + 0:10 for cache
loading) - conditions: Windows instances are available (they get turned off
less frequently), Ubuntu instances have to be provisioned and the cache is
not present
Worst case: 3:06h (1:50h + 0:02 + 0:50 + 0:20 + 0:02 + 0:02) - conditions:
no available instances
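
To make the arithmetic behind these figures easy to re-check, here is a
minimal Python sketch that sums the durations listed above. The helper
names (total, fmt) are illustrative only and not part of any CI tooling;
the durations are copied from the three cases as stated.

```python
from datetime import timedelta


def total(parts):
    """Sum a list of 'H:MM' strings into a single timedelta."""
    result = timedelta()
    for part in parts:
        hours, minutes = part.split(":")
        result += timedelta(hours=int(hours), minutes=int(minutes))
    return result


def fmt(delta):
    """Format a timedelta as 'H:MM'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}:{minutes % 60:02d}"


# Durations exactly as listed in the three cases above.
best_stages = total(["0:01", "0:38", "0:39", "0:33", "0:03"])
average = total(["1:50", "0:10", "0:10"])
worst = total(["1:50", "0:02", "0:50", "0:20", "0:02", "0:02"])

print("best case stages:", fmt(best_stages))
print("average case:    ", fmt(average))
print("worst case:      ", fmt(worst))
```

The individual best-case stages add up to 1:54, which is what the rounded
"~1:50h" refers to; the average and worst cases add up to 2:10 and 3:06.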

The worst case is dominated by the Windows instances. They take about 20
minutes to start, and the unprimed MSVC cache adds about 30 minutes to the
build time. To balance this out, we're using a less aggressive downscaling
policy for Windows together with larger instance buffers. At the same time,
we're currently working on persistent build caches. An additional option
could be reserved instances.
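
As a rough illustration of what a less aggressive downscaling policy with
a larger buffer means in practice, here is a hypothetical Python sketch.
It is not the actual auto scaling code; the function names, the POLICY
table and all numeric values are assumptions made up for this example.

```python
# Hypothetical per-platform settings (assumed values, not the real config).
# "buffer" is the number of spare instances kept warm; "idle_limit" is how
# many minutes an instance may sit idle before it is stopped.
POLICY = {
    "ubuntu": {"buffer": 1, "idle_limit": 10},
    "windows": {"buffer": 3, "idle_limit": 60},
}


def target_instances(queued_jobs, busy_instances, buffer):
    """Instances to keep running: current demand plus a warm buffer."""
    return queued_jobs + busy_instances + buffer


def should_stop(idle_minutes, idle_limit):
    """An instance is stopped only after it has been idle long enough."""
    return idle_minutes >= idle_limit


# Same load and same idle time, different outcome per platform.
for platform, settings in POLICY.items():
    wanted = target_instances(2, 4, settings["buffer"])
    stop = should_stop(30, settings["idle_limit"])
    print(platform, "target:", wanted, "stop after 30 min idle:", stop)
```

With these assumed values, a Windows instance that has been idle for 30
minutes keeps running while an Ubuntu instance is stopped, which trades
some idle cost for fewer slow Windows cold starts.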

We will observe the situation and then assess the required next steps. For
now, we want to make sure everything is running stably and that no builds
are getting interrupted.

Best regards,
Marco

On Wed, May 16, 2018 at 3:47 AM, Thomas DELTEIL <thomas.delteil1@gmail.com>
wrote:

> Great news :) thanks Marco!
>
> On Tue, May 15, 2018, 18:29 Steffen Rochel <steffenrochel@gmail.com>
> wrote:
>
> > Thanks Marco, good step forward.
> > What is the expected, typical and worst case TAT for PR checks?
> >
> > Steffen
> >
> > On Tue, May 15, 2018 at 10:49 AM Marco de Abreu <
> > marco.g.abreu@googlemail.com> wrote:
> >
> > > Hello,
> > >
> > > I'd like to announce the deployment of auto scaling for our CI system
> > > (see [1] for reference, setup documentation at [2]) for today at
> > > 11:00PM PST 05/15/18. I expect no downtime since these changes are
> > > happening outside of Jenkins.
> > >
> > > This system will increase the flexibility of our CI to handle the
> > > increasing load that results from the growing number of great
> > > contributions! In the future, our CI will automatically adapt to the
> > > current load and will thus support big tasks like the to-be-migrated
> > > nightly tests or increased load before a release. Additionally, we're
> > > now able to provide scalable p3.2xlarge instances and have the
> > > possibility to add new instance types without much effort.
> > >
> > > In the future, you will see that new slaves are started as the queue
> > > grows and stopped when they become idle. Your tasks will automatically
> > > be picked up, and our system makes sure every PR gets processed as fast
> > > as possible.
> > >
> > > If you encounter any issues in the next week, please don't hesitate to
> > > reach out to me. I'm looking forward to everyone's feedback!
> > >
> > > Best regards,
> > > Marco
> > >
> > >
> > > [1]: https://cwiki.apache.org/confluence/display/MXNET/Proposal%3A+Auto+Scaling
> > > [2]: https://cwiki.apache.org/confluence/display/MXNET/Setup
> > >
> >
>
