mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@googlemail.com.INVALID>
Subject Re: CI impaired
Date Wed, 21 Nov 2018 13:51:14 GMT
Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to
that incident.

If somebody is interested in the details around the outage:

For required maintenance (the disk was running full), we had to upgrade our
Jenkins master: it was running Ubuntu 17.04 (for an unknown reason; it used
to be 16.04) and we needed to install some packages. Since support for
Ubuntu 17.04 has ended, all package updates and installations failed because
the repositories had been taken offline. Because the maintenance package was
unavailable and there were other issues with the installed OpenJDK8 version,
we decided to upgrade the Jenkins master to Ubuntu 18.04 LTS in order to get
back to a supported version with maintenance tools. During this upgrade,
Jenkins was automatically updated by APT as part of the dist-upgrade process.

In the latest version of Jenkins, some labels that we depend on for our auto
scaling have been changed. To be more specific:
> Waiting for next available executor on mxnetlinux-gpu
has been changed to
> Waiting for next available executor on ‘mxnetlinux-gpu’
Notice the quote characters.

Unfortunately, Jenkins offers no better way than parsing these messages:
there is no standardized way to express queue items. Since our parser
expected the above message without quote characters, the new message was
discarded.

We support various queue reasons (5 of them, to be exact) that indicate
resource starvation. If we run very low on capacity, the queue reason is
different and we would still have been able to scale up, but in most cases
Jenkins printed the unsupported message. This resulted in reduced capacity
(to be specific, the limit during that time was 1 slave per type).

We have now fixed our autoscaling to automatically strip these characters
and added that message to our test suite.
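
For illustration, the quote handling could look roughly like the sketch
below. This is not our actual autoscaling code: the function name, the
regular expression and the set of quote characters are assumptions made up
for this example, and only the two message formats quoted above are covered.

```python
import re
from typing import Optional

# Quote characters to tolerate around the label (an assumed set: ASCII
# quotes plus the typographic quotes seen in the new Jenkins message).
_QUOTE_CHARS = "'\"\u2018\u2019\u201c\u201d"

# Hypothetical pattern for the one queue reason quoted in this thread.
_WAITING_FOR_EXECUTOR = re.compile(
    r"^Waiting for next available executor on (?P<label>.+)$"
)


def starved_label(queue_reason: str) -> Optional[str]:
    """Return the node label a queue item is waiting on, or None.

    Works for both the old (unquoted) and the new (quoted) message format.
    """
    match = _WAITING_FOR_EXECUTOR.match(queue_reason.strip())
    if match is None:
        # Not this queue reason; a real parser would try its other patterns.
        return None
    # Strip surrounding whitespace first, then any surrounding quotes.
    return match.group("label").strip().strip(_QUOTE_CHARS)
```

Both message variants then resolve to the same label, so the scale-up
decision no longer depends on how Jenkins happens to quote it.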

Best regards,
Marco

On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <aaron.s.markham@gmail.com>
wrote:

> Marco, thanks for your hard work on this. I'm super excited about the new
> Jenkins jobs. This is going to be very helpful and improve sanity for our
> PRs and ourselves!
>
> Cheers,
> Aaron
>
> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
> <marco.g.abreu@googlemail.com.invalid> wrote:
>
> > Hello,
> >
> > the CI is now back up and running. Auto scaling is working as expected and
> > it passed our load tests.
> >
> > Please excuse the inconvenience.
> >
> > Best regards,
> > Marco
> >
> > On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
> > marco.g.abreu@googlemail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I'd like to let you know that our CI was impaired and down for the last
> > > few hours. After getting the CI back up, I noticed that our auto scaling
> > > broke due to a silent update of Jenkins which broke our upscale-detection.
> > > Manual scaling is currently not possible and stopping the scaling won't
> > > help either because there are currently no p3 instances available, which
> > > means that all jobs will fail nonetheless. In a few hours, the auto
> > > scaling will have recycled all slaves through the down-scale mechanism and
> > > we will be out of capacity. This will lead to resource starvation and thus
> > > timeouts.
> > >
> > > Your PRs will be properly registered by Jenkins, but please expect the
> > > jobs to time out and thus fail your PRs.
> > >
> > > I will fix the auto scaling as soon as I'm awake again.
> > >
> > > Sorry for the inconvenience.
> > >
> > > Best regards,
> > > Marco
> > >
> > >
> > > P.S. Sorry for the brief email and my lack of further fixes, but it's
> > > 5:30AM now and I've been working for 17 hours.
> > >
> >
>
