mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@googlemail.com.INVALID>
Subject Re: CI impaired
Date Sun, 25 Nov 2018 03:01:12 GMT
Hello Steffen,

thank you for bringing up these PRs.

I had to abort the builds during the outage, which means that the jobs
didn't finish and even the status propagation couldn't complete (hence
they show pending instead of failure or aborted).

Recently, we merged a PR that adds utility slaves. This will ensure that
status updates are always posted, no matter whether the main queue hangs
or not. The status will then be properly reflected and there should be no
runs stuck in pending.

I could retrigger all PRs to kick off another round of validation, but this
would result in 240 jobs being run (2 main pipelines times 120 open PRs).
Since we are currently in the pre-release stage, I wanted to avoid putting
the system under such heavy load.

Instead, I'd kindly like to ask the PR creators to make a new commit to
trigger the pipelines. In order to merge a PR, only PR-merge has to pass,
and I tried to retrigger all PRs that were aborted during the outage. It's
possible that I missed a few.
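
For PR authors without a code change pending, one simple way to retrigger
the pipelines is to push an empty commit. The snippet below is only an
illustrative sketch (it assumes git is installed and the PR branch is
checked out locally); doing the same two git commands by hand works just as
well:

    # Hypothetical helper: retrigger CI by pushing an empty commit to the
    # currently checked-out PR branch.
    import subprocess

    def retrigger_ci(message="Retrigger CI"):
        # Create a commit without any changes so CI sees a new revision.
        subprocess.run(["git", "commit", "--allow-empty", "-m", message], check=True)
        # Push it to the PR branch to kick off the pipelines again.
        subprocess.run(["git", "push"], check=True)

    retrigger_ci()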

Since it's still the weekend and there's not much going on, I can use the
time to trigger all PRs. Please advise whether you think I should move
forward (I expect the CI to finish all PRs within 6-10 hours) or if it's
fine to ask people to retrigger their PRs themselves.

Please excuse the inconvenience caused.

Best regards,
Marco


On Sun, Nov 25, 2018, 03:48 Steffen Rochel <steffenrochel@gmail.com>
wrote:

> Thanks Marco for the updates and resolving the issues.
> However, I do see a number of PRs waiting to be merged with inconsistent
> PR validation status checks.
> E.g. https://github.com/apache/incubator-mxnet/pull/13041 shows 9 pending
> checks as queued. However, when you look at the details, the checks have
> either passed or failed (centos-cpu, edge, unix-cpu, windows-cpu and
> windows-gpu failed; the required pr-merge, which includes the edge and gpu
> tests, passed).
> The same applies to other PRs with the label pr-awaiting-merge (
> https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge
> ).
> Please advise on how to resolve this.
>
> Regards,
> Steffen
>
> On Thu, Nov 22, 2018 at 12:09 PM Marco de Abreu
> <marco.g.abreu@googlemail.com.invalid> wrote:
>
> > Thanks everybody, I really appreciate it!
> >
> > Today was a good day, there were no incidents and everything appears to
> > be stable. In the meantime I did a deep dive on why we had such a
> > significant performance decrease in our compilation jobs - which then
> > clogged up the queue and resulted in 1000 jobs waiting to be scheduled.
> >
> > The reason was the way we use ccache to speed up our compilation jobs.
> > Usually, this yields a huge performance improvement (CPU openblas, for
> > example, goes from 30 minutes down to ~3 min, ARMv7 from 30 minutes down
> > to ~1.5 min, etc.). Unfortunately, in this case ccache was our limiting
> > factor. Here's some background on how we operate our cache:
> >
> > We use EFS to have a distributed ccache between all of our
> > unrestricted-prod-slaves. EFS is rated for almost unlimited scalability
> > (being consumed by thousands of instances in parallel [1]) with a
> > theoretical throughput of over 10 Gbps. One thing I didn't know when I
> > designed this approach was how throughput is granted. Similar to T2 CPU
> > credits, EFS uses BurstCredits to allow you higher throughput (the
> > default baseline is 50 MiB/s) [2]. Due to the high load, we consumed all
> > of our credits - here's a very interesting graph: [3].
> >
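> > (Just to illustrate, here is a rough sketch of how the remaining burst
> > credits can be read from CloudWatch - the region and file system id
> > below are placeholders, not our actual setup:)
> >
> >     # Sketch: read the remaining EFS burst credits for one file system.
> >     import datetime
> >     import boto3
> >
> >     cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")  # assumed region
> >     now = datetime.datetime.utcnow()
> >
> >     stats = cloudwatch.get_metric_statistics(
> >         Namespace="AWS/EFS",
> >         MetricName="BurstCreditBalance",
> >         Dimensions=[{"Name": "FileSystemId", "Value": "fs-12345678"}],  # placeholder id
> >         StartTime=now - datetime.timedelta(minutes=10),
> >         EndTime=now,
> >         Period=300,
> >         Statistics=["Minimum"],
> >     )
> >     for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
> >         print(point["Timestamp"], point["Minimum"], "bytes of burst credits left")
> >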
> > To avoid similar incidents in the future, I have taken the following actions:
> > 1. I switched EFS from burst-mode to provisioned throughput with 300MB/s
> > (in the graph at [3] you can see how our IO immediately increases - and
> > thus our CI gets faster - as soon as I added provisioned throughput).
> > 2. I created internal follow-up tickets to add monitoring and automated
> > actions.
> >
> > First, we should be notified if we are running low on credits so we can
> > kick off an investigation. Second (nice to have), we could have a Lambda
> > function which listens for that event and automatically switches the EFS
> > volume from burst mode to provisioned throughput during high-load times.
> > The required throughput could be retrieved via CloudWatch and then
> > multiplied by a factor. EFS allows downgrading the throughput mode 24h
> > after the last change (to reduce capacity once the load is over) and
> > always allows upgrading the provisioned capacity (if the load goes even
> > higher). I've been looking for a pre-made CloudFormation template to
> > facilitate that, but so far I haven't been able to find one.
> >
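> > (To illustrate the second point, here's a rough sketch of what such a
> > Lambda handler could look like - purely hypothetical, with a placeholder
> > file system id and a fixed 300 MiB/s target instead of a value derived
> > from CloudWatch:)
> >
> >     # Sketch: switch an EFS file system from bursting to provisioned
> >     # throughput when a low-credit alarm fires.
> >     import boto3
> >
> >     EFS_ID = "fs-12345678"  # placeholder
> >
> >     def handler(event, context):
> >         efs = boto3.client("efs")
> >         fs = efs.describe_file_systems(FileSystemId=EFS_ID)["FileSystems"][0]
> >         if fs["ThroughputMode"] != "provisioned":
> >             # The target could instead be computed from recent CloudWatch
> >             # throughput, multiplied by a safety factor.
> >             efs.update_file_system(
> >                 FileSystemId=EFS_ID,
> >                 ThroughputMode="provisioned",
> >                 ProvisionedThroughputInMibps=300.0,
> >             )
> >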
> > I'm now running additional load tests on our test CI environment to
> > detect other potential bottlenecks.
> >
> > Thanks a lot for your support!
> >
> > Best regards,
> > Marco
> >
> > [1]: https://docs.aws.amazon.com/efs/latest/ug/performance.html
> > [2]: https://docs.aws.amazon.com/efs/latest/ug/performance.html#throughput-modes
> > [3]: https://i.imgur.com/nboQLOn.png
> >
> > On Thu, Nov 22, 2018 at 1:40 AM Qing Lan <lanking520@live.com> wrote:
> >
> > > I appreciate your effort and help in making CI a better place!
> > >
> > > Qing
> > >
> > > On 11/21/18, 4:38 PM, "Lin Yuan" <apeforest@gmail.com> wrote:
> > >
> > >     Thanks for your efforts, Marco!
> > >
> > >     On Wed, Nov 21, 2018 at 4:02 PM Anirudh Subramanian
> > >     <anirudh2290@gmail.com> wrote:
> > >
> > >     > Thanks for the quick response and mitigation!
> > >     >
> > >     > On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu
> > >     > <marco.g.abreu@googlemail.com.invalid> wrote:
> > >     >
> > >     > > Hello,
> > >     > >
> > >     > > Today, CI had some issues and I had to cancel all jobs a few
> > >     > > minutes ago. This was basically caused by the high load that
> > >     > > is currently being put on our CI system due to the pre-release
> > >     > > efforts for this Friday.
> > >     > >
> > >     > > It's really unfortunate that we just had outages of three
> > >     > > core components within the last two days - sorry about that!
> > >     > > To recap, we had the following outages (which are unrelated
> > >     > > to the parallel refactor of the Jenkins pipeline):
> > >     > > - (yesterday evening) The Jenkins master ran out of disk
> > >     > > space and thus processed requests at reduced capacity.
> > >     > > - (this morning) The Jenkins master got updated, which broke
> > >     > > our autoscaling's upscaling capabilities.
> > >     > > - (new, this evening) The Jenkins API was unresponsive: due
> > >     > > to the high number of jobs and a bad API design in the
> > >     > > Jenkins REST API, the time complexity of a simple create or
> > >     > > delete request was quadratic, which resulted in all requests
> > >     > > timing out (that was the current outage). This left our auto
> > >     > > scaling unable to interface with the Jenkins master.
> > >     > >
> > >     > > I have now made improvements to our REST API calls which
> > >     > > reduced the complexity from O(N^2) to O(1). The reason was an
> > >     > > underlying redirect loop in the Jenkins createNode and
> > >     > > deleteNode REST API, in combination with unrolling the entire
> > >     > > slave and job graph (which got quite huge during extensive
> > >     > > load) upon every single request. Since we had about 150
> > >     > > registered slaves and 1000 jobs in the queue, the duration of
> > >     > > a single REST API call rose to up to 45 seconds (we execute
> > >     > > up to a few hundred queries per auto scaling loop). This led
> > >     > > to our auto scaling timing out.
> > >     > >
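> > >     > > (For context: the Jenkins JSON API lets callers restrict how
> > >     > > much of the object graph gets serialized via the "tree" query
> > >     > > parameter. A rough sketch of a cheap queue query - not
> > >     > > necessarily the exact change we made; the URL and credentials
> > >     > > are placeholders:)
> > >     > >
> > >     > >     # Sketch: fetch only the queue fields an autoscaler needs.
> > >     > >     import requests
> > >     > >
> > >     > >     resp = requests.get(
> > >     > >         "https://jenkins.example.com/queue/api/json",
> > >     > >         params={"tree": "items[id,why]"},  # don't serialize the whole graph
> > >     > >         auth=("user", "api-token"),        # placeholder credentials
> > >     > >         timeout=30,
> > >     > >         allow_redirects=False,             # surface redirect loops instead of following them
> > >     > >     )
> > >     > >     resp.raise_for_status()
> > >     > >     queued_items = resp.json().get("items", [])
> > >     > >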
> > >     > > Everything should be back to normal now. I'm closely
> > >     > > observing the situation and I'll let you know if I encounter
> > >     > > any additional issues.
> > >     > >
> > >     > > Again, sorry for any inconvenience caused.
> > >     > >
> > >     > > Best regards,
> > >     > > Marco
> > >     > >
> > >     > > On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell
> > >     > > <gavin.max.bell@gmail.com> wrote:
> > >     > >
> > >     > > > Yes, let me add to the kudos, very nice work Marco.
> > >     > > >
> > >     > > >
> > >     > > > "I'm trying real hard to be the shepherd." -Jules Winnfield
> > >     > > >
> > >     > > >
> > >     > > > > On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen
> > >     > > > <kellens@amazon.de.INVALID> wrote:
> > >     > > > >
> > >     > > > > Appreciate the big effort in bringing the CI back so
> > >     > > > > quickly. Thanks Marco.
> > >     > > > >
> > >     > > > > On Nov 21, 2018 5:52 AM, Marco de Abreu
> > >     > > > > <marco.g.abreu@googlemail.com.INVALID> wrote:
> > >     > > > > Thanks Aaron! Just for the record, the new Jenkins jobs
> > >     > > > > were unrelated to that incident.
> > >     > > > >
> > >     > > > > If somebody is interested in the details around the
> > >     > > > > outage:
> > >     > > > >
> > >     > > > > Due to required maintenance (a disk running full), we
> > >     > > > > had to upgrade our Jenkins master because it was running
> > >     > > > > on Ubuntu 17.04 (for an unknown reason, it used to be
> > >     > > > > 16.04) and we needed to install some packages. Since
> > >     > > > > support for Ubuntu 17.04 was stopped, all package updates
> > >     > > > > and installations failed because the repositories were
> > >     > > > > taken offline. Due to the unavailable maintenance package
> > >     > > > > and other issues with the installed OpenJDK8 version, we
> > >     > > > > made the decision to upgrade the Jenkins master to Ubuntu
> > >     > > > > 18.04 LTS in order to get back to a supported version
> > >     > > > > with maintenance tools. During this upgrade, Jenkins was
> > >     > > > > automatically updated by APT as part of the dist-upgrade
> > >     > > > > process.
> > >     > > > >
> > >     > > > > In the latest version of Jenkins, some labels that we
> > >     > > > > depend on for our auto scaling have been changed. To be
> > >     > > > > more specific:
> > >     > > > >> Waiting for next available executor on mxnetlinux-gpu
> > >     > > > > has been changed to
> > >     > > > >> Waiting for next available executor on ‘mxnetlinux-gpu’
> > >     > > > > Notice the quote characters.
> > >     > > > >
> > >     > > > > Unfortunately, Jenkins does not offer a better way than
> > >     > > > > parsing these messages - there's no standardized way to
> > >     > > > > express queue items. Since our parser expected the above
> > >     > > > > message without the quote characters, this message was
> > >     > > > > discarded.
> > >     > > > >
> > >     > > > > We support various queue reasons (5 of them, to be exact)
> > >     > > > > that indicate resource starvation. If we run super low on
> > >     > > > > capacity, the queue reason is different and we would
> > >     > > > > still be able to scale up, but most cases would have
> > >     > > > > printed the unsupported message. This resulted in reduced
> > >     > > > > capacity (to be specific, the limit during that time was
> > >     > > > > 1 slave per type).
> > >     > > > >
> > >     > > > > We have now fixed our autoscaling to automatically strip
> > >     > > > > these characters and added that message to our test
> > >     > > > > suite.
> > >     > > > >
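> > >     > > > > (As a small illustration of that kind of fix - a
> > >     > > > > hypothetical parser, not our actual code - a tolerant
> > >     > > > > regex can accept the label with or without the quote
> > >     > > > > characters:)
> > >     > > > >
> > >     > > > >     # Sketch: extract the starved label from a Jenkins queue reason.
> > >     > > > >     import re
> > >     > > > >
> > >     > > > >     PATTERN = re.compile(
> > >     > > > >         r"Waiting for next available executor on [‘']?(?P<label>[\w-]+)[’']?"
> > >     > > > >     )
> > >     > > > >
> > >     > > > >     def parse_starved_label(why):
> > >     > > > >         match = PATTERN.search(why or "")
> > >     > > > >         return match.group("label") if match else None
> > >     > > > >
> > >     > > > >     # Old and new message formats yield the same label.
> > >     > > > >     assert parse_starved_label(
> > >     > > > >         "Waiting for next available executor on ‘mxnetlinux-gpu’"
> > >     > > > >     ) == "mxnetlinux-gpu"
> > >     > > > >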
> > >     > > > > Best regards,
> > >     > > > > Marco
> > >     > > > >
> > >     > > > > On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham
> > >     > > > > <aaron.s.markham@gmail.com> wrote:
> > >     > > > >
> > >     > > > >> Marco, thanks for your hard work on this. I'm super
> > >     > > > >> excited about the new Jenkins jobs. This is going to be
> > >     > > > >> very helpful and improve sanity for our PRs and
> > >     > > > >> ourselves!
> > >     > > > >>
> > >     > > > >> Cheers,
> > >     > > > >> Aaron
> > >     > > > >>
> > >     > > > >> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
> > >     > > > >> <marco.g.abreu@googlemail.com.invalid> wrote:
> > >     > > > >>
> > >     > > > >>> Hello,
> > >     > > > >>>
> > >     > > > >>> the CI is now back up and running. Auto scaling is
> > >     > > > >>> working as expected and it passed our load tests.
> > >     > > > >>>
> > >     > > > >>> Please excuse the inconvenience caused.
> > >     > > > >>>
> > >     > > > >>> Best regards,
> > >     > > > >>> Marco
> > >     > > > >>>
> > >     > > > >>> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu
> > >     > > > >>> <marco.g.abreu@googlemail.com> wrote:
> > >     > > > >>>
> > >     > > > >>>> Hello,
> > >     > > > >>>>
> > >     > > > >>>> I'd like to let you know that our CI was impaired and
> > >     > > > >>>> down for the last few hours. After getting the CI back
> > >     > > > >>>> up, I noticed that our auto scaling broke due to a
> > >     > > > >>>> silent update of Jenkins which broke our
> > >     > > > >>>> upscale-detection. Manual scaling is currently not
> > >     > > > >>>> possible, and stopping the scaling won't help either
> > >     > > > >>>> because there are currently no p3 instances available,
> > >     > > > >>>> which means that all jobs will fail nonetheless. In a
> > >     > > > >>>> few hours, the auto scaling will have recycled all
> > >     > > > >>>> slaves through the down-scale mechanism and we will be
> > >     > > > >>>> out of capacity. This will lead to resource starvation
> > >     > > > >>>> and thus timeouts.
> > >     > > > >>>>
> > >     > > > >>>> Your PRs will be properly registered by Jenkins, but
> > >     > > > >>>> please expect the jobs to time out and thus fail your
> > >     > > > >>>> PRs.
> > >     > > > >>>>
> > >     > > > >>>> I will fix the auto scaling as soon as I'm awake
> > >     > > > >>>> again.
> > >     > > > >>>>
> > >     > > > >>>> Sorry for the inconvenience caused.
> > >     > > > >>>>
> > >     > > > >>>> Best regards,
> > >     > > > >>>> Marco
> > >     > > > >>>>
> > >     > > > >>>>
> > >     > > > >>>> P.S. Sorry for the brief email and my lack of further
> > >     > > > >>>> fixes, but it's 5:30 AM now and I've been working for
> > >     > > > >>>> 17 hours.
> > >     > > > >>>>
> > >     > > > >>>
> > >     > > > >>
> > >     > > >
> > >     > >
> > >     >
> > >
> > >
> > >
> >
>
