mxnet-dev mailing list archives

From "Gavin M. Bell" <gavin.max.b...@gmail.com>
Subject Re: CI impaired
Date Fri, 30 Nov 2018 16:50:35 GMT
Hey Folks,

Marco has been running this change in dev, with flying colors, for some
time. This is not an experiment but a roll out that was announced.  We also
decided to make this change after the release cut to limit the blast radius
on any critical obligations to the community.  Marco is accountable for
this work and will address any issues that may occur as he has been put
on-call.  We have, to the best of our ability, mitigated as much risk as
possible, and now it is time to pull the trigger.  The community will enjoy a bit
more visibility and clarity into the test process which will be
advantageous, as well as allowing us to extend our infrastructure in a way
that affords us more flexibility.

No pending PRs will be impacted.

Thank you for your support as we evolve this system to better serve the
community.

-Gavin

On Fri, Nov 30, 2018 at 5:23 PM Marco de Abreu
<marco.g.abreu@googlemail.com.invalid> wrote:

> Hello Naveen, this is not an experiment. Everything has been tested in our
> test system and is considered working 100%. This is not a test but actually
> the move into production - the merge into master happened a week ago. We
> now just have to put all PRs into the catalogue, which means that all PRs
> have to be analyzed with the new pipelines - the only thing that will be
> noticeable is that the CI is under higher load.
>
> The pending PRs will not be impacted. The existing pipeline is still
> running in parallel and everything will behave as before.
>
> -Marco
>
> On Fri, Nov 30, 2018 at 4:41 PM Naveen Swamy <mnnaveen@gmail.com> wrote:
>
> > Marco, run your experiments on a branch - set up, test it well and then
> > bring it to the master.
> >
> > > On Nov 30, 2018, at 6:53 AM, Marco de Abreu <marco.g.abreu@googlemail.com.INVALID> wrote:
> > >
> > > Hello,
> > >
> > > I'm now moving forward with #1. I will try to get to #3 as soon as
> > > possible to reduce parallel jobs in our CI. You might notice some
> > > unfinished jobs. I will let you know as soon as this process has been
> > > completed. Until then, please bear with me since we have hundreds of
> > > jobs to run in order to validate all PRs.
> > >
> > > Best regards,
> > > Marco
> > >
> > > On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu <marco.g.abreu@googlemail.com> wrote:
> > >
> > >> Hello,
> > >>
> > >> since the release branch has now been cut, I would like to move forward
> > >> with the CI improvements for the master branch. This would include the
> > >> following actions:
> > >> 1. Re-enable the new Jenkins job
> > >> 2. Request Apache Infra to move the protected branch check from the main
> > >> pipeline to our new ones
> > >> 3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this
> > >> finalizes the deprecation process
> > >>
> > >> If nobody objects, I would like to start with #1 soon. Mentors, could you
> > >> please help create the Apache Infra ticket? I would then take it from
> > >> there and talk to Infra.
> > >>
> > >> Best regards,
> > >> Marco
> > >>
> > >> On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <
> > >> kellen.sunderland@gmail.com> wrote:
> > >>
> > >>> Sorry, [1] meant to reference
> > >>> https://issues.jenkins-ci.org/browse/JENKINS-37984 .
> > >>>
> > >>> On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <
> > >>> kellen.sunderland@gmail.com> wrote:
> > >>>
> > >>>> Marco and I ran into another urgent issue over the weekend that was
> > >>>> causing builds to fail.  This issue was unrelated to any feature
> > >>>> development work, or other CI fixes applied recently, but it did
> > >>>> require quite a bit of work from Marco (and a little from me) to fix.
> > >>>>
> > >>>> We spent enough time on the problem that it caused us to take a step
> > >>>> back and consider how we could both fix issues in CI and support the
> > >>>> 1.4 release with the least impact possible on MXNet devs.  Marco had
> > >>>> planned to make a significant change to the CI to fix a long-standing
> > >>>> Jenkins error [1], but we feel that most developers would prioritize
> > >>>> having a stable build environment for the next few weeks over having
> > >>>> this fix in place.
> > >>>>
> > >>>> To properly introduce a new CI system the intent was to do a gradual
> > >>>> blue/green roll out of the fix.  To manage this rollout would have
> > >>>> taken operational effort and double compute load as we run systems in
> > >>>> parallel.  This risks outages due to scaling limits, and we’d rather
> > >>>> make this change during a period of low-developer activity, i.e.
> > >>>> shortly after the 1.4 release.
> > >>>>
> > >>>> This means that from now until the 1.4 release, in order to reduce
> > >>>> complexity MXNet developers should only see a single Jenkins
> > >>>> verification check, and a single Travis check.
> > >>>>
> > >>>>
> > >>>
> > >>
> >
>


-- 
Sincerely,
Gavin M. Bell

 "Never mistake a clear view for a short distance."
              -Paul Saffo
