mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@googlemail.com.INVALID>
Subject Re: CI impaired
Date Sat, 01 Dec 2018 03:27:35 GMT
Thanks Naveen and Gavin!

#1 has been completed and every job has finished its processing.

#2 is the ticket with infra:
https://issues.apache.org/jira/browse/INFRA-17346

I'm now waiting for their response.

-Marco

On Fri, Nov 30, 2018 at 8:25 PM Naveen Swamy <mnnaveen@gmail.com> wrote:

> Hi Marco/Gavin,
>
> Thanks for the clarification. I was not aware that it has been tested on a
> separate test environment (this is what I was suggesting, to make the
> changes in a more controlled manner). Last time the change was made, many
> PRs were left dangling and developers had to go retrigger them; I triggered
> them at least 5 times before it succeeded today.
>
> Appreciate all the hard work to make CI better.
>
> -Naveen
>
> On Fri, Nov 30, 2018 at 8:50 AM Gavin M. Bell <gavin.max.bell@gmail.com>
> wrote:
>
> > Hey Folks,
> >
> > Marco has been running this change in dev, with flying colors, for some
> > time. This is not an experiment but a roll out that was announced.  We also
> > decided to make this change post the release cut to limit the blast radius
> > from any critical obligations to the community.  Marco is accountable for
> > this work and will address any issues that may occur as he has been put
> > on-call.  We have, to our best ability, mitigated as much risk as possible
> > and now it is time to pull the trigger.  The community will enjoy a bit
> > more visibility and clarity into the test process which will be
> > advantageous, as well as allowing us to extend our infrastructure in a way
> > that affords us more flexibility.
> >
> > No pending PRs will be impacted.
> >
> > Thank you for your support as we evolve this system to better serve the
> > community.
> >
> > -Gavin
> >
> > On Fri, Nov 30, 2018 at 5:23 PM Marco de Abreu
> > <marco.g.abreu@googlemail.com.invalid> wrote:
> >
> > > Hello Naveen, this is not an experiment. Everything has been tested in our
> > > test system and is considered working 100%. This is not a test but actually
> > > the move into production - the merge into master happened a week ago. We
> > > now just have to put all PRs into the catalogue, which means that all PRs
> > > have to be analyzed with the new pipelines - the only thing that will be
> > > noticeable is that the CI is under higher load.
> > >
> > > The pending PRs will not be impacted. The existing pipeline is still
> > > running in parallel and everything will behave as before.
> > >
> > > -Marco
> > >
> > > On Fri, Nov 30, 2018 at 4:41 PM Naveen Swamy <mnnaveen@gmail.com> wrote:
> > >
> > > > Marco, run your experiments on a branch - set up, test it well and then
> > > > bring it to the master.
> > > >
> > > > > On Nov 30, 2018, at 6:53 AM, Marco de Abreu <marco.g.abreu@googlemail.com.INVALID> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > > I'm now moving forward with #1. I will try to get to #3 as soon as possible
> > > > > > to reduce parallel jobs in our CI. You might notice some unfinished jobs. I
> > > > > > will let you know as soon as this process has been completed. Until then,
> > > > > > please bear with me since we have hundreds of jobs to run in order to
> > > > > > validate all PRs.
> > > > >
> > > > > Best regards,
> > > > > Marco
> > > > >
> > > > > > On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu <marco.g.abreu@googlemail.com> wrote:
> > > > >
> > > > >> Hello,
> > > > >>
> > > > > >> since the release branch has now been cut, I would like to move forward
> > > > > >> with the CI improvements for the master branch. This would include the
> > > > > >> following actions:
> > > > > >> 1. Re-enable the new Jenkins job
> > > > > >> 2. Request Apache Infra to move the protected branch check from the main
> > > > > >> pipeline to our new ones
> > > > > >> 3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this
> > > > > >> finalizes the deprecation process
> > > > > >>
> > > > > >> If nobody objects, I would like to start with #1 soon. Mentors, could you
> > > > > >> please assist to create the Apache Infra ticket? I would then take it from
> > > > > >> there and talk to Infra.
> > > > >>
> > > > >> Best regards,
> > > > >> Marco
> > > > >>
> > > > >> On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <
> > > > >> kellen.sunderland@gmail.com> wrote:
> > > > >>
> > > > >>> Sorry, [1] meant to reference
> > > > >>> https://issues.jenkins-ci.org/browse/JENKINS-37984 .
> > > > >>>
> > > > >>> On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <
> > > > >>> kellen.sunderland@gmail.com> wrote:
> > > > >>>
> > > > > >>>> Marco and I ran into another urgent issue over the weekend that was
> > > > > >>>> causing builds to fail.  This issue was unrelated to any feature
> > > > > >>>> development work, or other CI fixes applied recently, but it did require
> > > > > >>>> quite a bit of work from Marco (and a little from me) to fix.
> > > > >>>>
> > > > > >>>> We spent enough time on the problem that it caused us to take a step back
> > > > > >>>> and consider how we could both fix issues in CI and support the 1.4 release
> > > > > >>>> with the least impact possible on MXNet devs.  Marco had planned to make a
> > > > > >>>> significant change to the CI to fix a long-standing Jenkins error [1], but
> > > > > >>>> we feel that most developers would prioritize having a stable build
> > > > > >>>> environment for the next few weeks over having this fix in place.
> > > > >>>>
> > > > > >>>> To properly introduce a new CI system the intent was to do a gradual
> > > > > >>>> blue/green roll out of the fix.  To manage this rollout would have taken
> > > > > >>>> operational effort and double compute load as we run systems in parallel.
> > > > > >>>> This risks outages due to scaling limits, and we’d rather make this change
> > > > > >>>> during a period of low-developer activity, i.e. shortly after the 1.4
> > > > > >>>> release.
> > > > >>>>
> > > > > >>>> This means that from now until the 1.4 release, in order to reduce
> > > > > >>>> complexity MXNet developers should only see a single Jenkins verification
> > > > > >>>> check, and a single Travis check.
> > > > >>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> >
> > --
> > Sincerely,
> > Gavin M. Bell
> >
> >  "Never mistake a clear view for a short distance."
> >               -Paul Saffo
> >
>
