mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@gmail.com>
Subject Re: CI Pipeline Change Proposal
Date Thu, 26 Mar 2020 20:58:48 GMT
The job which rebuilds the cache has a property where you can set whether
to rebuild the cache from scratch or not. You could duplicate that job,
disable publishing and enable rebuild. Then add an alarm to the result and
you should be golden.

-Marco
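
A minimal sketch of the "rebuild from scratch, do not publish" idea above, in Python. The Dockerfile path, image tag, and function names are hypothetical; how the actual cache job is parameterized is not shown in this thread. `--no-cache` and `--pull` are standard `docker build` flags.

```python
import subprocess
from typing import List


def scratch_build_command(dockerfile: str, tag: str, context: str = ".") -> List[str]:
    """Return the argv for a docker build that reuses no cached layers."""
    return [
        "docker", "build",
        "--no-cache",  # ignore cached layers so stale layers cannot hide breakage
        "--pull",      # re-fetch the base image as well
        "-f", dockerfile,
        "-t", tag,
        context,
    ]


def rebuild_without_publishing(dockerfile: str, tag: str) -> bool:
    """Run the scratch build and report success. The image is never pushed."""
    result = subprocess.run(scratch_build_command(dockerfile, tag))
    return result.returncode == 0
```

A scheduled job could call `rebuild_without_publishing(...)` for each Dockerfile and raise an alarm whenever it returns `False`.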

Lausen, Leonard <lausen@amazon.com.invalid> wrote on Thu, 26 Mar 2020, 21:12:

> WRT Docker Cache: We need to add a mechanism to invalidate the cache and
> rebuild the containers on a set schedule. The builds break too often and the
> breakage is only detected when a contributor touches the Dockerfiles
> (manually causing cache invalidation).
>
> On Thu, 2020-03-26 at 16:06 -0400, Aaron Markham wrote:
> > I think it is a good idea to do the sanity check first. Even at 10 minutes.
> > And also try to fix the docker cache situation, but those can be separate
> > tasks.
> >
> > On Thu, Mar 26, 2020, 12:52 Marco de Abreu <marco.g.abreu@gmail.com> wrote:
> >
> > > Jenkins doesn't load for me, so let me ask this way: are we actually
> > > rebuilding every single time or do you mean the docker cache? Pulling the
> > > cache should only take a few seconds from my experience - docker build
> > > should be a no-op in most cases.
> > >
> > > -Marco
> > >
> > >
> > > Joe Evans <joseph.evans@gmail.com> wrote on Thu, 26 Mar 2020, 20:46:
> > >
> > > > The sanity-lint check pulls a docker image cache, builds a new container
> > > > and runs inside. The docker setup is taking around 3 minutes, at least:
> > > >
> > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fsanity/detail/master/1764/pipeline/39
> > > >
> > > > We could improve this by not having to build a new container every time.
> > > > Also, our CI containers are huge so it takes a while to pull them down.
> > > > I'm sure we could reduce the size by being a bit more careful in building
> > > > them too.
> > > >
> > > > Joe
> > > >
> > > > On Thu, Mar 26, 2020 at 12:33 PM Marco de Abreu <marco.g.abreu@gmail.com>
> > > > wrote:
> > > >
> > > > > Do you know what's driving the duration for sanity? It used to be 50 sec
> > > > > execution and 60 sec preparation.
> > > > >
> > > > > -Marco
> > > > >
> > > > > Joe Evans <joseph.evans@gmail.com> wrote on Thu, 26 Mar 2020, 20:31:
> > > > > > Thanks Marco and Aaron for your input.
> > > > > >
> > > > > > > Can you show by how much the duration will increase?
> > > > > >
> > > > > > The average sanity build time is around 10 min, while the average build
> > > > > > time for unix-cpu is about 2 hours, so the entire build pipeline would
> > > > > > increase by 2 hours if we required both unix-cpu and sanity (run in
> > > > > > parallel) to complete first.
> > > > > >
> > > > > > I took a look at the CloudWatch metrics we're saving for Jenkins jobs.
> > > > > > Here is the failure rate per job, based on builds triggered by PRs in the
> > > > > > past year. As you can see, the sanity build failure rate is still fairly
> > > > > > high, so gating on sanity would save a lot of unneeded build jobs.
> > > > > >
> > > > > > Job            Successful  Failed  Failure Rate
> > > > > > sanity               6900    2729        28.34%
> > > > > > unix-cpu             4268    4786        52.86%
> > > > > > unix-gpu             3686    5637        60.46%
> > > > > > centos-cpu           6777    2809        29.30%
> > > > > > centos-gpu           6318    3350        34.65%
> > > > > > clang                7879    1588        16.77%
> > > > > > edge                 7654    1933        20.16%
> > > > > > miscellaneous        8090    1510        15.73%
> > > > > > website              7226    2179        23.17%
> > > > > > windows-cpu          6084    3621        37.31%
> > > > > > windows-gpu          5191    4721        47.63%
> > > > > >
> > > > > > We can start by requiring only the sanity job to complete before triggering
> > > > > > the rest, and collect data to decide if it makes sense to change it from
> > > > > > there. Any objections to this approach?
> > > > > >
> > > > > > Thanks.
> > > > > > Joe
> > > > > >
> > > > > >
> > > > > > On Wed, Mar 25, 2020 at 9:35 AM Marco de Abreu <marco.g.abreu@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > A while back I created a system which exports all Jenkins results to
> > > > > > > CloudWatch. It does not include individual test results but rather
> > > > > > > stages and jobs. The data for the sanity check should be available there.
> > > > > > >
> > > > > > > Something I'd also be curious about is the percentage of failures within
> > > > > > > one run. That is, if a commit failed, were multiple jobs failing
> > > > > > > (indicating an error in the code) or only one or two (indicating
> > > > > > > flakiness)? This should give us a proper understanding of how unnecessary
> > > > > > > these runs really are.
> > > > > > >
> > > > > > > -Marco
> > > > > > >
> > > > > > > Aaron Markham <aaron.s.markham@gmail.com> wrote on Wed, 25 Mar 2020, 16:53:
> > > > > > >
> > > > > > > > +1 for sanity check - that's fast.
> > > > > > > > -1 for unix-cpu - that's slow and can just hang.
> > > > > > > >
> > > > > > > > So my suggestion would be to look at the data separately - what's the
> > > > > > > > failure rate on the sanity check and the unix-cpu? Actually, can we get a
> > > > > > > > table of all of the tests with this data?!
> > > > > > > > If the sanity check fails... let's say 20% of the time, but only takes
> > > > > > > > a couple of minutes, then ya, let's stack it and do that one first.
> > > > > > > > I think unix-cpu needs to be broken apart. It's too complex and fails
> > > > > > > > in multiple ways. Isolate the brittle parts. Then we can
> > > > > > > > restart/disable those as needed, while all of the other parts pass and
> > > > > > > > don't have to be rerun.
> > > > > > > >
> > > > > > > > On Wed, Mar 25, 2020 at 1:32 AM Marco de Abreu <marco.g.abreu@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > We had this structure in the past and the community was bothered by CI
> > > > > > > > > taking more time, thus we moved to the current model with everything
> > > > > > > > > parallelized. We'd basically revert that then.
> > > > > > > > >
> > > > > > > > > Can you show by how much the duration will increase?
> > > > > > > > >
> > > > > > > > > Also, we have zero test parallelisation, that is, we are running one test
> > > > > > > > > at a time on 72-core machines (although with multiple workers). Wouldn't
> > > > > > > > > it be way more efficient to add parallelisation and thus heavily reduce
> > > > > > > > > the time spent on the tasks instead of staggering?
> > > > > > > > >
> > > > > > > > > I feel concerned that these measures to save cost are paid for in the
> > > > > > > > > form of a worse user experience. I see a big potential to save costs by
> > > > > > > > > increasing efficiency while actually improving the user experience due
> > > > > > > > > to CI being faster.
> > > > > > > > >
> > > > > > > > > -Marco
> > > > > > > > >
> > > > > > > > > Joe Evans <joseph.evans@gmail.com> wrote on Wed, 25 Mar 2020, 04:58:
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > First, I just wanted to introduce myself to the MXNet community. I’m Joe
> > > > > > > > > > and will be working with Chai and the AWS team to improve some issues
> > > > > > > > > > around MXNet CI. One of our goals is to reduce the costs associated with
> > > > > > > > > > running MXNet CI. The task I’m working on now is this issue:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/17802
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Proposal: Staggered Jenkins CI pipeline
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Based on data collected from Jenkins, around 55% of the time when the
> > > > > > > > > > mxnet-validation CI build is triggered by a PR, either the sanity or
> > > > > > > > > > unix-cpu builds fail. When either of these builds fails, it doesn’t make
> > > > > > > > > > sense to run the rest of the pipelines and utilize all those resources if
> > > > > > > > > > we’ve already identified a build or unit test failure.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > We are proposing changing the MXNet Jenkins CI pipeline by requiring the
> > > > > > > > > > *sanity* and *unix-cpu* builds to complete and pass tests successfully
> > > > > > > > > > before starting the other build pipelines (centos-cpu/gpu, unix-gpu,
> > > > > > > > > > windows-cpu/gpu, etc.). Once the sanity builds successfully complete, the
> > > > > > > > > > remaining build pipelines will be triggered and run in parallel (as they
> > > > > > > > > > currently do). The purpose of this change is to identify faulty code or
> > > > > > > > > > compatibility issues early and prevent further execution of CI builds.
> > > > > > > > > > This will increase the time required to test a PR, but will prevent
> > > > > > > > > > unnecessary builds from running.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Does anyone have any concerns with this change or suggestions?
> > > > > > > > > >
> > > > > > > > > > Thanks.
> > > > > > > > > >
> > > > > > > > > > Joe Evans
> > > > > > > > > >
> > > > > > > > > > joseph.evans@gmail.com
> > > > > > > > > >
>
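
The failure-rate table and the duration estimate quoted in this thread can be reproduced with a short Python sketch. The counts and the two average durations (10 minutes for sanity, about 2 hours for unix-cpu) are copied from the quoted messages; the function names are illustrative. The latency estimate assumes the gate jobs run in parallel and that the remaining pipelines start only after the gate finishes, and it ignores queueing time.

```python
from typing import Dict, Iterable, Tuple

# (successful, failed) PR-triggered builds per job, copied from the thread.
JOB_COUNTS: Dict[str, Tuple[int, int]] = {
    "sanity": (6900, 2729),
    "unix-cpu": (4268, 4786),
    "unix-gpu": (3686, 5637),
    "centos-cpu": (6777, 2809),
    "centos-gpu": (6318, 3350),
    "clang": (7879, 1588),
    "edge": (7654, 1933),
    "miscellaneous": (8090, 1510),
    "website": (7226, 2179),
    "windows-cpu": (6084, 3621),
    "windows-gpu": (5191, 4721),
}

# Average build durations in minutes, as stated in the thread.
AVG_MINUTES: Dict[str, int] = {"sanity": 10, "unix-cpu": 120}


def failure_rate(successful: int, failed: int) -> float:
    """Fraction of builds that failed."""
    return failed / (successful + failed)


def added_latency_minutes(gate_jobs: Iterable[str]) -> int:
    """Gate jobs run in parallel, so the gate takes as long as its slowest job."""
    return max(AVG_MINUTES[job] for job in gate_jobs)


for job, (ok, bad) in JOB_COUNTS.items():
    print(f"{job:<15}{failure_rate(ok, bad):>8.2%}")

print("gate on sanity only:      +", added_latency_minutes(["sanity"]), "min")
print("gate on sanity + unix-cpu: +", added_latency_minutes(["sanity", "unix-cpu"]), "min")
```

Under these assumptions, gating on sanity alone adds about 10 minutes to a passing PR, and the downstream pipelines would be skipped for roughly the share of builds given by the sanity failure rate.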
