mxnet-dev mailing list archives

From Aaron Markham <aaron.s.mark...@gmail.com>
Subject Re: CI Pipeline Change Proposal
Date Fri, 27 Mar 2020 02:44:39 GMT
I'm dealing with a Ruby dep breaking the site build right now. I wish this
happened on an occasion that I choose, not whenever Ruby or some other
dependency releases a new version. When the cache expires for Jekyll, the
site won't publish anymore... and CI will be blocked for the website test.

If we built the base OS and main deps once when we do a minor release and
uploaded that to Docker Hub, we'd save build time and avoid things breaking
randomly. Users could use those docker images too. At release time we'd do a
round of updates and testing when we're ready. Can we find a balance between
caching, prebuilt docker images, freshness, and efficiency?
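
To make that concrete, here's a rough sketch of the "build once per release,
pull the pinned tag everywhere else" idea. The registry, image name, and tag
scheme below are placeholders, not our existing ci/ tooling:

    import subprocess

    REGISTRY = "mxnetci"          # hypothetical Docker Hub organization
    IMAGE = "build.ubuntu_cpu"    # hypothetical CI base image
    RELEASE_TAG = "1.7.0"         # bumped per minor release, not per PR

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def publish_release_image(dockerfile_dir="docker"):
        """Done once at release time: full rebuild with fresh deps, then push."""
        ref = f"{REGISTRY}/{IMAGE}:{RELEASE_TAG}"
        run(["docker", "build", "--no-cache", "-t", ref, dockerfile_dir])
        run(["docker", "push", ref])

    def pull_release_image():
        """Done in CI (and by users): pull the pinned tag instead of rebuilding."""
        ref = f"{REGISTRY}/{IMAGE}:{RELEASE_TAG}"
        run(["docker", "pull", ref])
        return ref

    if __name__ == "__main__":
        pull_release_image()

CI (and users) would only ever pull the pinned tag; freshness comes from the
release-time rebuild rather than from cache expiry at random times.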


On Thu, Mar 26, 2020, 14:31 Marco de Abreu <marco.g.abreu@gmail.com> wrote:

> Correct. But I'm surprised about 2:50min to pull down the images.
>
> Maybe it makes sense to use ECR as a mirror?
>
> -Marco
>
> Joe Evans <joseph.evans@gmail.com> schrieb am Do., 26. März 2020, 22:02:
>
> > +1 on rebuilding the containers regularly without caching layers.
> >
> > We are both pulling down a bunch of docker layers (when docker pulls an
> > image) and then building a new container to run the sanity build in.
> > Pulling down all the layers is what is taking so long (2m50s.) Within the
> > docker build, all the layers are cached, so it doesn't take long. Unless
> > I'm missing something, it doesn't make much sense to be rebuilding the
> > image every build.
> >
> > On Thu, Mar 26, 2020 at 1:12 PM Lausen, Leonard <lausen@amazon.com.invalid> wrote:
> >
> > > WRT Docker Cache: We need to add a mechanism to invalidate the cache and
> > > rebuild the containers on a set schedule. The builds break too often and
> > > the breakage is only detected when a contributor touches the Dockerfiles
> > > (manually causing cache invalidation).
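
A scheduled rebuild like Leonard describes could be pretty small. Here's a
rough sketch of a "rebuild when the cached image is stale" job; the image
name, age cutoff, and trigger (say, a nightly Jenkins timer) are placeholders,
not our current CI scripts:

    import json
    import subprocess
    from datetime import datetime, timedelta, timezone

    IMAGE = "mxnetci/build.ubuntu_cpu:latest"   # hypothetical cached CI image
    MAX_AGE = timedelta(days=7)                 # assumed rebuild schedule

    def image_created(image):
        """Return the creation time of the local image as a UTC datetime."""
        out = subprocess.run(["docker", "image", "inspect", image],
                             capture_output=True, text=True, check=True).stdout
        created = json.loads(out)[0]["Created"][:19]   # e.g. "2020-03-26T22:02:31"
        return datetime.strptime(created, "%Y-%m-%dT%H:%M:%S").replace(
            tzinfo=timezone.utc)

    def rebuild_if_stale(image=IMAGE, dockerfile_dir="docker"):
        """Force a no-cache rebuild when the local image is older than MAX_AGE."""
        age = datetime.now(timezone.utc) - image_created(image)
        if age > MAX_AGE:
            subprocess.run(["docker", "build", "--no-cache", "-t", image,
                            dockerfile_dir], check=True)

    if __name__ == "__main__":
        rebuild_if_stale()

That way breakage from new upstream dependency releases shows up on the
schedule we choose, not whenever a contributor happens to touch a Dockerfile.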
> > >
> > > On Thu, 2020-03-26 at 16:06 -0400, Aaron Markham wrote:
> > > > I think it is a good idea to do the sanity check first. Even at 10
> > > > minutes. And also try to fix the docker cache situation, but those can
> > > > be separate tasks.
> > > >
> > > > On Thu, Mar 26, 2020, 12:52 Marco de Abreu <marco.g.abreu@gmail.com> wrote:
> > > >
> > > > > Jenkins doesn't load for me, so let me ask this way: are we actually
> > > > > rebuilding every single time, or do you mean the docker cache? Pulling
> > > > > the cache should only take a few seconds from my experience - docker
> > > > > build should be a no-op in most cases.
> > > > >
> > > > > -Marco
> > > > >
> > > > >
> > > > > Joe Evans <joseph.evans@gmail.com> schrieb am Do., 26. März 2020, 20:46:
> > > > >
> > > > > > The sanity-lint check pulls a docker image cache, builds a new
> > > > > > container and runs inside. The docker setup is taking around 3
> > > > > > minutes, at least:
> > > > > >
> > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fsanity/detail/master/1764/pipeline/39
> > > > > >
> > > > > > We could improve this by not having to build a new container every
> > > > > > time. Also, our CI containers are huge, so it takes a while to pull
> > > > > > them down. I'm sure we could reduce the size by being a bit more
> > > > > > careful in building them too.
> > > > > >
> > > > > > Joe
> > > > > >
> > > > > > On Thu, Mar 26, 2020 at 12:33 PM Marco de Abreu <marco.g.abreu@gmail.com> wrote:
> > > > > >
> > > > > > > Do you know what's driving the duration for sanity? It used to
> > > > > > > be 50 sec execution and 60 sec preparation.
> > > > > > >
> > > > > > > -Marco
> > > > > > >
> > > > > > > Joe Evans <joseph.evans@gmail.com> schrieb am Do., 26. März 2020, 20:31:
> > > > > > > > Thanks Marco and Aaron for your input.
> > > > > > > >
> > > > > > > > > Can you show by how much the duration will increase?
> > > > > > > >
> > > > > > > > The average sanity build time is around 10min, while the
> > > > > > > > average build time for unix-cpu is about 2 hours, so the entire
> > > > > > > > build pipeline would increase by 2 hours if we required both
> > > > > > > > unix-cpu and sanity to complete in parallel.
> > > > > > > >
> > > > > > > > I took a look at the CloudWatch metrics we're saving for
> > > > > > > > Jenkins jobs. Here is the failure rate per job, based on builds
> > > > > > > > triggered by PRs in the past year. As you can see, the sanity
> > > > > > > > build failure rate is still fairly high, so gating on it would
> > > > > > > > save a lot of unneeded build jobs.
> > > > > > > >
> > > > > > > > Job Successful Failed Failure Rate
> > > > > > > > sanity 6900 2729 28.34%
> > > > > > > > unix-cpu 4268 4786 52.86%
> > > > > > > > unix-gpu 3686 5637 60.46%
> > > > > > > > centos-cpu 6777 2809 29.30%
> > > > > > > > centos-gpu 6318 3350 34.65%
> > > > > > > > clang 7879 1588 16.77%
> > > > > > > > edge 7654 1933 20.16%
> > > > > > > > miscellaneous 8090 1510 15.73%
> > > > > > > > website 7226 2179 23.17%
> > > > > > > > windows-cpu 6084 3621 37.31%
> > > > > > > > windows-gpu 5191 4721 47.63%
> > > > > > > >
> > > > > > > > We can start by requiring only the sanity job to complete
> > > > > > > > before triggering the rest, and collect data to decide if it
> > > > > > > > makes sense to change it from there. Any objections to this
> > > > > > > > approach?
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > > Joe
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Mar 25, 2020 at 9:35 AM Marco de Abreu <marco.g.abreu@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Back then I created a system which exports all Jenkins
> > > > > > > > > results to CloudWatch. It does not include individual test
> > > > > > > > > results, but rather stages and jobs. The data for the sanity
> > > > > > > > > check should be available there.
> > > > > > > > >
> > > > > > > > > Something I'd also be curious about is the percentage of the
> > > > > > > > > failures in one run. That is, if a commit failed, were there
> > > > > > > > > multiple jobs failing (indicating an error in the code) or
> > > > > > > > > only one or two (indicating flakiness)? This should give us a
> > > > > > > > > proper understanding of how unnecessary these runs really are.
> > > > > > > > >
> > > > > > > > > -Marco
> > > > > > > > >
> > > > > > > > > Aaron Markham <aaron.s.markham@gmail.com> schrieb am Mi., 25. März 2020, 16:53:
> > > > > > > > >
> > > > > > > > > > +1 for sanity check - that's fast.
> > > > > > > > > > -1 for unix-cpu - that's slow and can just hang.
> > > > > > > > > >
> > > > > > > > > > So my suggestion would be to look at the data separately -
> > > > > > > > > > what's the failure rate on the sanity check and on
> > > > > > > > > > unix-cpu? Actually, can we get a table of all of the tests
> > > > > > > > > > with this data?!
> > > > > > > > > > If the sanity check fails... let's say 20% of the time, but
> > > > > > > > > > only takes a couple of minutes, then ya, let's stack it and
> > > > > > > > > > do that one first.
> > > > > > > > > > I think unix-cpu needs to be broken apart. It's too complex
> > > > > > > > > > and fails in multiple ways. Isolate the brittle parts. Then
> > > > > > > > > > we can restart/disable those as needed, while all of the
> > > > > > > > > > other parts pass and don't have to be rerun.
> > > > > > > > > >
> > > > > > > > > > On Wed, Mar 25, 2020 at 1:32 AM Marco de Abreu <marco.g.abreu@gmail.com> wrote:
> > > > > > > > > > > We had this structure in the past, and the community was
> > > > > > > > > > > bothered by CI taking more time, thus we moved to the
> > > > > > > > > > > current model with everything parallelized. We'd
> > > > > > > > > > > basically revert that then.
> > > > > > > > > > >
> > > > > > > > > > > Can you show by how much the duration will increase?
> > > > > > > > > > >
> > > > > > > > > > > Also, we have zero test parallelisation; that is, we are
> > > > > > > > > > > running one test at a time on 72-core machines (although
> > > > > > > > > > > with multiple workers). Wouldn't it be way more efficient
> > > > > > > > > > > to add parallelisation and thus heavily reduce the time
> > > > > > > > > > > spent on the tasks instead of staggering?
> > > > > > > > > > >
> > > > > > > > > > > I feel concerned that these measures to save cost are
> > > > > > > > > > > paid in the form of a worse user experience. I see a big
> > > > > > > > > > > potential to save costs by increasing efficiency while
> > > > > > > > > > > actually improving the user experience due to CI being
> > > > > > > > > > > faster.
> > > > > > > > > > >
> > > > > > > > > > > -Marco
> > > > > > > > > > >
> > > > > > > > > > > Joe Evans <joseph.evans@gmail.com> schrieb am Mi., 25. März 2020, 04:58:
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > First, I just wanted to introduce myself to the MXNet
> > > > > > > > > > > > community. I’m Joe and will be working with Chai and
> > > > > > > > > > > > the AWS team to improve some issues around MXNet CI.
> > > > > > > > > > > > One of our goals is to reduce the costs associated with
> > > > > > > > > > > > running MXNet CI. The task I’m working on now is this
> > > > > > > > > > > > issue:
> > > > > > > > > > > >
> > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/17802
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Proposal: Staggered Jenkins CI pipeline
> > > > > > > > > > > >
> > > > > > > > > > > > Based on data collected from Jenkins, around 55% of the
> > > > > > > > > > > > time when the mxnet-validation CI build is triggered by
> > > > > > > > > > > > a PR, either the sanity or unix-cpu builds fail. When
> > > > > > > > > > > > either of these builds fails, it doesn’t make sense to
> > > > > > > > > > > > run the rest of the pipelines and utilize all those
> > > > > > > > > > > > resources if we’ve already identified a build or unit
> > > > > > > > > > > > test failure.
> > > > > > > > > > > >
> > > > > > > > > > > > We are proposing changing the MXNet Jenkins CI pipeline
> > > > > > > > > > > > by requiring the *sanity* and *unix-cpu* builds to
> > > > > > > > > > > > complete and pass tests successfully before starting
> > > > > > > > > > > > the other build pipelines (centos-cpu/gpu, unix-gpu,
> > > > > > > > > > > > windows-cpu/gpu, etc.). Once the sanity builds
> > > > > > > > > > > > successfully complete, the remaining build pipelines
> > > > > > > > > > > > will be triggered and run in parallel (as they
> > > > > > > > > > > > currently do). The purpose of this change is to
> > > > > > > > > > > > identify faulty code or compatibility issues early and
> > > > > > > > > > > > prevent further execution of CI builds. This will
> > > > > > > > > > > > increase the time required to test a PR, but will
> > > > > > > > > > > > prevent unnecessary builds from running.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Does anyone have any concerns with this change or
> > > > > > > > > > > > suggestions?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks.
> > > > > > > > > > > >
> > > > > > > > > > > > Joe Evans
> > > > > > > > > > > >
> > > > > > > > > > > > joseph.evans@gmail.com
> > > > > > > > > > > >
> > >
> >
>
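
For what it's worth, the gating itself could be pretty lightweight. Below is a
rough sketch of "run sanity first, then fan out" driven through the Jenkins
REST API via the python-jenkins package; the job names, credentials, and
polling loop are placeholders, and the real change would live in the
Jenkinsfiles/job configuration rather than in a script like this:

    import time
    import jenkins  # pip install python-jenkins

    GATE_JOBS = ["mxnet-validation/sanity"]              # assumed job paths
    DOWNSTREAM_JOBS = ["mxnet-validation/unix-cpu",
                       "mxnet-validation/unix-gpu",
                       "mxnet-validation/centos-cpu",
                       "mxnet-validation/windows-cpu"]    # etc.

    server = jenkins.Jenkins("http://jenkins.example.com",
                             username="ci-bot", password="api-token")

    def wait_for(job, number, poll=30):
        """Poll a build until it finishes; return its result string."""
        while True:
            try:
                info = server.get_build_info(job, number)
            except jenkins.NotFoundException:
                time.sleep(poll)             # still queued, not started yet
                continue
            if not info["building"]:
                return info["result"]        # "SUCCESS", "FAILURE", ...
            time.sleep(poll)

    def run_staggered(params=None):
        # Stage 1: the cheap gate job(s) must pass first.
        for job in GATE_JOBS:
            # Note: ignoring queue races on nextBuildNumber for brevity.
            number = server.get_job_info(job)["nextBuildNumber"]
            server.build_job(job, parameters=params)
            if wait_for(job, number) != "SUCCESS":
                return False                 # fail fast, skip expensive builds
        # Stage 2: fan out the remaining pipelines in parallel, as today.
        for job in DOWNSTREAM_JOBS:
            server.build_job(job, parameters=params)
        return True

Per Joe's latest note, only sanity sits in the gate to start with; unix-cpu
stays in the parallel fan-out until the data says otherwise.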
