mxnet-dev mailing list archives

From Joe Evans <joseph.ev...@gmail.com>
Subject Re: CI Pipeline Change Proposal
Date Tue, 31 Mar 2020 16:37:09 GMT
Thanks everyone for your input. I've created an issue for tracking the
increased sanity build time, but this should be treated as a separate
project.

https://github.com/apache/incubator-mxnet/issues/17945

In the meantime, to keep momentum going on the staggered build pipeline
project, please let me know if there are any concerns about moving forward.

Thanks!

On Fri, Mar 27, 2020 at 7:57 AM Marco de Abreu <marco.g.abreu@gmail.com>
wrote:

> You can use the docker cache images yourself. They're available on Dockerhub;
> you just have to tweak the docker run invocation.
>
> The thing is that the scripts CI uses are written with the intention that
> layers change, and thus the cache is used.
>
> If you want to be able to change the layers, then you have to accept the
> fact that docker build is required. If you are fine with accepting whatever
> is published on Dockerhub, then you can use docker run.
>
> The thing which is missing here is a flag in build.py which states whether
> the local dir or the remote cache should be used to create the docker
> environment. Your dir would still be mounted into the container, but the
> layers would not necessarily match what's in your dir. That is, if you
> change the layer scripts, those changes would obviously not be available in
> the image if you consume Dockerhub.
>
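> A rough sketch of what that flag could look like (the flag name and the
> image name are assumptions for illustration, not existing build.py
> options):
>
>     # hypothetical sketch of a --cache-source option for ci/build.py
>     import argparse
>     import subprocess
>
>     parser = argparse.ArgumentParser()
>     parser.add_argument("--cache-source", choices=["local", "dockerhub"],
>                         default="local")
>     args = parser.parse_args()
>
>     tag = "mxnetci/build.ubuntu_cpu"  # placeholder image name
>     if args.cache_source == "dockerhub":
>         # take the layers exactly as published, even if local scripts differ
>         subprocess.check_call(["docker", "pull", tag])
>     else:
>         # rebuild from the local scripts (dockerfile selection elided)
>         subprocess.check_call(["docker", "build", "-t", tag, "."])
>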
> Would such a feature help reduce the pain points you are encountering?
>
> With regards to the pinning: I think it is not feasible to avoid pinning
> but at the same time expect everything to stay the same. I can recommend
> tackling that concern by introducing pinning and having an automated system
> (or an assigned person) test later versions on a recurring basis.
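>
> For the automated part, a minimal sketch (assumes Python deps pinned in a
> requirements file; the PyPI JSON endpoint is real, the package pins are
> just examples):
>
>     import requests
>
>     pinned = {"numpy": "1.18.1", "requests": "2.23.0"}  # example pins
>     for pkg, version in pinned.items():
>         info = requests.get("https://pypi.org/pypi/%s/json" % pkg).json()
>         latest = info["info"]["version"]
>         if latest != version:
>             # a scheduled job could open an upgrade PR here instead
>             print("%s: pinned %s, latest %s" % (pkg, version, latest))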
>
> -Marco
>
> Aaron Markham <aaron.s.markham@gmail.com> wrote on Fri., Mar. 27, 2020,
> 15:48:
>
> > Sure. That's the fix for now.
> >
> > But, I've noticed that when that's done and there's no process to enforce
> > upgrades and patching, these get really out of date and the problems
> > compound.
> >
> > Plus, when I build locally using docker, I can never seem to get the
> > benefit of the cache. Or at least not in the way I'd expect it to be.
> >
> > On one hand I get to discover all these bugs before they hit prod. Ha.
> > But I'd like to have dockerfiles that pull base images that have v1.6.0 and
> > each patched major release. Ideally I'd have ones for each language binding
> > too.
> > Seems like that might really simplify trying to work with a particular
> > issue. Instead, I'm constantly rebuilding binaries and having to reset the
> > submodules and make clean.
> > Seems even more relevant now that we're maintaining master and 1.7.x and,
> > at least for me, 1.6.x.
> >
> >
> >
> > On Fri, Mar 27, 2020, 00:55 Marco de Abreu <marco.g.abreu@gmail.com>
> > wrote:
> >
> > > What about dependency pinning?
> > >
> > > The cache should not be our method to do dependency pinning and
> > > synchronization.
> > >
> > > -Marco
> > >
> > > Aaron Markham <aaron.s.markham@gmail.com> wrote on Fri., Mar. 27, 2020,
> > > 03:45:
> > >
> > > > I'm dealing with a Ruby dep breaking the site build right now.
> > > > I wish this would happen on an occasion that I choose, not whenever Ruby
> > > > or some other dependency releases a new version. When the cache expires
> > > > for Jekyll, the site won't publish anymore... and CI will be blocked for
> > > > the website test.
> > > >
> > > > If we built the base OS and main deps once when we do a minor release
> > > > and uploaded that to dockerhub, then we'd save build time and avoid
> > > > things breaking randomly. Users could use those docker images too. At
> > > > release time we'd do a round of updates and testing when we're ready.
> > > > Can we find a balance between caching, prebuilt docker images,
> > > > freshness, and efficiency?
> > > >
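> > > > A sketch of that release-time publish step (the image and tag names
> > > > are made up for illustration; docker tag/push are the real commands):
> > > >
> > > >     import subprocess
> > > >
> > > >     # after building and testing the base image for a release...
> > > >     src = "mxnet-ci-base:latest"   # hypothetical local tag
> > > >     dst = "mxnet/ci-base:1.6.0"    # hypothetical public tag
> > > >     subprocess.check_call(["docker", "tag", src, dst])
> > > >     subprocess.check_call(["docker", "push", dst])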
> > > >
> > > > On Thu, Mar 26, 2020, 14:31 Marco de Abreu <marco.g.abreu@gmail.com>
> > > > wrote:
> > > >
> > > > > Correct. But I'm surprised about 2:50min to pull down the images.
> > > > >
> > > > > Maybe it makes sense to use ECR as a mirror?
> > > > >
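> > > > > Something like this could mirror the cache images into an ECR repo
> > > > > in the same region as the CI workers (account, region and repo names
> > > > > are placeholders; the repo must exist and docker must be logged in):
> > > > >
> > > > >     import subprocess
> > > > >
> > > > >     src = "mxnetci/build.ubuntu_cpu"  # Dockerhub (placeholder)
> > > > >     dst = ("123456789012.dkr.ecr.us-west-2.amazonaws.com"
> > > > >            "/build.ubuntu_cpu")
> > > > >     subprocess.check_call(["docker", "pull", src])
> > > > >     subprocess.check_call(["docker", "tag", src, dst])
> > > > >     subprocess.check_call(["docker", "push", dst])
> > > > >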
> > > > > -Marco
> > > > >
> > > > > Joe Evans <joseph.evans@gmail.com> wrote on Thu., Mar. 26, 2020,
> > > > > 22:02:
> > > > >
> > > > > > +1 on rebuilding the containers regularly without caching layers.
> > > > > >
> > > > > > We are both pulling down a bunch of docker layers (when docker pulls
> > > > > > an image) and then building a new container to run the sanity build
> > > > > > in. Pulling down all the layers is what is taking so long (2m50s).
> > > > > > Within the docker build, all the layers are cached, so it doesn't
> > > > > > take long. Unless I'm missing something, it doesn't make much sense
> > > > > > to be rebuilding the image every build.
> > > > > >
> > > > > > On Thu, Mar 26, 2020 at 1:12 PM Lausen, Leonard
> > > > > > <lausen@amazon.com.invalid> wrote:
> > > > > >
> > > > > > > WRT Docker Cache: We need to add a mechanism to invalidate the
> > > > > > > cache and rebuild the containers on a set schedule. The builds
> > > > > > > break too often, and the breakage is only detected when a
> > > > > > > contributor touches the Dockerfiles (manually causing cache
> > > > > > > invalidation).
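> > > > > > >
> > > > > > > One cheap way to get scheduled invalidation (a sketch; the
> > > > > > > build-arg name is made up, and the Dockerfile would need a
> > > > > > > matching ARG near the top so later layers rebuild):
> > > > > > >
> > > > > > >     import datetime
> > > > > > >     import subprocess
> > > > > > >
> > > > > > >     week = datetime.date.today().strftime("%G-%V")  # ISO week
> > > > > > >     subprocess.check_call([
> > > > > > >         "docker", "build",
> > > > > > >         "--build-arg", "CACHE_BUST=" + week,  # changes weekly
> > > > > > >         "-t", "mxnet-ci:sanity", "ci/docker/",
> > > > > > >     ])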
> > > > > > >
> > > > > > > On Thu, 2020-03-26 at 16:06 -0400, Aaron Markham wrote:
> > > > > > > > I think it is a good idea to do the sanity check first. Even at
> > > > > > > > 10 minutes. And also try to fix the docker cache situation, but
> > > > > > > > those can be separate tasks.
> > > > > > > >
> > > > > > > > On Thu, Mar 26, 2020, 12:52 Marco de Abreu
> > > > > > > > <marco.g.abreu@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Jenkins doesn't load for me, so let me ask this way: are we
> > > > > > > > > actually rebuilding every single time, or do you mean the
> > > > > > > > > docker cache? Pulling the cache should only take a few seconds
> > > > > > > > > from my experience - docker build should be a no-op in most
> > > > > > > > > cases.
> > > > > > > > >
> > > > > > > > > -Marco
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Joe Evans <joseph.evans@gmail.com> wrote on Thu., Mar. 26,
> > > > > > > > > 2020, 20:46:
> > > > > > > > >
> > > > > > > > > > The sanity-lint check pulls a docker image cache, builds a
> > > > > > > > > > new container and runs inside it. The docker setup is taking
> > > > > > > > > > around 3 minutes, at least:
> > > > > > > > > >
> > > > > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fsanity/detail/master/1764/pipeline/39
> > > > > > > > > >
> > > > > > > > > > We could improve this by not having to build a new container
> > > > > > > > > > every time. Also, our CI containers are huge, so it takes a
> > > > > > > > > > while to pull them down. I'm sure we could reduce their size
> > > > > > > > > > by being a bit more careful in building them too.
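> > > > > > > > > >
> > > > > > > > > > To see where the size goes, something like this works (the
> > > > > > > > > > image name is a placeholder for whichever CI image you
> > > > > > > > > > pull):
> > > > > > > > > >
> > > > > > > > > >     import subprocess
> > > > > > > > > >
> > > > > > > > > >     out = subprocess.check_output(
> > > > > > > > > >         ["docker", "history",
> > > > > > > > > >          "--format", "{{.Size}}\t{{.CreatedBy}}",
> > > > > > > > > >          "mxnetci/build.ubuntu_cpu"],  # placeholder
> > > > > > > > > >         universal_newlines=True,
> > > > > > > > > >     )
> > > > > > > > > >     print(out)  # one line per layer: size + instruction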
> > > > > > > > > >
> > > > > > > > > > Joe
> > > > > > > > > >
> > > > > > > > > > On Thu, Mar 26, 2020 at 12:33 PM Marco de Abreu
> > > > > > > > > > <marco.g.abreu@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Do you know what's driving the duration for sanity? It
> > > > > > > > > > > used to be 50 sec execution and 60 sec preparation.
> > > > > > > > > > >
> > > > > > > > > > > -Marco
> > > > > > > > > > >
> > > > > > > > > > > Joe Evans <joseph.evans@gmail.com> wrote on Thu., Mar.
> > > > > > > > > > > 26, 2020, 20:31:
> > > > > > > > > > > > Thanks Marco and Aaron for your input.
> > > > > > > > > > > >
> > > > > > > > > > > > > Can you show by how much the duration will increase?
> > > > > > > > > > > >
> > > > > > > > > > > > The average sanity build time is around 10min, while the
> > > > > > > > > > > > average build time for unix-cpu is about 2 hours, so the
> > > > > > > > > > > > entire build pipeline would increase by 2 hours if we
> > > > > > > > > > > > required both unix-cpu and sanity to complete in parallel.
> > > > > > > > > > > > I took a look at the CloudWatch metrics we're saving for
> > > > > > > > > > > > Jenkins jobs. Here is the failure rate per job, based on
> > > > > > > > > > > > builds triggered by PRs in the past year. As you can see,
> > > > > > > > > > > > the sanity build failure rate is still fairly high, so
> > > > > > > > > > > > gating on it would save a lot of unneeded build jobs.
> > > > > > > > > > > >
> > > > > > > > > > > > Job            Successful  Failed  Failure Rate
> > > > > > > > > > > > sanity               6900    2729        28.34%
> > > > > > > > > > > > unix-cpu             4268    4786        52.86%
> > > > > > > > > > > > unix-gpu             3686    5637        60.46%
> > > > > > > > > > > > centos-cpu           6777    2809        29.30%
> > > > > > > > > > > > centos-gpu           6318    3350        34.65%
> > > > > > > > > > > > clang                7879    1588        16.77%
> > > > > > > > > > > > edge                 7654    1933        20.16%
> > > > > > > > > > > > miscellaneous        8090    1510        15.73%
> > > > > > > > > > > > website              7226    2179        23.17%
> > > > > > > > > > > > windows-cpu          6084    3621        37.31%
> > > > > > > > > > > > windows-gpu          5191    4721        47.63%
> > > > > > > > > > > >
> > > > > > > > > > > > We can start by requiring only the sanity job to
> > > > > > > > > > > > complete before triggering the rest, and collect data to
> > > > > > > > > > > > decide if it makes sense to change it from there. Any
> > > > > > > > > > > > objections to this approach?
> > > > > > > > > > > >
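> > > > > > > > > > > > As a back-of-envelope check (the failure rate comes
> > > > > > > > > > > > from the table above; the per-job machine-hours are
> > > > > > > > > > > > rough placeholders, with 2h being the unix-cpu figure):
> > > > > > > > > > > >
> > > > > > > > > > > >     sanity_fail_rate = 2729 / (6900 + 2729)  # ~0.283
> > > > > > > > > > > >
> > > > > > > > > > > >     # assumed machine-hours per PR for the gated jobs
> > > > > > > > > > > >     hours = {"unix-cpu": 2.0, "unix-gpu": 2.0,
> > > > > > > > > > > >              "centos-cpu": 1.5, "centos-gpu": 1.5,
> > > > > > > > > > > >              "windows-cpu": 1.0, "windows-gpu": 1.0}
> > > > > > > > > > > >
> > > > > > > > > > > >     saved = sanity_fail_rate * sum(hours.values())
> > > > > > > > > > > >     print(round(saved, 2))  # ~2.55 machine-hours/PR
> > > > > > > > > > > >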
> > > > > > > > > > > > Thanks.
> > > > > > > > > > > > Joe
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Mar 25, 2020 at 9:35 AM Marco de Abreu
> > > > > > > > > > > > <marco.g.abreu@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Back then I created a system which exports all Jenkins
> > > > > > > > > > > > > results to CloudWatch. It does not include individual
> > > > > > > > > > > > > test results, but rather stages and jobs. The data for
> > > > > > > > > > > > > the sanity check should be available there.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Something I'd also be curious about is the percentage
> > > > > > > > > > > > > of failures within one run. That is, if a commit
> > > > > > > > > > > > > failed, were there multiple jobs failing (indicating an
> > > > > > > > > > > > > error in the code) or only one or two (indicating
> > > > > > > > > > > > > flakiness)? This should give us a proper understanding
> > > > > > > > > > > > > of how unnecessary these runs really are.
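> > > > > > > > > > > > >
> > > > > > > > > > > > > If someone wants to pull those numbers, a boto3 sketch
> > > > > > > > > > > > > (namespace, metric and dimension names are guesses;
> > > > > > > > > > > > > check what the exporter actually publishes):
> > > > > > > > > > > > >
> > > > > > > > > > > > >     from datetime import datetime
> > > > > > > > > > > > >     import boto3
> > > > > > > > > > > > >
> > > > > > > > > > > > >     cw = boto3.client("cloudwatch",
> > > > > > > > > > > > >                       region_name="us-west-2")
> > > > > > > > > > > > >     resp = cw.get_metric_statistics(
> > > > > > > > > > > > >         Namespace="MXNetCI",     # assumed
> > > > > > > > > > > > >         MetricName="JobFailed",  # assumed
> > > > > > > > > > > > >         Dimensions=[{"Name": "Job", "Value": "sanity"}],
> > > > > > > > > > > > >         StartTime=datetime(2019, 3, 25),
> > > > > > > > > > > > >         EndTime=datetime(2020, 3, 25),
> > > > > > > > > > > > >         Period=86400, Statistics=["Sum"],
> > > > > > > > > > > > >     )
> > > > > > > > > > > > >     print(sum(p["Sum"] for p in resp["Datapoints"]))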
> > > > > > > > > > > > >
> > > > > > > > > > > > > -Marco
> > > > > > > > > > > > >
> > > > > > > > > > > > > Aaron Markham <aaron.s.markham@gmail.com> wrote on
> > > > > > > > > > > > > Wed., Mar. 25, 2020, 16:53:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > +1 for sanity check - that's fast.
> > > > > > > > > > > > > > -1 for unix-cpu - that's slow and can just hang.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So my suggestion would be to see the data broken
> > > > > > > > > > > > > > apart - what's the failure rate on the sanity check
> > > > > > > > > > > > > > and on unix-cpu? Actually, can we get a table of all
> > > > > > > > > > > > > > of the tests with this data?!
> > > > > > > > > > > > > > If the sanity check fails... let's say 20% of the
> > > > > > > > > > > > > > time, but only takes a couple of minutes, then ya,
> > > > > > > > > > > > > > let's stack it and do that one first.
> > > > > > > > > > > > > > I think unix-cpu needs to be broken apart. It's too
> > > > > > > > > > > > > > complex and fails in multiple ways. Isolate the
> > > > > > > > > > > > > > brittle parts. Then we can restart/disable those as
> > > > > > > > > > > > > > needed, while all of the other parts pass and don't
> > > > > > > > > > > > > > have to be rerun.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Mar 25, 2020 at 1:32 AM Marco de Abreu
> > > > > > > > > > > > > > <marco.g.abreu@gmail.com> wrote:
> > > > > > > > > > > > > > > We had this structure in the past, and the
> > > > > > > > > > > > > > > community was bothered by CI taking more time, so
> > > > > > > > > > > > > > > we moved to the current model with everything
> > > > > > > > > > > > > > > parallelized. We'd basically be reverting that.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Can you show by how much the duration will
> > > > > > > > > > > > > > > increase?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Also, we have zero test parallelisation; that is,
> > > > > > > > > > > > > > > we are running one test at a time on 72-core
> > > > > > > > > > > > > > > machines (although with multiple workers). Wouldn't
> > > > > > > > > > > > > > > it be far more efficient to add parallelisation and
> > > > > > > > > > > > > > > thus heavily reduce the time spent on the tasks,
> > > > > > > > > > > > > > > instead of staggering?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I am concerned that these measures to save cost
> > > > > > > > > > > > > > > are paid for in the form of a worse user
> > > > > > > > > > > > > > > experience. I see big potential to save costs by
> > > > > > > > > > > > > > > increasing efficiency while actually improving the
> > > > > > > > > > > > > > > user experience, since CI would be faster.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -Marco
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Joe Evans <joseph.evans@gmail.com> wrote on Wed.,
> > > > > > > > > > > > > > > Mar. 25, 2020, 04:58:
> > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > First, I just wanted to introduce myself to the
> > > > > > > > > > > > > > > > MXNet community. I'm Joe and will be working with
> > > > > > > > > > > > > > > > Chai and the AWS team to improve some issues
> > > > > > > > > > > > > > > > around MXNet CI. One of our goals is to reduce
> > > > > > > > > > > > > > > > the costs associated with running MXNet CI. The
> > > > > > > > > > > > > > > > task I'm working on now is this issue:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/17802
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Proposal: Staggered Jenkins CI pipeline
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Based on data collected from Jenkins, around 55%
> > > > > > > > > > > > > > > > of the time when the mxnet-validation CI build is
> > > > > > > > > > > > > > > > triggered by a PR, either the sanity or unix-cpu
> > > > > > > > > > > > > > > > build fails. When either of these builds fails,
> > > > > > > > > > > > > > > > it doesn't make sense to run the rest of the
> > > > > > > > > > > > > > > > pipelines and utilize all those resources if
> > > > > > > > > > > > > > > > we've already identified a build or unit test
> > > > > > > > > > > > > > > > failure.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > We are proposing changing the MXNet Jenkins CI
> > > > > > > > > > > > > > > > pipeline by requiring the *sanity* and *unix-cpu*
> > > > > > > > > > > > > > > > builds to complete and pass tests successfully
> > > > > > > > > > > > > > > > before starting the other build pipelines
> > > > > > > > > > > > > > > > (centos-cpu/gpu, unix-gpu, windows-cpu/gpu,
> > > > > > > > > > > > > > > > etc.). Once the sanity builds complete
> > > > > > > > > > > > > > > > successfully, the remaining build pipelines will
> > > > > > > > > > > > > > > > be triggered and run in parallel (as they
> > > > > > > > > > > > > > > > currently do). The purpose of this change is to
> > > > > > > > > > > > > > > > identify faulty code or compatibility issues
> > > > > > > > > > > > > > > > early and prevent further execution of CI builds.
> > > > > > > > > > > > > > > > This will increase the time required to test a
> > > > > > > > > > > > > > > > PR, but will prevent unnecessary builds from
> > > > > > > > > > > > > > > > running.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Does anyone have any concerns with this change,
> > > > > > > > > > > > > > > > or suggestions?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Joe Evans
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > joseph.evans@gmail.com
