mxnet-dev mailing list archives

From Aaron Markham <aaron.s.mark...@gmail.com>
Subject Re: CI Pipeline Change Proposal
Date Fri, 27 Mar 2020 14:48:01 GMT
Sure. That's the fix for now.

But, I've noticed that when that's done and there's no process to enforce
upgrades and patching, these get really out of date and the problems
compound.

Plus, when I build locally using docker, I can never seem to get the
benefit of the cache. Or at least not in the way I'd expect it to be.

On one hand I get to discover all these bugs before they hit prod. Ha.
But I'd like to have dockerfiles that pull base images that have v1.6.0 and
each patched major release. Ideally I'd have ones for each language binding
too.
Seems like that might really simplify trying to work with a particular
issue. Instead, I'm constantly rebuilding binaries and having to reset the
submodules and make clean.
Seems even more relevant now that we're maintaining master and 1.7.x and at
least for me, 1.6.x.
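
Roughly what I have in mind - a hypothetical per-release Dockerfile. The base image tag and package versions here are made up for illustration; nothing like `mxnet/base:1.6.0` actually exists on Docker Hub today:

```dockerfile
# Hypothetical: a dev/CI image pinned to a released MXNet version.
# Image name and versions below are illustrative only.
FROM mxnet/base:1.6.0

# Pin the language binding the same way instead of pulling "latest"
# at build time.
RUN pip install --no-cache-dir mxnet==1.6.0

# Prebuilt binaries would live in the base image, so working on a
# particular issue starts from here instead of rebuilding and
# resetting submodules and running `make clean` every time.
WORKDIR /work/mxnet
```

One of these per patched release (and ideally per language binding) would make it cheap to reproduce an issue against a known-good build.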



On Fri, Mar 27, 2020, 00:55 Marco de Abreu <marco.g.abreu@gmail.com> wrote:

> What about dependency pinning?
>
> The cache should not be our method to do dependency pinning and
> synchronization.
>
> -Marco
>
> Aaron Markham <aaron.s.markham@gmail.com> wrote on Fri, 27 Mar 2020,
> 03:45:
>
> > I'm dealing with a Ruby dep breaking the site build right now.
> > I wish this happened on an occasion that I choose, not whenever Ruby or
> > dependency x releases a new version. When the cache expires for Jekyll,
> > the site won't publish anymore... and CI will be blocked for the website
> > test.
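
For the Jekyll case, pinning in the Gemfile would at least make upgrades opt-in. A minimal sketch - the gems and versions here are illustrative, not necessarily what the site actually uses:

```ruby
# Gemfile - illustrative pins only.
source 'https://rubygems.org'

# Exact pin: the build only changes when we bump this line.
gem 'jekyll', '3.8.6'

# Pessimistic pin: allows patch updates, blocks minor/major bumps.
gem 'jekyll-feed', '~> 0.13.0'
```

With a committed `Gemfile.lock`, `bundle install` resolves the same versions every time regardless of what upstream releases.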
> >
> > If we built the base OS and main deps once when we do a minor release and
> > uploaded that to Docker Hub, we'd save build time and avoid things
> > breaking randomly. Users could use those docker images too. At release
> > time we'd do a round of updates and testing when we're ready. Can we find
> > a balance between caching, prebuilt docker images, freshness, and
> > efficiency?
> >
> >
> > On Thu, Mar 26, 2020, 14:31 Marco de Abreu <marco.g.abreu@gmail.com>
> > wrote:
> >
> > > Correct. But I'm surprised it takes 2m50s to pull down the images.
> > >
> > > Maybe it makes sense to use ECR as a mirror?
> > >
> > > -Marco
> > >
> > > Joe Evans <joseph.evans@gmail.com> wrote on Thu, 26 Mar 2020, 22:02:
> > >
> > > > +1 on rebuilding the containers regularly without caching layers.
> > > >
> > > > We are both pulling down a bunch of docker layers (when docker pulls
> > > > an image) and then building a new container to run the sanity build
> > > > in. Pulling down all the layers is what is taking so long (2m50s).
> > > > Within the docker build, all the layers are cached, so it doesn't
> > > > take long. Unless I'm missing something, it doesn't make much sense
> > > > to be rebuilding the image every build.
> > > >
> > > > On Thu, Mar 26, 2020 at 1:12 PM Lausen, Leonard <lausen@amazon.com.invalid>
> > > > wrote:
> > > >
> > > > > WRT Docker Cache: We need to add a mechanism to invalidate the
> > > > > cache and rebuild the containers on a set schedule. The builds
> > > > > break too often, and the breakage is only detected when a
> > > > > contributor touches the Dockerfiles (manually causing cache
> > > > > invalidation).
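
A scheduled rebuild could look something like this as a standalone Jenkins job - the job layout, image tag, and Dockerfile path are hypothetical, not the actual mxnet-ci definitions:

```groovy
// Hypothetical weekly job that rebuilds the CI images from scratch,
// so cache rot is caught on a schedule rather than whenever a
// contributor happens to touch a Dockerfile.
pipeline {
    agent any
    triggers {
        // Every Sunday around 03:00; H spreads the exact start time.
        cron('H 3 * * 0')
    }
    stages {
        stage('Rebuild CI image without cache') {
            steps {
                // --no-cache forces every layer (apt/pip installs etc.)
                // to run again, surfacing upstream breakage early.
                sh 'docker build --no-cache -t mxnet-ci/build.ubuntu_cpu ci/docker/'
            }
        }
    }
}
```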
> > > > >
> > > > > On Thu, 2020-03-26 at 16:06 -0400, Aaron Markham wrote:
> > > > > > I think it is a good idea to do the sanity check first. Even at
> > > > > > 10 minutes. And also try to fix the docker cache situation, but
> > > > > > those can be separate tasks.
> > > > > >
> > > > > > On Thu, Mar 26, 2020, 12:52 Marco de Abreu <marco.g.abreu@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Jenkins doesn't load for me, so let me ask this way: are we
> > > > > > > actually rebuilding every single time, or do you mean the
> > > > > > > docker cache? Pulling the cache should only take a few seconds
> > > > > > > from my experience - docker build should be a no-op in most
> > > > > > > cases.
> > > > > > >
> > > > > > > -Marco
> > > > > > >
> > > > > > > Joe Evans <joseph.evans@gmail.com> wrote on Thu, 26 Mar 2020, 20:46:
> > > > > > >
> > > > > > > > The sanity-lint check pulls a docker image cache, builds a
> > > > > > > > new container, and runs inside it. The docker setup is taking
> > > > > > > > around 3 minutes, at least:
> > > > > > > >
> > > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fsanity/detail/master/1764/pipeline/39
> > > > > > > >
> > > > > > > > We could improve this by not having to build a new container
> > > > > > > > every time. Also, our CI containers are huge, so it takes a
> > > > > > > > while to pull them down. I'm sure we could reduce the size by
> > > > > > > > being a bit more careful in building them too.
> > > > > > > >
> > > > > > > > Joe
> > > > > > > >
> > > > > > > > On Thu, Mar 26, 2020 at 12:33 PM Marco de Abreu <marco.g.abreu@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Do you know what's driving the duration for sanity? It used
> > > > > > > > > to be 50 sec execution and 60 sec preparation.
> > > > > > > > >
> > > > > > > > > -Marco
> > > > > > > > >
> > > > > > > > > Joe Evans <joseph.evans@gmail.com> wrote on Thu, 26 Mar 2020, 20:31:
> > > > > > > > >
> > > > > > > > > > Thanks Marco and Aaron for your input.
> > > > > > > > > >
> > > > > > > > > > > Can you show by how much the duration will increase?
> > > > > > > > > >
> > > > > > > > > > The average sanity build time is around 10min, while the
> > > > > > > > > > average build time for unix-cpu is about 2 hours, so the
> > > > > > > > > > entire build pipeline would increase by 2 hours if we
> > > > > > > > > > required both unix-cpu and sanity to complete before
> > > > > > > > > > triggering the rest.
> > > > > > > > > > I took a look at the CloudWatch metrics we're saving for
> > > > > > > > > > Jenkins jobs. Here is the failure rate per job, based on
> > > > > > > > > > builds triggered by PRs in the past year. As you can see,
> > > > > > > > > > the sanity build failure rate is still fairly high, so
> > > > > > > > > > gating on sanity would save a lot of unneeded build jobs.
> > > > > > > > > >
> > > > > > > > > > Job            Successful  Failed  Failure Rate
> > > > > > > > > > sanity               6900    2729        28.34%
> > > > > > > > > > unix-cpu             4268    4786        52.86%
> > > > > > > > > > unix-gpu             3686    5637        60.46%
> > > > > > > > > > centos-cpu           6777    2809        29.30%
> > > > > > > > > > centos-gpu           6318    3350        34.65%
> > > > > > > > > > clang                7879    1588        16.77%
> > > > > > > > > > edge                 7654    1933        20.16%
> > > > > > > > > > miscellaneous        8090    1510        15.73%
> > > > > > > > > > website              7226    2179        23.17%
> > > > > > > > > > windows-cpu          6084    3621        37.31%
> > > > > > > > > > windows-gpu          5191    4721        47.63%
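
(The rates above are just failed / (successful + failed); a quick check of a few rows against the raw counts:

```python
# Recompute failure rates from the raw build counts in the table above.
jobs = {
    "sanity":   (6900, 2729),
    "unix-cpu": (4268, 4786),
    "unix-gpu": (3686, 5637),
}

def failure_rate(successful, failed):
    """Failed builds as a percentage of all builds for the job."""
    return round(100 * failed / (successful + failed), 2)

for name, (ok, bad) in jobs.items():
    print(f"{name}: {failure_rate(ok, bad)}%")
# sanity: 28.34%, unix-cpu: 52.86%, unix-gpu: 60.46%
```

which matches the table.)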
> > > > > > > > > >
> > > > > > > > > > We can start by requiring only the sanity job to complete
> > > > > > > > > > before triggering the rest, and collect data to decide if
> > > > > > > > > > it makes sense to change it from there. Any objections to
> > > > > > > > > > this approach?
> > > > > > > > > >
> > > > > > > > > > Thanks.
> > > > > > > > > > Joe
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Mar 25, 2020 at 9:35 AM Marco de Abreu <marco.g.abreu@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Back then I created a system which exports all Jenkins
> > > > > > > > > > > results to CloudWatch. It does not include individual
> > > > > > > > > > > test results, but rather stages and jobs. The data for
> > > > > > > > > > > the sanity check should be available there.
> > > > > > > > > > >
> > > > > > > > > > > Something I'd also be curious about is the percentage
> > > > > > > > > > > of failures in one run. That is, if a commit failed,
> > > > > > > > > > > were there multiple jobs failing (indicating an error
> > > > > > > > > > > in the code) or only one or two (indicating flakiness)?
> > > > > > > > > > > This should give us a proper understanding of how
> > > > > > > > > > > unnecessary these runs really are.
> > > > > > > > > > >
> > > > > > > > > > > -Marco
> > > > > > > > > > >
> > > > > > > > > > > Aaron Markham <aaron.s.markham@gmail.com> wrote on Wed, 25 Mar 2020, 16:53:
> > > > > > > > > > >
> > > > > > > > > > > > +1 for sanity check - that's fast.
> > > > > > > > > > > > -1 for unix-cpu - that's slow and can just hang.
> > > > > > > > > > > >
> > > > > > > > > > > > So my suggestion would be to tease the data apart -
> > > > > > > > > > > > what's the failure rate on the sanity check and on
> > > > > > > > > > > > unix-cpu? Actually, can we get a table of all of the
> > > > > > > > > > > > tests with this data?!
> > > > > > > > > > > > If the sanity check fails... let's say 20% of the
> > > > > > > > > > > > time, but only takes a couple of minutes, then ya,
> > > > > > > > > > > > let's stack it and do that one first.
> > > > > > > > > > > > I think unix-cpu needs to be broken apart. It's too
> > > > > > > > > > > > complex and fails in multiple ways. Isolate the
> > > > > > > > > > > > brittle parts. Then we can restart/disable those as
> > > > > > > > > > > > needed, while all of the other parts pass and don't
> > > > > > > > > > > > have to be rerun.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Mar 25, 2020 at 1:32 AM Marco de Abreu <marco.g.abreu@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > We had this structure in the past, and the
> > > > > > > > > > > > > community was bothered by CI taking more time, so
> > > > > > > > > > > > > we moved to the current model with everything
> > > > > > > > > > > > > parallelized. We'd basically be reverting that,
> > > > > > > > > > > > > then.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Can you show by how much the duration will
> > > > > > > > > > > > > increase?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Also, we have zero test parallelisation - that is,
> > > > > > > > > > > > > we are running one test at a time on 72-core
> > > > > > > > > > > > > machines (although with multiple workers).
> > > > > > > > > > > > > Wouldn't it be way more efficient to add
> > > > > > > > > > > > > parallelisation and thus heavily reduce the time
> > > > > > > > > > > > > spent on the tasks instead of staggering?
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'm concerned that these measures to save cost are
> > > > > > > > > > > > > paid for in the form of a worse user experience. I
> > > > > > > > > > > > > see a big potential to save costs by increasing
> > > > > > > > > > > > > efficiency while actually improving the user
> > > > > > > > > > > > > experience, due to CI being faster.
> > > > > > > > > > > > >
> > > > > > > > > > > > > -Marco
> > > > > > > > > > > > >
> > > > > > > > > > > > > Joe Evans <joseph.evans@gmail.com> wrote on Wed, 25 Mar 2020, 04:58:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > First, I just wanted to introduce myself to the
> > > > > > > > > > > > > > MXNet community. I’m Joe and will be working with
> > > > > > > > > > > > > > Chai and the AWS team to improve some issues
> > > > > > > > > > > > > > around MXNet CI. One of our goals is to reduce
> > > > > > > > > > > > > > the costs associated with running MXNet CI. The
> > > > > > > > > > > > > > task I’m working on now is this issue:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/17802
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Proposal: Staggered Jenkins CI pipeline
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Based on data collected from Jenkins, around 55%
> > > > > > > > > > > > > > of the time when the mxnet-validation CI build is
> > > > > > > > > > > > > > triggered by a PR, either the sanity or unix-cpu
> > > > > > > > > > > > > > build fails. When either of these builds fails,
> > > > > > > > > > > > > > it doesn’t make sense to run the rest of the
> > > > > > > > > > > > > > pipelines and utilize all those resources if
> > > > > > > > > > > > > > we’ve already identified a build or unit test
> > > > > > > > > > > > > > failure.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We are proposing changing the MXNet Jenkins CI
> > > > > > > > > > > > > > pipeline by requiring the *sanity* and *unix-cpu*
> > > > > > > > > > > > > > builds to complete and pass tests successfully
> > > > > > > > > > > > > > before starting the other build pipelines
> > > > > > > > > > > > > > (centos-cpu/gpu, unix-gpu, windows-cpu/gpu,
> > > > > > > > > > > > > > etc.). Once the sanity builds successfully
> > > > > > > > > > > > > > complete, the remaining build pipelines will be
> > > > > > > > > > > > > > triggered and run in parallel (as they currently
> > > > > > > > > > > > > > do). The purpose of this change is to identify
> > > > > > > > > > > > > > faulty code or compatibility issues early and
> > > > > > > > > > > > > > prevent further execution of CI builds. This will
> > > > > > > > > > > > > > increase the time required to test a PR, but will
> > > > > > > > > > > > > > prevent unnecessary builds from running.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Does anyone have any concerns with this change or
> > > > > > > > > > > > > > suggestions?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Joe Evans
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > joseph.evans@gmail.com
> > > > >
> > > >
> > >
> >
>
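
For reference, the staggered gating proposed above could be sketched in a Jenkinsfile roughly like this. Stage and job names are illustrative, not the actual mxnet-validation definitions, and the sanity command is only an assumed shape of the existing `ci/build.py` invocation:

```groovy
// Illustrative only: gate the expensive pipelines on sanity first.
pipeline {
    agent any
    stages {
        stage('sanity') {
            steps {
                // Cheap lint/sanity checks run alone; a failure here
                // stops the pipeline before any heavy builds start.
                sh 'ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh sanity_check'
            }
        }
        stage('full builds') {
            // Only reached if 'sanity' passed; these still run in
            // parallel with each other, as they do today.
            parallel {
                stage('unix-cpu')    { steps { build job: 'mxnet-validation/unix-cpu' } }
                stage('unix-gpu')    { steps { build job: 'mxnet-validation/unix-gpu' } }
                stage('windows-cpu') { steps { build job: 'mxnet-validation/windows-cpu' } }
            }
        }
    }
}
```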
