mxnet-dev mailing list archives

From Pedro Larroy <pedro.larroy.li...@gmail.com>
Subject Re: CI and PRs
Date Fri, 16 Aug 2019 18:18:21 GMT
Hi Aaron.

As Marco explained, if you are on master the cache usually works. There
are two issues that I have observed:

1 - Docker doesn't automatically pull the base image (e.g. ubuntu:16.04), so
if the cached base used in the FROM statement becomes outdated, your
caching won't work. Running docker pull ubuntu:16.04, or pulling the base
images that the container uses, helps with this.

2 - There's another situation where the above doesn't help which seems to
be an unidentified issue with the docker cache:
https://github.com/docker/docker.github.io/issues/8886

We can get a short-term workaround for #1 by explicitly pulling bases from
the script, but I think docker should do it when using --cache-from, so
maybe contributing a patch to docker would be the best approach.
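
A minimal sketch of the workaround for #1, assuming the build script is
written in Python and shells out to the docker CLI. The function names
(base_images, pull_commands, refresh_and_build) are hypothetical and are not
taken from the existing CI scripts; only `docker pull` and
`docker build --cache-from` are real commands. It reads the FROM lines of a
Dockerfile, skips `scratch`, earlier build stages and unresolved ARG
references, and pulls each remaining base before building:

```python
import re
import subprocess
from typing import List

# Matches "FROM [--platform=...] image [AS stage]" lines in a Dockerfile.
_FROM_RE = re.compile(
    r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)(?:\s+AS\s+(\S+))?\s*$",
    re.IGNORECASE,
)


def base_images(dockerfile_text: str) -> List[str]:
    """Return the external base images named in FROM lines, in order."""
    images: List[str] = []
    stages = set()  # names of build stages declared so far
    for line in dockerfile_text.splitlines():
        match = _FROM_RE.match(line)
        if match is None:
            continue
        image, alias = match.group(1), match.group(2)
        is_stage = image.lower() in stages
        if alias:
            stages.add(alias.lower())
        # Skip things that cannot be pulled: earlier stages, "scratch",
        # and images that depend on an unresolved ARG such as ${BASE}.
        if is_stage or image.lower() == "scratch" or "$" in image:
            continue
        if image not in images:
            images.append(image)
    return images


def pull_commands(images: List[str]) -> List[List[str]]:
    """Build one `docker pull` argv per base image."""
    return [["docker", "pull", image] for image in images]


def refresh_and_build(dockerfile: str, context: str, tag: str) -> None:
    """Pull the bases explicitly, then build using `tag` as a cache source."""
    with open(dockerfile) as f:
        text = f.read()
    for cmd in pull_commands(base_images(text)):
        subprocess.run(cmd, check=True)
    subprocess.run(
        ["docker", "build", "--cache-from", tag, "-t", tag,
         "-f", dockerfile, context],
        check=True,
    )
```

This only addresses the outdated-base case; it would not help with the
second issue below.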

Pedro

On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <aaron.s.markham@gmail.com>
wrote:

> When you create a new Dockerfile and use that on CI, it doesn't seem
> to cache some of the steps... like this:
>
> Step 13/15 : RUN /work/ubuntu_docs.sh
>  ---> Running in a1e522f3283b
>  + echo 'Installing dependencies...'
> + apt-get update
>  Installing dependencies.
>
> Or this....
>
> Step 4/13 : RUN /work/ubuntu_core.sh
>  ---> Running in e7882d7aa750
>  + apt-get update
>
> I get if I was changing those scripts, but then I'd think it should
> cache after running it once... but, no.
>
>
> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <marco.g.abreu@gmail.com>
> wrote:
> >
> > Do I understand it correctly that you are saying that the Docker cache
> > doesn't work properly and regularly reinstalls dependencies? Or do you
> > mean that you only have cache misses when you modify the dependencies -
> > which would be expected?
> >
> > -Marco
> >
> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <
> aaron.s.markham@gmail.com>
> > wrote:
> >
> > > > Many of the CI pipelines follow this pattern:
> > > > Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
> > > > repeat steps 1-3 over and over?
> > > >
> > > > Now, some tests use a stashed binary and docker cache. And I see this
> > > > work locally, but for the most part, on CI, you're gonna sit through a
> > > > dependency install.
> > > >
> > > > I noticed that almost all jobs use an ubuntu setup that is fully
> > > > loaded. Without cache, it can take 10 or more minutes to build. So I
> > > > made a lite version. Takes only a few minutes instead.
> > > >
> > > > In some cases archiving worked great to share across pipelines, but as
> > > > Marco mentioned we need a storage solution to make that happen. We
> > > > can't archive every intermediate artifact for each PR.
> > >
> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <pedro.larroy.lists@gmail.com
> >
> > > wrote:
> > >
> > > > > Hi Aaron. Why does that speed things up? What's the difference?
> > > >
> > > > Pedro.
> > > >
> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <
> aaron.s.markham@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > > The PRs Thomas and I are working on for the new docs and website
> > > > > > share the mxnet binary in the new CI pipelines we made. Speeds
> > > > > > things up a lot.
> > > > >
> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <cjolivier01@gmail.com>
> > > wrote:
> > > > >
> > > > > > > I see it done daily now, and while I can’t share all the details,
> > > > > > > it’s not an incredibly complex thing, and involves not much more
> > > > > > > than nfs/efs sharing and remote ssh commands. All it takes is a
> > > > > > > little ingenuity and some imagination.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> > > > > pedro.larroy.lists@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > > Sounds good in theory. I think there are complex details with
> > > > > > > > regard to resource sharing during parallel execution. Still I
> > > > > > > > think both ways can be explored. I think some tests run for
> > > > > > > > unreasonably long times for what they are doing. We already
> > > > > > > > scale parts of the pipeline horizontally across workers.
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> > > > cjolivier01@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > > Rather than remove tests (which doesn’t scale as a solution),
> > > > > > > > > why not scale them horizontally so that they finish more
> > > > > > > > > quickly? Across processes or even on a pool of machines that
> > > > > > > > > aren’t necessarily the build machine?
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > > > > > marco.g.abreu@gmail.com
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > > With regards to time I rather prefer us spending a bit more
> > > > > > > > > > time on maintenance than somebody running into an error that
> > > > > > > > > > could've been caught with a test.
> > > > > > > > > >
> > > > > > > > > > I mean, our Publishing pipeline for Scala GPU has been broken
> > > > > > > > > > for quite some time now, but nobody noticed that. Basically my
> > > > > > > > > > stance on that matter is that as soon as something is not
> > > > > > > > > > blocking, you can also just deactivate it since you don't have
> > > > > > > > > > a forcing function in an open source project. People will
> > > > > > > > > > rarely come back and fix the errors of some nightly test that
> > > > > > > > > > they introduced.
> > > > > > > > >
> > > > > > > > > -Marco
> > > > > > > > >
> > > > > > > > > Carin Meier <carinmeier@gmail.com> schrieb
am Mi., 14.
> Aug.
> > > > 2019,
> > > > > > > 21:59:
> > > > > > > > >
> > > > > > > > > > > If a language binding test is failing for an unimportant
> > > > > > > > > > > reason, then it is too brittle and needs to be fixed (we have
> > > > > > > > > > > fixed some of these with the Clojure package [1]).
> > > > > > > > > > > But in general, if we are thinking of the MXNet project as one
> > > > > > > > > > > project that is across all the language bindings, then we want
> > > > > > > > > > > to know if some fundamental code change is going to break a
> > > > > > > > > > > downstream package.
> > > > > > > > > > > I can't speak for all the high level package binding
> > > > > > > > > > > maintainers, but I'm always happy to pitch in to provide code
> > > > > > > > > > > fixes to help the base PR get green.
> > > > > > > > > > >
> > > > > > > > > > > The time costs to maintain such a large CI project obviously
> > > > > > > > > > > need to be considered as well.
> > > > > > > > > >
> > > > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy
<
> > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > > From what I have seen Clojure is 15 minutes, which I think is
> > > > > > > > > > > > reasonable. The only question is that when a binding such as
> > > > > > > > > > > > R, Perl or Clojure fails, some devs are a bit confused about
> > > > > > > > > > > > how to fix them since they are not familiar with the testing
> > > > > > > > > > > > tools and the language.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin
Meier <
> > > > > > carinmeier@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > > Great idea Marco! Anything that you think would be valuable
> > > > > > > > > > > > > to share would be good. The duration of each node in the
> > > > > > > > > > > > > test stage sounds like a good start.
> > > > > > > > > > > >
> > > > > > > > > > > > - Carin
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM
Marco de Abreu <
> > > > > > > > > > marco.g.abreu@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > > we record a bunch of metrics about run statistics (down to
> > > > > > > > > > > > > > the duration of every individual step). If you tell me which
> > > > > > > > > > > > > > ones you're particularly interested in (probably total
> > > > > > > > > > > > > > duration of each node in the test stage), I'm happy to
> > > > > > > > > > > > > > provide them.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > > > > > > > - job
> > > > > > > > > > > > > > - branch
> > > > > > > > > > > > > > - stage
> > > > > > > > > > > > > > - node
> > > > > > > > > > > > > > - step
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Unfortunately I don't have the possibility to export them
> > > > > > > > > > > > > > since we store them in CloudWatch Metrics which afaik
> > > > > > > > > > > > > > doesn't offer raw exports.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > Marco
> > > > > > > > > > > > >
> > > > > > > > > > > > > Carin Meier <carinmeier@gmail.com>
schrieb am
> Mi., 14.
> > > > > Aug.
> > > > > > > > 2019,
> > > > > > > > > > > 19:43:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > I would prefer to keep the language binding in the PR
> > > > > > > > > > > > > > > process. Perhaps we could do some analytics to see how much
> > > > > > > > > > > > > > > each of the language bindings is contributing to overall
> > > > > > > > > > > > > > > run time.
> > > > > > > > > > > > > > > If we have some metrics on that, maybe we can come up with
> > > > > > > > > > > > > > > a guideline of how much time each should take. Another
> > > > > > > > > > > > > > > possibility is to leverage the parallel builds more.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Aug 14, 2019
at 1:30 PM Pedro Larroy <
> > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Carin.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > That's a good point. All things considered, would your
> > > > > > > > > > > > > > > > preference be to keep the Clojure tests as part of the PR
> > > > > > > > > > > > > > > > process or in Nightly?
> > > > > > > > > > > > > > > > Some options are having notifications here or in slack. But
> > > > > > > > > > > > > > > > if we think breakages would go unnoticed, maybe it is not a
> > > > > > > > > > > > > > > > good idea to fully remove bindings from the PR process and
> > > > > > > > > > > > > > > > just streamline the process.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Aug 14,
2019 at 5:09 AM Carin Meier <
> > > > > > > > > > carinmeier@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Before any binding tests are moved to nightly, I think we
> > > > > > > > > > > > > > > > > need to figure out how the community can get proper
> > > > > > > > > > > > > > > > > notifications of failure and success on those nightly runs.
> > > > > > > > > > > > > > > > > Otherwise, I think that breakages would go unnoticed.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -Carin
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Aug
13, 2019 at 7:47 PM Pedro Larroy
> <
> > > > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Seems we are hitting some problems in CI. I propose the
> > > > > > > > > > > > > > > > > > following action items to remedy the situation and
> > > > > > > > > > > > > > > > > > accelerate turnaround times in CI, reduce cost, complexity
> > > > > > > > > > > > > > > > > > and probability of failure blocking PRs and frustrating
> > > > > > > > > > > > > > > > > > developers:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > * Upgrade Windows Visual Studio from VS 2015 to VS 2017. The
> > > > > > > > > > > > > > > > > >   build_windows.py infrastructure should easily work with the
> > > > > > > > > > > > > > > > > >   new version. Currently some PRs are blocked by this:
> > > > > > > > > > > > > > > > > >   https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > > > > > > > > > > > > >   https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > > > > > > > > * Move non-python bindings tests to nightly. If a commit is
> > > > > > > > > > > > > > > > > >   touching other bindings, the reviewer should ask for a full
> > > > > > > > > > > > > > > > > >   run which can be done locally, use the label bot to trigger
> > > > > > > > > > > > > > > > > >   a full CI build, or defer to nightly.
> > > > > > > > > > > > > > > > > > * Provide a couple of basic sanity performance tests on small
> > > > > > > > > > > > > > > > > >   models that are run on CI and can be echoed by the label bot
> > > > > > > > > > > > > > > > > >   as a comment for PRs.
> > > > > > > > > > > > > > > > > > * Address unit tests that take more than 10-20s, streamline
> > > > > > > > > > > > > > > > > >   them or move them to nightly if it can't be done.
> > > > > > > > > > > > > > > > > > * Open source the remaining CI infrastructure scripts so the
> > > > > > > > > > > > > > > > > >   community can contribute.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I think our goal should be turnaround under 30min.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I would also like to touch base with the community that some
> > > > > > > > > > > > > > > > > > PRs are not being followed up by committers asking for
> > > > > > > > > > > > > > > > > > changes. For example this PR is important and has been
> > > > > > > > > > > > > > > > > > hanging for a long time.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > This is another, less important but more trivial to review:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I think committers requesting changes and not following up in
> > > > > > > > > > > > > > > > > > reasonable time is not healthy for the project. I suggest
> > > > > > > > > > > > > > > > > > configuring GitHub notifications for a good SNR and following
> > > > > > > > > > > > > > > > > > up.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Regards.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>
