mxnet-dev mailing list archives

From Aaron Markham <aaron.s.mark...@gmail.com>
Subject Re: CI and PRs
Date Thu, 15 Aug 2019 22:47:51 GMT
Many of the CI pipelines follow this pattern:
Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
repeat steps 1-3 over and over?

Now, some tests use a stashed binary and docker cache. And I see this work
locally, but for the most part, on CI, you're gonna sit through a
dependency install.

I noticed that almost all jobs use an ubuntu setup that is fully loaded.
Without cache, it can take 10 or more minutes to build.  So I made a lite
version. Takes only a few minutes instead.

In some cases archiving worked great to share across pipelines, but as
Marco mentioned we need a storage solution to make that happen. We can't
archive every intermediate artifact for each PR.
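
The "build once, reuse everywhere" idea above can be sketched roughly as
follows. This is a minimal illustration in Python, not what the Jenkins
pipelines actually do: the helper names `cache_key` and `get_or_build`, the
`libmxnet-<key>.so` file name, and the local cache directory are all
hypothetical stand-ins for whatever shared storage ends up being used.

```python
import hashlib
from pathlib import Path
from typing import Callable, Iterable


def cache_key(inputs: Iterable[Path]) -> str:
    """Hash the files that determine the build output (hypothetical helper)."""
    digest = hashlib.sha256()
    for path in sorted(inputs):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()[:16]


def get_or_build(inputs: Iterable[Path], cache_dir: Path,
                 build: Callable[[Path], None]) -> Path:
    """Return a cached artifact for these inputs, building only on a miss."""
    # The artifact name is an assumption made for this sketch.
    artifact = cache_dir / f"libmxnet-{cache_key(inputs)}.so"
    if not artifact.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        build(artifact)  # the expensive steps 1-3 happen only here
    return artifact
```

Every test job would call `get_or_build` with the same inputs and only the
first one would pay for the dependency install and build. The storage question
remains open: the cache directory has to live somewhere all workers can reach.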

On Thu, Aug 15, 2019, 13:47 Pedro Larroy <pedro.larroy.lists@gmail.com>
wrote:

> Hi Aaron. Why does that speed things up? What's the difference?
>
> Pedro.
>
> On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <aaron.s.markham@gmail.com>
> wrote:
>
> > The PRs Thomas and I are working on for the new docs and website share the
> > mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> >
> > On Wed, Aug 14, 2019, 18:16 Chris Olivier <cjolivier01@gmail.com> wrote:
> >
> > > I see it done daily now, and while I can’t share all the details, it’s not
> > > an incredibly complex thing, and involves not much more than nfs/efs
> > > sharing and remote ssh commands.  All it takes is a little ingenuity and
> > > some imagination.
> > >
> > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > wrote:
> > >
> > > > Sounds good in theory. I think there are complex details with regard to
> > > > resource sharing during parallel execution. Still, I think both ways can
> > > > be explored. I think some tests run for unreasonably long times for what
> > > > they are doing. We already scale parts of the pipeline horizontally
> > > > across workers.
> > > >
> > > >
> > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cjolivier01@apache.org>
> > > > wrote:
> > > >
> > > > > +1
> > > > >
> > > > > Rather than remove tests (which doesn’t scale as a solution), why not
> > > > > scale them horizontally so that they finish more quickly? Across
> > > > > processes or even on a pool of machines that aren’t necessarily the
> > > > > build machine?
> > > > >
> > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <marco.g.abreu@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > With regard to time, I would rather we spend a bit more time on
> > > > > > maintenance than have somebody run into an error that could've been
> > > > > > caught with a test.
> > > > > >
> > > > > > I mean, our Publishing pipeline for Scala GPU has been broken for
> > > > > > quite some time now, but nobody noticed that. Basically my stance on
> > > > > > that matter is that as soon as something is not blocking, you can
> > > > > > also just deactivate it since you don't have a forcing function in an
> > > > > > open source project. People will rarely come back and fix the errors
> > > > > > of some nightly test that they introduced.
> > > > > >
> > > > > > -Marco
> > > > > >
> > > > > > Carin Meier <carinmeier@gmail.com> wrote on Wed, Aug 14, 2019, 21:59:
> > > > > >
> > > > > > > If a language binding test is failing for an unimportant reason,
> > > > > > > then it is too brittle and needs to be fixed (we have fixed some of
> > > > > > > these with the Clojure package [1]).
> > > > > > > But in general, if we think of the MXNet project as one project
> > > > > > > that spans all the language bindings, then we want to know if some
> > > > > > > fundamental code change is going to break a downstream package.
> > > > > > > I can't speak for all the high level package binding maintainers,
> > > > > > > but I'm always happy to pitch in to provide code fixes to help the
> > > > > > > base PR get green.
> > > > > > >
> > > > > > > The time costs of maintaining such a large CI project obviously
> > > > > > > need to be considered as well.
> > > > > > >
> > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > From what I have seen, Clojure takes 15 minutes, which I think is
> > > > > > > > reasonable. The only issue is that when a binding such as R, Perl
> > > > > > > > or Clojure fails, some devs are a bit confused about how to fix
> > > > > > > > it since they are not familiar with the testing tools and the
> > > > > > > > language.
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <carinmeier@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Great idea Marco! Anything that you think would be valuable to
> > > > > > > > > share would be good. The duration of each node in the test
> > > > > > > > > stage sounds like a good start.
> > > > > > > > >
> > > > > > > > > - Carin
> > > > > > > > >
> > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <marco.g.abreu@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > we record a bunch of metrics about run statistics (down to
> > > > > > > > > > the duration of every individual step). If you tell me which
> > > > > > > > > > ones you're particularly interested in (probably total
> > > > > > > > > > duration of each node in the test stage), I'm happy to
> > > > > > > > > > provide them.
> > > > > > > > > >
> > > > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > > > - job
> > > > > > > > > > - branch
> > > > > > > > > > - stage
> > > > > > > > > > - node
> > > > > > > > > > - step
> > > > > > > > > >
> > > > > > > > > > Unfortunately, I can't export them since we store them in
> > > > > > > > > > CloudWatch Metrics, which afaik doesn't offer raw exports.
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Marco
> > > > > > > > > >
> > > > > > > > > > Carin Meier <carinmeier@gmail.com> wrote on Wed, Aug 14, 2019, 19:43:
> > > > > > > > > >
> > > > > > > > > > > I would prefer to keep the language binding in the PR
> > > > > > > > > > > process. Perhaps we could do some analytics to see how much
> > > > > > > > > > > each of the language bindings is contributing to overall
> > > > > > > > > > > run time.
> > > > > > > > > > > If we have some metrics on that, maybe we can come up with
> > > > > > > > > > > a guideline of how much time each should take. Another
> > > > > > > > > > > possibility is to leverage the parallel builds more.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Carin.
> > > > > > > > > > > >
> > > > > > > > > > > > That's a good point. All things considered, would your
> > > > > > > > > > > > preference be to keep the Clojure tests as part of the PR
> > > > > > > > > > > > process or in Nightly?
> > > > > > > > > > > > Some options are having notifications here or in Slack.
> > > > > > > > > > > > But if we think breakages would go unnoticed, maybe it is
> > > > > > > > > > > > not a good idea to fully remove bindings from the PR
> > > > > > > > > > > > process, and we should instead just streamline the
> > > > > > > > > > > > process.
> > > > > > > > > > > >
> > > > > > > > > > > > Pedro.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <carinmeier@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Before any binding tests are moved to nightly, I think
> > > > > > > > > > > > > we need to figure out how the community can get proper
> > > > > > > > > > > > > notifications of failure and success on those nightly
> > > > > > > > > > > > > runs. Otherwise, I think that breakages would go
> > > > > > > > > > > > > unnoticed.
> > > > > > > > > > > > >
> > > > > > > > > > > > > -Carin
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Seems we are hitting some problems in CI. I propose
> > > > > > > > > > > > > > the following action items to remedy the situation
> > > > > > > > > > > > > > and accelerate turnaround times in CI, reduce cost,
> > > > > > > > > > > > > > complexity and probability of failure blocking PRs
> > > > > > > > > > > > > > and frustrating developers:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * Upgrade Windows Visual Studio from VS 2015 to VS
> > > > > > > > > > > > > > 2017. The build_windows.py infrastructure should
> > > > > > > > > > > > > > easily work with the new version. Currently some PRs
> > > > > > > > > > > > > > are blocked by this:
> > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > > > > * Move non-python bindings tests to nightly. If a
> > > > > > > > > > > > > > commit is touching other bindings, the reviewer
> > > > > > > > > > > > > > should ask for a full run which can be done locally,
> > > > > > > > > > > > > > use the label bot to trigger a full CI build, or
> > > > > > > > > > > > > > defer to nightly.
> > > > > > > > > > > > > > * Provide a couple of basic sanity performance tests
> > > > > > > > > > > > > > on small models that are run on CI and can be echoed
> > > > > > > > > > > > > > by the label bot as a comment for PRs.
> > > > > > > > > > > > > > * Address unit tests that take more than 10-20s,
> > > > > > > > > > > > > > streamline them or move them to nightly if it can't
> > > > > > > > > > > > > > be done.
> > > > > > > > > > > > > > * Open source the remaining CI infrastructure scripts
> > > > > > > > > > > > > > so the community can contribute.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think our goal should be turnaround under 30min.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I would also like to touch base with the community
> > > > > > > > > > > > > > about the fact that some PRs are not being followed
> > > > > > > > > > > > > > up by committers asking for changes. For example,
> > > > > > > > > > > > > > this PR is important and has been hanging for a long
> > > > > > > > > > > > > > time.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This is another, less important but more trivial to
> > > > > > > > > > > > > > review:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think committers requesting changes and not
> > > > > > > > > > > > > > following up in reasonable time is not healthy for
> > > > > > > > > > > > > > the project. I suggest configuring GitHub
> > > > > > > > > > > > > > notifications for a good SNR and following up.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Regards.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
