mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@gmail.com>
Subject Re: CI and PRs
Date Thu, 15 Aug 2019 17:57:54 GMT
No worries, auto scaling is taking care of that :)

-Marco

Sheng Zha <szha.pvg@gmail.com> wrote on Thu, Aug 15, 2019, 19:50:

> The AWS Batch approach should also help with hardware utilization as
> machines are launched only when needed :)
>
> -sz
>
> > On Aug 15, 2019, at 9:11 AM, Marco de Abreu <marco.g.abreu@gmail.com> wrote:
> >
> > Thanks Leonard. Naively dividing by test files would certainly be an easy
> > and doable way before going into proper nose parallelization. Great idea!
> >
> > Scalability in terms of nodes is not an issue. Our system can handle at
> > least 600 slaves (didn't want to go higher for obvious reasons). But I
> > think we don't even have to go that far because most of the time, our
> > machines are heavily underutilized due to the single-threaded nature of
> > most tests. Thus, parallel test execution on the same machine would
> > already speed up the process considerably.
> >
> > -Marco
> >
> > P.S. The structure of the Jenkinsfiles seems pretty familiar :P I am glad
> > my approach is considered helpful :)
> >
> > Leonard Lausen <l-software@lausen.nl> wrote on Thu, Aug 15, 2019, 18:59:
> >
> >> To parallelize across machines: For GluonNLP we started submitting test
> >> jobs to AWS Batch. Just adding a for-loop over the units in the
> >> Jenkinsfile [1] and submitting a job for each [2] works quite well. Then
> >> Jenkins just waits for all jobs to finish and retrieves their status.
> >> This works since AWS Batch added GPU support this April [3].
> >>
> >> For MXNet, naively parallelizing over the files defining the test cases
> >> that are in the longest running Pipeline stage may already help?
> >>
> >> [1]: https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py3-master_gpu_doc#L53
> >> [2]: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/submit-job.py
> >> [3]: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/
> >>
> >> Marco de Abreu <marco.g.abreu@gmail.com> writes:
> >>
> >>> The first step wrt parallelization could certainly be to start adding
> >>> parallel test execution in nosetests.
> >>>
> >>> -Marco
> >>>
> >>> Aaron Markham <aaron.s.markham@gmail.com> wrote on Thu, Aug 15, 2019, 05:39:
> >>>
> >>>> The PRs Thomas and I are working on for the new docs and website share
> >>>> the mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> >>>>
> >>>> On Wed, Aug 14, 2019, 18:16 Chris Olivier <cjolivier01@gmail.com> wrote:
> >>>>
> >>>>> I see it done daily now, and while I can’t share all the details, it’s
> >>>>> not an incredibly complex thing, and involves not much more than nfs/efs
> >>>>> sharing and remote ssh commands. All it takes is a little ingenuity and
> >>>>> some imagination.
> >>>>>
> >>>>> On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> >>>>>
> >>>>>> Sounds good in theory. I think there are complex details with regard to
> >>>>>> resource sharing during parallel execution. Still, I think both ways can
> >>>>>> be explored. I think some tests run for unreasonably long times for what
> >>>>>> they are doing. We already scale parts of the pipeline horizontally
> >>>>>> across workers.
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cjolivier01@apache.org> wrote:
> >>>>>>
> >>>>>>> +1
> >>>>>>>
> >>>>>>> Rather than remove tests (which doesn’t scale as a solution), why not
> >>>>>>> scale them horizontally so that they finish more quickly? Across
> >>>>>>> processes or even on a pool of machines that aren’t necessarily the
> >>>>>>> build machine?
> >>>>>>>
> >>>>>>> On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <marco.g.abreu@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> With regard to time, I would rather have us spend a bit more time on
> >>>>>>>> maintenance than have somebody run into an error that could've been
> >>>>>>>> caught with a test.
> >>>>>>>>
> >>>>>>>> I mean, our Publishing pipeline for Scala GPU has been broken for
> >>>>>>>> quite some time now, but nobody noticed that. Basically my stance on
> >>>>>>>> that matter is that as soon as something is not blocking, you can also
> >>>>>>>> just deactivate it since you don't have a forcing function in an open
> >>>>>>>> source project. People will rarely come back and fix the errors of
> >>>>>>>> some nightly test that they introduced.
> >>>>>>>>
> >>>>>>>> -Marco
> >>>>>>>>
> >>>>>>>> Carin Meier <carinmeier@gmail.com> wrote on Wed, Aug 14, 2019, 21:59:
> >>>>>>>>
> >>>>>>>>> If a language binding test is failing for an unimportant reason, then
> >>>>>>>>> it is too brittle and needs to be fixed (we have fixed some of these
> >>>>>>>>> with the Clojure package [1]).
> >>>>>>>>> But in general, if we are thinking of the MXNet project as one project
> >>>>>>>>> that spans all the language bindings, then we want to know if some
> >>>>>>>>> fundamental code change is going to break a downstream package.
> >>>>>>>>> I can't speak for all the high level package binding maintainers, but
> >>>>>>>>> I'm always happy to pitch in to provide code fixes to help the base PR
> >>>>>>>>> get green.
> >>>>>>>>>
> >>>>>>>>> The time costs to maintain such a large CI project obviously need to
> >>>>>>>>> be considered as well.
> >>>>>>>>>
> >>>>>>>>> [1] https://github.com/apache/incubator-mxnet/pull/15579
> >>>>>>>>>
> >>>>>>>>> On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> From what I have seen Clojure is 15 minutes, which I think is
> >>>>>>>>>> reasonable. The only question is that when a binding such as R, Perl
> >>>>>>>>>> or Clojure fails, some devs are a bit confused about how to fix them
> >>>>>>>>>> since they are not familiar with the testing tools and the language.
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <carinmeier@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Great idea Marco! Anything that you think would be valuable to share
> >>>>>>>>>>> would be good. The duration of each node in the test stage sounds
> >>>>>>>>>>> like a good start.
> >>>>>>>>>>>
> >>>>>>>>>>> - Carin
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <marco.g.abreu@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> we record a bunch of metrics about run statistics (down to the
> >>>>>>>>>>>> duration of every individual step). If you tell me which ones you're
> >>>>>>>>>>>> particularly interested in (probably total duration of each node in
> >>>>>>>>>>>> the test stage), I'm happy to provide them.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Dimensions are (in hierarchical order):
> >>>>>>>>>>>> - job
> >>>>>>>>>>>> - branch
> >>>>>>>>>>>> - stage
> >>>>>>>>>>>> - node
> >>>>>>>>>>>> - step
> >>>>>>>>>>>>
> >>>>>>>>>>>> Unfortunately I don't have the possibility to export them since we
> >>>>>>>>>>>> store them in CloudWatch Metrics which afaik doesn't offer raw
> >>>>>>>>>>>> exports.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best regards,
> >>>>>>>>>>>> Marco
> >>>>>>>>>>>>
> >>>>>>>>>>>> Carin Meier <carinmeier@gmail.com> wrote on Wed, Aug 14, 2019, 19:43:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I would prefer to keep the language binding in the PR process.
> >>>>>>>>>>>>> Perhaps we could do some analytics to see how much each of the
> >>>>>>>>>>>>> language bindings is contributing to overall run time.
> >>>>>>>>>>>>> If we have some metrics on that, maybe we can come up with a
> >>>>>>>>>>>>> guideline of how much time each should take. Another possibility is
> >>>>>>>>>>>>> to leverage the parallel builds more.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Carin.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> That's a good point. All things considered, would your preference
> >>>>>>>>>>>>>> be to keep the Clojure tests as part of the PR process or in
> >>>>>>>>>>>>>> Nightly?
> >>>>>>>>>>>>>> Some options are having notifications here or in Slack. But if we
> >>>>>>>>>>>>>> think breakages would go unnoticed, maybe it is not a good idea to
> >>>>>>>>>>>>>> fully remove bindings from the PR process, and we should just
> >>>>>>>>>>>>>> streamline the process.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Pedro.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <carinmeier@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Before any binding tests are moved to nightly, I think we need to
> >>>>>>>>>>>>>>> figure out how the community can get proper notifications of
> >>>>>>>>>>>>>>> failure and success on those nightly runs. Otherwise, I think that
> >>>>>>>>>>>>>>> breakages would go unnoticed.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -Carin
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Seems we are hitting some problems in CI. I propose the following
> >>>>>>>>>>>>>>>> action items to remedy the situation and accelerate turnaround
> >>>>>>>>>>>>>>>> times in CI, reduce cost, complexity and probability of failure
> >>>>>>>>>>>>>>>> blocking PRs and frustrating developers:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> * Upgrade Windows Visual Studio from VS 2015 to VS 2017. The
> >>>>>>>>>>>>>>>> build_windows.py infrastructure should easily work with the new
> >>>>>>>>>>>>>>>> version. Currently some PRs are blocked by this:
> >>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/issues/13958
> >>>>>>>>>>>>>>>> * Move Gluon Model zoo tests to nightly. Tracked at
> >>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/issues/15295
> >>>>>>>>>>>>>>>> * Move non-python bindings tests to nightly. If a commit is
> >>>>>>>>>>>>>>>> touching other bindings, the reviewer should ask for a full run
> >>>>>>>>>>>>>>>> which can be done locally, use the label bot to trigger a full CI
> >>>>>>>>>>>>>>>> build, or defer to nightly.
> >>>>>>>>>>>>>>>> * Provide a couple of basic sanity performance tests on small
> >>>>>>>>>>>>>>>> models that are run on CI and can be echoed by the label bot as a
> >>>>>>>>>>>>>>>> comment for PRs.
> >>>>>>>>>>>>>>>> * Address unit tests that take more than 10-20s, streamline them
> >>>>>>>>>>>>>>>> or move them to nightly if it can't be done.
> >>>>>>>>>>>>>>>> * Open source the remaining CI infrastructure scripts so the
> >>>>>>>>>>>>>>>> community can contribute.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I think our goal should be turnaround under 30min.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I would also like to touch base with the community that some PRs
> >>>>>>>>>>>>>>>> are not being followed up by committers asking for changes. For
> >>>>>>>>>>>>>>>> example this PR is important and is hanging for a long time.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/pull/15051
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This is another, less important but more trivial to review:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/pull/14940
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I think committers requesting changes and not following up in
> >>>>>>>>>>>>>>>> reasonable time is not healthy for the project. I suggest
> >>>>>>>>>>>>>>>> configuring GitHub Notifications for a good SNR and following up.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Regards.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Pedro.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
>
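
The naive split over test files suggested in the thread could look roughly
like the sketch below. It is only an illustration under stated assumptions:
the function names, the file names, and the per-file durations are
hypothetical and are not taken from the MXNet or GluonNLP CI code. The first
helper divides files round-robin; the second assumes per-file durations are
available (for example from recorded run metrics) and balances load greedily.

```python
from typing import Dict, List


def shard_test_files(test_files: List[str], num_workers: int) -> List[List[str]]:
    """Split test files into num_workers groups, round-robin over the sorted list.

    Sorting first keeps the assignment deterministic between runs.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    ordered = sorted(test_files)
    return [ordered[i::num_workers] for i in range(num_workers)]


def shard_by_duration(durations: Dict[str, float], num_workers: int) -> List[List[str]]:
    """Greedy balancing: longest files first, each to the least-loaded worker.

    `durations` maps a test file name to its (assumed, previously measured)
    run time in seconds.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    shards: List[List[str]] = [[] for _ in range(num_workers)]
    loads = [0.0] * num_workers
    # Sort by descending duration; break ties by name for determinism.
    for name, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        target = loads.index(min(loads))
        shards[target].append(name)
        loads[target] += seconds
    return shards


if __name__ == "__main__":
    # Hypothetical file names, used only to show the call shape.
    files = ["test_operator.py", "test_gluon.py", "test_ndarray.py"]
    for index, shard in enumerate(shard_test_files(files, 2)):
        print(index, shard)
```

Each resulting group would then be handed to one test-runner process or one
submitted job. Whether round-robin is good enough or duration-based balancing
is worth the extra bookkeeping depends on how uneven the test files are.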
