mxnet-dev mailing list archives

From "Zhao, Patric" <patric.z...@intel.com>
Subject new website (RE: CI and PRs)
Date Thu, 15 Aug 2019 05:02:55 GMT
Hi Aaron,

We have recently been working on improving the CPU backend documentation on the current website.

I saw that there are several PRs updating the new website, which is really great.

So I'd like to know when the new website will go online.
If it's very soon, we will switch our work to the new website.

Thanks,

--Patric


> -----Original Message-----
> From: Aaron Markham <aaron.s.markham@gmail.com>
> Sent: Thursday, August 15, 2019 11:40 AM
> To: dev@mxnet.incubator.apache.org
> Subject: Re: CI and PRs
> 
> The PRs Thomas and I are working on for the new docs and website share
> the mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> 
> On Wed, Aug 14, 2019, 18:16 Chris Olivier <cjolivier01@gmail.com> wrote:
> 
> > I see it done daily now, and while I can’t share all the details, it’s
> > not an incredibly complex thing, and involves not much more than
> > nfs/efs sharing and remote ssh commands.  All it takes is a little
> > ingenuity and some imagination.
> >
> > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> >
> > > Sounds good in theory. I think there are complex details with regard to
> > > resource sharing during parallel execution. Still, I think both ways can
> > > be explored. I think some tests run for unreasonably long times for what
> > > they are doing. We already scale parts of the pipeline horizontally
> > > across workers.
> > >
> > >
> > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cjolivier01@apache.org> wrote:
> > >
> > > > +1
> > > >
> > > > Rather than remove tests (which doesn’t scale as a solution), why not
> > > > scale them horizontally so that they finish more quickly? Across
> > > > processes or even on a pool of machines that aren’t necessarily the
> > > > build machine?
> > > >
> > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <marco.g.abreu@gmail.com> wrote:
> > > >
> > > > > With regard to time, I would rather we spend a bit more time on
> > > > > maintenance than have somebody run into an error that could've been
> > > > > caught with a test.
> > > > >
> > > > > I mean, our publishing pipeline for Scala GPU has been broken for quite
> > > > > some time now, but nobody noticed. Basically, my stance on that matter
> > > > > is that as soon as something is not blocking, you can also just
> > > > > deactivate it, since you don't have a forcing function in an open
> > > > > source project. People will rarely come back and fix the errors of some
> > > > > nightly test that they introduced.
> > > > >
> > > > > -Marco
> > > > >
> > > > > Carin Meier <carinmeier@gmail.com> wrote on Wed, Aug 14, 2019, 21:59:
> > > > >
> > > > > > If a language binding test is failing for an unimportant reason, then
> > > > > > it is too brittle and needs to be fixed (we have fixed some of these
> > > > > > with the Clojure package [1]).
> > > > > > But in general, if we think of the MXNet project as one project that
> > > > > > spans all the language bindings, then we want to know if some
> > > > > > fundamental code change is going to break a downstream package.
> > > > > > I can't speak for all the high-level package binding maintainers, but
> > > > > > I'm always happy to pitch in and provide code fixes to help the base PR
> > > > > > get green.
> > > > > >
> > > > > > The time cost of maintaining such a large CI project obviously needs to
> > > > > > be considered as well.
> > > > > >
> > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > > > >
> > > > > > > From what I have seen, Clojure takes 15 minutes, which I think is
> > > > > > > reasonable. The only issue is that when a binding such as R, Perl or
> > > > > > > Clojure fails, some devs are a bit confused about how to fix it
> > > > > > > since they are not familiar with the testing tools and the language.
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <carinmeier@gmail.com> wrote:
> > > > > > >
> > > > > > > > Great idea, Marco! Anything that you think would be valuable to share
> > > > > > > > would be good. The duration of each node in the test stage sounds like
> > > > > > > > a good start.
> > > > > > > >
> > > > > > > > - Carin
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <marco.g.abreu@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > we record a bunch of metrics about run statistics (down to the
> > > > > > > > > duration of every individual step). If you tell me which ones you're
> > > > > > > > > particularly interested in (probably total duration of each node in
> > > > > > > > > the test stage), I'm happy to provide them.
> > > > > > > > >
> > > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > > - job
> > > > > > > > > - branch
> > > > > > > > > - stage
> > > > > > > > > - node
> > > > > > > > > - step
> > > > > > > > >
> > > > > > > > > Unfortunately, I am not able to export them since we store them in
> > > > > > > > > CloudWatch Metrics, which afaik doesn't offer raw exports.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Marco
> > > > > > > > >
> > > > > > > > > Carin Meier <carinmeier@gmail.com> wrote on Wed, Aug 14, 2019, 19:43:
> > > > > > > > >
> > > > > > > > > > I would prefer to keep the language bindings in the PR process.
> > > > > > > > > > Perhaps we could do some analytics to see how much each of the
> > > > > > > > > > language bindings is contributing to overall run time.
> > > > > > > > > > If we have some metrics on that, maybe we can come up with a
> > > > > > > > > > guideline for how much time each should take. Another possibility
> > > > > > > > > > is to leverage the parallel builds more.
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Carin.
> > > > > > > > > > >
> > > > > > > > > > > That's a good point. All things considered, would your preference
> > > > > > > > > > > be to keep the Clojure tests as part of the PR process or in
> > > > > > > > > > > Nightly? Some options are having notifications here or in Slack.
> > > > > > > > > > > But if we think breakages would go unnoticed, maybe it is not a
> > > > > > > > > > > good idea to fully remove bindings from the PR process, and we
> > > > > > > > > > > should just streamline the process.
> > > > > > > > > > >
> > > > > > > > > > > Pedro.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <carinmeier@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Before any binding tests are moved to nightly, I think we need to
> > > > > > > > > > > > figure out how the community can get proper notifications of
> > > > > > > > > > > > failure and success on those nightly runs. Otherwise, I think that
> > > > > > > > > > > > breakages would go unnoticed.
> > > > > > > > > > > >
> > > > > > > > > > > > -Carin
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi
> > > > > > > > > > > > >
> > > > > > > > > > > > > Seems we are hitting some problems in CI. I propose the following
> > > > > > > > > > > > > action items to remedy the situation and accelerate turnaround
> > > > > > > > > > > > > times in CI, reduce cost, complexity and probability of failure
> > > > > > > > > > > > > blocking PRs and frustrating developers:
> > > > > > > > > > > > >
> > > > > > > > > > > > > * Upgrade Windows Visual Studio from VS 2015 to VS 2017. The
> > > > > > > > > > > > >   build_windows.py infrastructure should easily work with the new
> > > > > > > > > > > > >   version. Currently some PRs are blocked by this:
> > > > > > > > > > > > >   https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > > > > > > > >   https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > > > * Move non-python bindings tests to nightly. If a commit is touching
> > > > > > > > > > > > >   other bindings, the reviewer should ask for a full run which can be
> > > > > > > > > > > > >   done locally, use the label bot to trigger a full CI build, or defer
> > > > > > > > > > > > >   to nightly.
> > > > > > > > > > > > > * Provide a couple of basic sanity performance tests on small models
> > > > > > > > > > > > >   that are run on CI and can be echoed by the label bot as a comment
> > > > > > > > > > > > >   for PRs.
> > > > > > > > > > > > > * Address unit tests that take more than 10-20s, streamline them or
> > > > > > > > > > > > >   move them to nightly if it can't be done.
> > > > > > > > > > > > > * Open sourcing the remaining CI infrastructure scripts so the
> > > > > > > > > > > > >   community can contribute.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think our goal should be turnaround under 30min.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I would also like to raise with the community that some PRs are not
> > > > > > > > > > > > > being followed up on by committers asking for changes. For example,
> > > > > > > > > > > > > this PR is important and has been hanging for a long time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is another, less important but more trivial to review:
> > > > > > > > > > > > >
> > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think committers requesting changes and not following up in
> > > > > > > > > > > > > reasonable time is not healthy for the project. I suggest configuring
> > > > > > > > > > > > > GitHub notifications for a good SNR and following up.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regards.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >