mxnet-dev mailing list archives

From Chris Olivier <cjolivie...@gmail.com>
Subject Re: CI and PRs
Date Sat, 24 Aug 2019 02:43:31 GMT
Pedro,

I don’t see where Marco says that he “designed and implemented all aspects
of CI by himself”.  I do think, however, that it’s fair to say that Marco
was in charge of the design and most likely made the majority of design
decisions as the CI was being built, especially around those tenets that
he mentioned.  I know this because before I submitted Marco as a committer,
I asked some of his teammates whether Marco was really responsible for CI, and
the answer from everyone I asked was that CI was Marco's baby and he did most
of it by some large margin (I am paraphrasing).  Taking other design inputs
and examples (e.g. Apache CI) is all part of any responsible design process.

In addition, I don't understand the obfuscation of “people who
contributed to CI”, “person/people who designed CI”, or even
"person who oversees CI" as it is weaponized in your email.  Again, nowhere
did Marco say that he did everything back then or since then.  I don't
think it's fair to try to modify what Marco wrote and then try to turn it
against him.  Reminds me of the techniques of network news these days,
quite frankly (whichever side you're "on" doesn't matter, because both
sides do it).

-Chris





On Fri, Aug 23, 2019 at 3:56 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
wrote:

> Thanks for your response Marco, I think you have totally missed my original
> point which was basically that someone volunteering effort on the CI is as
> important as someone contributing a feature. From my perspective this
> hasn't been the case, and we had to rely a lot on you and Sheng to submit
> fixes which required access, also to relay communication with Apache infra.
> Also in many cases we had to rely on you to channel fixes, PRs, disable
> tests etc. If the community is fine having this kind of bottleneck, fine
> with me. From my point of view, and based on feedback from other
> people who contributed to CI, this was not always a good experience.
> Having a welcoming and inclusive community is very important. I don't want
> to start a discussion on this, but invite the community to do a bit of soul
> searching on this topic, now that the infrastructure is open source.
>
> I also find it surprising that you claim that you designed the CI yourself,
> when this was the joint work of many individuals, including the old Apache CI,
> additional contributions and code reviewers, people who were on call for
> this service, and the autoscaling approach, which if I remember correctly came
> from a humble servant. Kellen did a lot of pair programming and code
> reviews. Obviously you have done a lot of work on CI which has had a huge
> positive impact on the project and your recognition is well deserved. The
> technical details you mention in your email are perfectly true and valid.
>
> Below is a rough list of individuals who contributed to CI, I would like to
> thank all of them since without this work, we wouldn't be able to deliver
> with the quality that we have done in the past.
>
>
> pllarroy@mac:0: ~/d/m/ci [fc_higher_order_grad_2]> git log --pretty=format:%aN . | sort | uniq -c | sort -n | tail -n 10
>    6 Zach Kimberg
>    6 stu1130
>    7 Jake Lee
>    8 Aaron Markham
>   11 Lanking
>   12 Anton Chernov
>   13 perdasilva
>   26 Kellen Sunderland
>   34 Marco de Abreu
>   46 Pedro Larroy
>
> pllarroy@mac:0: ~/d/mxnet_ci_general [master]> git log --pretty=format:%aN | sort | uniq -c | sort -n
>    1 Gavin M. Bell
>    1 de Abreu
>    6 Bair
>    7 Kellen Sunderland
>    8 Jose Luis Contreras
>   14 perdasilva
>   20 Per Goncalves da Silva
>   29 Anton Chernov
>   39 Chance Bair
>   96 Pedro Larroy
>  209 Marco de Abreu
>
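A minimal Python sketch of the same tally that the
`git log --pretty=format:%aN | sort | uniq -c | sort -n` pipelines above
produce, for anyone who wants to post-process the counts. The function name
`tally_authors` and the optional `aliases` mapping are illustrative
assumptions and not part of the MXNet CI tooling; whether two author strings
belong to the same person is left to the caller.

```python
from collections import Counter


def tally_authors(log_output, aliases=None):
    """Count commits per author name, least active first.

    `log_output` is the text printed by `git log --pretty=format:%aN`
    (one author name per line). `aliases` is an optional, caller-supplied
    mapping from one spelling of a name to another (hypothetical helper,
    used only if the caller knows two spellings are the same person).
    """
    aliases = aliases or {}
    counts = Counter()
    for line in log_output.splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines
        counts[aliases.get(name, name)] += 1
    # Ascending by count, then by name, similar to `sort | uniq -c | sort -n`.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


def format_tally(tally):
    """Render the tally in the same `count name` layout as `uniq -c`."""
    return "\n".join(f"{count:4d} {name}" for name, count in tally)
```

To use it, pipe the output of `git log --pretty=format:%aN` into a string and
pass it to `tally_authors`; `format_tally` then prints one `count name` row
per author.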
>
>
> Pedro.
>
> On Fri, Aug 23, 2019 at 3:18 PM Marco de Abreu <marco.g.abreu@gmail.com>
> wrote:
>
> > I've heard this request multiple times and so far, I'm having issues
> > understanding the direct correlation between having committer permissions
> > and being able to manage CI.
> >
> > When I designed the CI, one of the tenets was maintainability and
> > accessibility for the community: I wanted to avoid somebody needing
> > certain privileges in order to execute regular actions. The result was the
> > strong usage of Jenkinsfiles, Dockerfiles and the runtime functions. The
> > combination of these techniques allowed somebody to create a job from the
> > process flow level (Jenkinsfile), over the environment level (Dockerfile)
> > to the individual action level (runtime functions). This design basically
> > gives the community full access over the entire flow.
> >
> > The jobs are configured to source only the Jenkinsfile. Jenkins supports a
> > lot of different ways to define pipelines, but I have made sure to
> > encourage everybody to use only Jenkinsfiles. This makes sure that no
> > configuration is done in the web interface. This first of all alleviates the
> > permission issue, since there's literally no config in the web interface, and
> > second it allows auditing, since all changes have to be made in the MXNet
> > GitHub repository.
> >
> > Committers have elevated permissions in Jenkins. These include the
> > permission to run, stop and configure jobs. All other permissions are
> > restricted to system administrators for the sake of ensuring stability of
> > the system. On the dev-CI on the other hand, we're happy to add people so
> > they can experiment as much as they want. The transition to prod-CI is then
> > assisted by me to ensure smooth operations and adherence to best
> > practices (like using our Jenkinsfiles and Docker structure, for example).
> >
> > The only case where somebody would need elevated permissions is if they
> > would like to change system settings. But at that point, we're talking
> > about instance settings and AWS account configuration. Since that now
> > reaches into the next permission level, which is restricted to the donor of
> > the CI system - Amazon Web Services - this is something that not even PMC
> > members will receive. The same policy is in place for the official Apache
> > CI: Committers/PMCs can configure their job, but don't have system level
> > access to either Jenkins or the underlying AWS account for obvious reasons.
> > We're trying to stay in line with the same policy, but in the past I've
> > granted Jenkins administrator access to people who required elevated access
> > to properly do their job - Aaron Markham with regards to the website being
> > one example.
> >
> > This means that the only case when a contributor needs committer assistance
> > is the moment when somebody would like to set up a new Jenkins job. It
> > would be a matter of setting up the job to point to the person's branch -
> > Jenkins will then automatically pull the Jenkinsfile and thus no further
> > configuration is necessary and updates are directly consumed. Such a
> > request IMO is on the same level as us having to cut a ticket to Apache
> > INFRA to create a new job.
> >
> > With regards to speed: So far, I was the only "CI-Person" with committer
> > privileges. But due to our 4-eye-rule for PRs, I wasn't able to merge my
> > own changes anyways - most of them were reviewed by Sheng, for example. In
> > an emergency, I'm sure that somebody can be reached to assist since we
> > currently have 39 PMC members and 20 committers spanning multiple
> > timezones.
> >
> > For these reasons, I don't agree with the sentiment that contributors are
> > unable to effectively work with the CI system unless they have committer
> > privileges.
> >
> > Best regards,
> > Marco
> >
> >
> > On Fri, Aug 23, 2019 at 10:33 AM Pedro Larroy <
> > pedro.larroy.lists@gmail.com>
> > wrote:
> >
> > > As Marco has open sourced the bulk of the CI infrastructure donated from
> > > Amazon to the community, I would like to raise the recommendation that the
> > > community takes action to help volunteers working on the CI have a better
> > > experience. In the past, it's my impression that there hasn't been much
> > > action granting PMC or committer privileges to engineers volunteering to
> > > help CI other than Marco. Doing so would encourage more contributions and
> > > help expedite critical fixes and corrective actions. I think the lack of
> > > such privileges, together with the lack of recognition for such a critical
> > > activity, has kept those individuals from being as effective as they could
> > > be. I'm not sure about the cause, but I believe this is something that
> > > should be rectified for future contributions and help on the CI front if
> > > improvements are desired.
> > >
> > > In Spanish we have a saying: "es de bien nacido ser agradecido"
> > > (roughly, "being grateful is a sign of good upbringing").
> > >
> > > Pedro.
> > >
> > > On Fri, Aug 16, 2019 at 4:03 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > wrote:
> > >
> > > > Hi Aaron. This is difficult to diagnose, because I don't know what to do
> > > > when the hash of the layer in Docker doesn't match and Docker decides to
> > > > rebuild it. The R script seems not to have changed. I have observed this
> > > > in the past and I think it is due to bugs in Docker. Maybe Kellen is able
> > > > to give some tips here.
> > > >
> > > > In this case you should use -R, which is already in master (you can always
> > > > copy the script on top if you are in an older revision).
> > > >
> > > > Another thing that worked for me in the past was to completely nuke the
> > > > Docker cache, so it redownloads from the CI repo. After that it worked fine
> > > > in some cases.
> > > >
> > > > These two workarounds are not ideal, but should unblock you.
> > > >
> > > > Pedro.
> > > >
> > > > On Fri, Aug 16, 2019 at 11:39 AM Aaron Markham <aaron.s.markham@gmail.com>
> > > > wrote:
> > > >
> > > >> Is -R already in there?
> > > >>
> > > >> Here's an example of it happening to me right now.... I am making
> > > >> minor changes to the runtime_functions logic for handling the R docs
> > > >> output. I pull the fix, then run the container, but I see the R deps
> > > > >> layer re-running. I didn't touch that. Why is that running again?
> > > >>
> > > >> From https://github.com/aaronmarkham/incubator-mxnet
> > > >>    f71cc6d..deec6aa  new_website_pipeline_2_aaron_rdocs ->
> > > >> origin/new_website_pipeline_2_aaron_rdocs
> > > >> Updating f71cc6d..deec6aa
> > > >> Fast-forward
> > > >>  ci/docker/runtime_functions.sh | 6 +++---
> > > >>  1 file changed, 3 insertions(+), 3 deletions(-)
> > > >> (base) ubuntu@ip-172-31-47-182:~/aaron/ci$ ./build.py
> > > >> --docker-registry mxnetci --platform ubuntu_cpu_r
> > > >> --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh
> > > >> build_r_docs
> > > > >> build.py: 2019-08-16 18:34:44,639Z INFO MXNet container based build tool.
> > > > >> build.py: 2019-08-16 18:34:44,641Z INFO Docker cache download is
> > > > >> enabled from registry mxnetci
> > > > >> build.py: 2019-08-16 18:34:44,641Z INFO Loading Docker cache for
> > > > >> mxnetci/build.ubuntu_cpu_r from mxnetci
> > > > >> Using default tag: latest
> > > > >> latest: Pulling from mxnetci/build.ubuntu_cpu_r
> > > > >> Digest: sha256:7dc515c288b3e66d96920eb8975f985a501bb57f70595fbe0cb1c4fcd8d4184b
> > > > >> Status: Downloaded newer image for mxnetci/build.ubuntu_cpu_r:latest
> > > > >> build.py: 2019-08-16 18:34:44,807Z INFO Successfully pulled docker cache
> > > > >> build.py: 2019-08-16 18:34:44,807Z INFO Building docker container
> > > > >> tagged 'mxnetci/build.ubuntu_cpu_r' with docker
> > > > >> build.py: 2019-08-16 18:34:44,807Z INFO Running command: 'docker build
> > > > >> -f docker/Dockerfile.build.ubuntu_cpu_r --build-arg USER_ID=1000
> > > > >> --build-arg GROUP_ID=1000 --cache-from mxnetci/build.ubuntu_cpu_r -t
> > > > >> mxnetci/build.ubuntu_cpu_r docker'
> > > >> Sending build context to Docker daemon  289.8kB
> > > >> Step 1/15 : FROM ubuntu:16.04
> > > >>  ---> 5e13f8dd4c1a
> > > >> Step 2/15 : WORKDIR /work/deps
> > > >>  ---> Using cache
> > > >>  ---> afc2a135945d
> > > >> Step 3/15 : COPY install/ubuntu_core.sh /work/
> > > >>  ---> Using cache
> > > >>  ---> da2b2e7f35e1
> > > >> Step 4/15 : RUN /work/ubuntu_core.sh
> > > >>  ---> Using cache
> > > >>  ---> d1e88b26b1d2
> > > >> Step 5/15 : COPY install/deb_ubuntu_ccache.sh /work/
> > > >>  ---> Using cache
> > > >>  ---> 3aa97dea3b7b
> > > >> Step 6/15 : RUN /work/deb_ubuntu_ccache.sh
> > > >>  ---> Using cache
> > > >>  ---> bec503f1d149
> > > >> Step 7/15 : COPY install/ubuntu_r.sh /work/
> > > >>  ---> c5e77c38031d
> > > >> Step 8/15 : COPY install/r.gpg /work/
> > > >>  ---> d8cdbf015d2b
> > > >> Step 9/15 : RUN /work/ubuntu_r.sh
> > > >>  ---> Running in c6c90b9e1538
> > > >> ++ dirname /work/ubuntu_r.sh
> > > >> + cd /work
> > > >> + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> > > >> + apt-key add r.gpg
> > > >> OK
> > > >> + add-apt-repository 'deb [arch=amd64,i386]
> > > >> https://cran.rstudio.com/bin/linux/ubuntu xenial/'
> > > >> + apt-get update
> > > >> Ign:1 http://cran.rstudio.com/bin/linux/ubuntu trusty/ InRelease
> > > >>
> > > >> On Fri, Aug 16, 2019 at 11:32 AM Pedro Larroy
> > > >> <pedro.larroy.lists@gmail.com> wrote:
> > > >> >
> > > >> > Also, I forgot, another workaround is that I added the -R flag to the
> > > >> > build logic (build.py) so the container is not rebuilt for manual use.
> > > >> >
> > > >> > On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > >
> > > >> > > Hi Aaron.
> > > >> > >
> > > >> > > As Marco explained, if you are in master the cache usually works. There
> > > >> > > are two issues that I have observed:
> > > >> > >
> > > >> > > 1 - Docker doesn't automatically pull the base image (e.g. ubuntu:16.04),
> > > >> > > so if your cached base, which is used in the FROM statement, becomes
> > > >> > > outdated, your caching won't work. Running docker pull ubuntu:16.04, or
> > > >> > > pulling the base images used by the container, helps with this.
> > > >> > >
> > > >> > > 2 - There's another situation where the above doesn't help, which seems
> > > >> > > to be an unidentified issue with the Docker cache:
> > > >> > > https://github.com/docker/docker.github.io/issues/8886
> > > >> > >
> > > >> > > We can get a short-term workaround for #1 by explicitly pulling bases
> > > >> > > from the script, but I think Docker should do it when using --cache-from,
> > > >> > > so maybe contributing a patch to Docker would be the best approach.
> > > >> > >
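The short-term workaround for #1 (explicitly pulling base images before
building with --cache-from) could be sketched roughly as follows. This is
only an illustration under stated assumptions: `base_images` and
`pull_commands` are hypothetical helper names, not functions that exist in
build.py, and the FROM parsing is deliberately simplistic.

```python
def base_images(dockerfile_text):
    """Return the external base images named in FROM lines, in order.

    Skips `scratch` and names of earlier build stages (FROM ... AS name),
    since neither can be pulled from a registry.
    """
    images = []
    stages = set()
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Drop options such as --platform=linux/amd64.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if not args:
            continue
        image = args[0]
        if len(args) >= 3 and args[1].upper() == "AS":
            stages.add(args[2])
        if image != "scratch" and image not in stages and image not in images:
            images.append(image)
    return images


def pull_commands(dockerfile_text):
    """Build one `docker pull <image>` argument list per base image."""
    return [["docker", "pull", image] for image in base_images(dockerfile_text)]
```

Each returned argument list could be handed to `subprocess.run(..., check=True)`
before the `docker build --cache-from ...` step, so that the local copy of the
base image is refreshed first.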
> > > >> > > Pedro
> > > >> > >
> > > >> > > On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <aaron.s.markham@gmail.com>
> > > >> > > wrote:
> > > >> > >
> > > >> > >> When you create a new Dockerfile and use that on CI, it doesn't seem
> > > >> > >> to cache some of the steps... like this:
> > > >> > >>
> > > >> > >> Step 13/15 : RUN /work/ubuntu_docs.sh
> > > >> > >>  ---> Running in a1e522f3283b
> > > >> > >> + echo 'Installing dependencies...'
> > > >> > >> + apt-get update
> > > >> > >> Installing dependencies.
> > > >> > >>
> > > >> > >> Or this....
> > > >> > >>
> > > >> > >> Step 4/13 : RUN /work/ubuntu_core.sh
> > > >> > >>  ---> Running in e7882d7aa750
> > > >> > >> + apt-get update
> > > >> > >>
> > > >> > >> I'd get it if I were changing those scripts, but then I'd think it should
> > > >> > >> cache after running it once... but, no.
> > > >> > >>
> > > >> > >>
> > > >> > >> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <marco.g.abreu@gmail.com>
> > > >> > >> wrote:
> > > >> > >> >
> > > >> > >> > Do I understand it correctly that you are saying that the Docker cache
> > > >> > >> > doesn't work properly and regularly reinstalls dependencies? Or do you
> > > >> > >> > mean that you only have cache misses when you modify the dependencies -
> > > >> > >> > which would be expected?
> > > >> > >> >
> > > >> > >> > -Marco
> > > >> > >> >
> > > >> > >> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <aaron.s.markham@gmail.com>
> > > >> > >> > wrote:
> > > >> > >> >
> > > >> > >> > > Many of the CI pipelines follow this pattern:
> > > >> > >> > > Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
> > > >> > >> > > repeat steps 1-3 over and over?
> > > >> > >> > >
> > > >> > >> > > Now, some tests use a stashed binary and docker cache. And I see this work
> > > >> > >> > > locally, but for the most part, on CI, you're gonna sit through a
> > > >> > >> > > dependency install.
> > > >> > >> > >
> > > >> > >> > > I noticed that almost all jobs use an ubuntu setup that is fully loaded.
> > > >> > >> > > Without cache, it can take 10 or more minutes to build. So I made a lite
> > > >> > >> > > version. Takes only a few minutes instead.
> > > >> > >> > >
> > > >> > >> > > In some cases archiving worked great to share across pipelines, but as
> > > >> > >> > > Marco mentioned we need a storage solution to make that happen. We can't
> > > >> > >> > > archive every intermediate artifact for each PR.
> > > >> > >> > >
> > > >> > >> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > >> > >> > > wrote:
> > > >> > >> > >
> > > >> > >> > > > Hi Aaron. Why does that speed things up? What's the difference?
> > > >> > >> > > >
> > > >> > >> > > > Pedro.
> > > >> > >> > > >
> > > >> > >> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <aaron.s.markham@gmail.com>
> > > >> > >> > > > wrote:
> > > >> > >> > > >
> > > >> > >> > > > > The PRs Thomas and I are working on for the new docs and website share the
> > > >> > >> > > > > mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> > > >> > >> > > > >
> > > >> > >> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <cjolivier01@gmail.com> wrote:
> > > >> > >> > > > >
> > > >> > >> > > > > > I see it done daily now, and while I can’t share all the details, it’s not
> > > >> > >> > > > > > an incredibly complex thing, and involves not much more than nfs/efs
> > > >> > >> > > > > > sharing and remote ssh commands.  All it takes is a little ingenuity and
> > > >> > >> > > > > > some imagination.
> > > >> > >> > > > > >
> > > >> > >> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > >> > >> > > > > > wrote:
> > > >> > >> > > > > >
> > > >> > >> > > > > > > Sounds good in theory. I think there are complex details with regard to
> > > >> > >> > > > > > > resource sharing during parallel execution. Still, I think both ways can be
> > > >> > >> > > > > > > explored. I think some tests run for unreasonably long times for what they
> > > >> > >> > > > > > > are doing. We already scale parts of the pipeline horizontally across
> > > >> > >> > > > > > > workers.
> > > >> > >> > > > > > >
> > > >> > >> > > > > > >
> > > >> > >> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cjolivier01@apache.org>
> > > >> > >> > > > > > > wrote:
> > > >> > >> > > > > > >
> > > >> > >> > > > > > > > +1
> > > >> > >> > > > > > > >
> > > >> > >> > > > > > > > Rather than remove tests (which doesn’t scale as a solution), why not scale
> > > >> > >> > > > > > > > them horizontally so that they finish more quickly? Across processes or
> > > >> > >> > > > > > > > even on a pool of machines that aren’t necessarily the build machine?
> > > >> > >> > > > > > > >
> > > >> > >> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <marco.g.abreu@gmail.com>
> > > >> > >> > > > > > > > wrote:
> > > >> > >> > > > > > > >
> > > >> > >> > > > > > > > > With regard to time, I would rather we spend a bit more time on
> > > >> > >> > > > > > > > > maintenance than have somebody run into an error that could've been caught
> > > >> > >> > > > > > > > > with a test.
> > > >> > >> > > > > > > > >
> > > >> > >> > > > > > > > > I mean, our Publishing pipeline for Scala GPU has been broken for quite
> > > >> > >> > > > > > > > > some time now, but nobody noticed that. Basically my stance on that matter
> > > >> > >> > > > > > > > > is that as soon as something is not blocking, you can also just deactivate
> > > >> > >> > > > > > > > > it since you don't have a forcing function in an open source project.
> > > >> > >> > > > > > > > > People will rarely come back and fix the errors of some nightly test that
> > > >> > >> > > > > > > > > they introduced.
> > > >> > >> > > > > > > > >
> > > >> > >> > > > > > > > > -Marco
> > > >> > >> > > > > > > > >
> > > >> > >> > > > > > > > > Carin Meier <carinmeier@gmail.com> wrote on Wed, Aug 14, 2019, 21:59:
> > > >> > >> > > > > > > > >
> > > >> > >> > > > > > > > > > If a language binding test is failing for an unimportant reason, then it
> > > >> > >> > > > > > > > > > is too brittle and needs to be fixed (we have fixed some of these with the
> > > >> > >> > > > > > > > > > Clojure package [1]).
> > > >> > >> > > > > > > > > > But in general, if we think of the MXNet project as one project that is
> > > >> > >> > > > > > > > > > across all the language bindings, then we want to know if some fundamental
> > > >> > >> > > > > > > > > > code change is going to break a downstream package.
> > > >> > >> > > > > > > > > > I can't speak for all the high level package binding maintainers, but I'm
> > > >> > >> > > > > > > > > > always happy to pitch in to provide code fixes to help the base PR get
> > > >> > >> > > > > > > > > > green.
> > > >> > >> > > > > > > > > >
> > > >> > >> > > > > > > > > > The time costs to maintain such a large CI project obviously need to be
> > > >> > >> > > > > > > > > > considered as well.
> > > >> > >> > > > > > > > > >
> > > >> > >> > > > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > >> > >> > > > > > > > > >
> > > >> > >> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > >> > >> > > > > > > > > > wrote:
> > > >> > >> > > > > > > > > >
> > > >> > >> > > > > > > > > > > From what I have seen Clojure is 15 minutes, which I think is reasonable.
> > > >> > >> > > > > > > > > > > The only question is that when a binding such as R, Perl or Clojure fails,
> > > >> > >> > > > > > > > > > > some devs are a bit confused about how to fix them since they are not
> > > >> > >> > > > > > > > > > > familiar with the testing tools and the language.
> > > >> > >> > > > > > > > > > >
> > > >> > >> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <carinmeier@gmail.com> wrote:
> > > >> > >> > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > Great idea Marco! Anything that you think would be valuable to share would
> > > >> > >> > > > > > > > > > > > be good. The duration of each node in the test stage sounds like a good
> > > >> > >> > > > > > > > > > > > start.
> > > >> > >> > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > - Carin
> > > >> > >> > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <marco.g.abreu@gmail.com>
> > > >> > >> > > > > > > > > > > > wrote:
> > > >> > >> > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > Hi,
> > > >> > >> > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > we record a bunch of metrics about run statistics (down to the duration of
> > > >> > >> > > > > > > > > > > > > every individual step). If you tell me which ones you're particularly
> > > >> > >> > > > > > > > > > > > > interested in (probably total duration of each node in the test stage), I'm
> > > >> > >> > > > > > > > > > > > > happy to provide them.
> > > >> > >> > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > Dimensions are (in hierarchical order):
> > > >> > >> > > > > > > > > > > > > - job
> > > >> > >> > > > > > > > > > > > > - branch
> > > >> > >> > > > > > > > > > > > > - stage
> > > >> > >> > > > > > > > > > > > > - node
> > > >> > >> > > > > > > > > > > > > - step
> > > >> > >> > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > Unfortunately I don't have the possibility to export them since we store
> > > >> > >> > > > > > > > > > > > > them in CloudWatch Metrics which afaik doesn't offer raw exports.
> > > >> > >> > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > Best regards,
> > > >> > >> > > > > > > > > > > > > Marco
> > > >> > >> > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > Carin Meier <carinmeier@gmail.com> wrote on Wed, Aug 14, 2019, 19:43:
> > > >> > >> > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > I would prefer to keep the language binding in the PR process. Perhaps we
> > > >> > >> > > > > > > > > > > > > > could do some analytics to see how much each of the language bindings is
> > > >> > >> > > > > > > > > > > > > > contributing to overall run time.
> > > >> > >> > > > > > > > > > > > > > If we have some metrics on that, maybe we can come up with a guideline of
> > > >> > >> > > > > > > > > > > > > > how much time each should take. Another possibility is to leverage the
> > > >> > >> > > > > > > > > > > > > > parallel builds more.
> > > >> > >> > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <pedro.larroy.lists@gmail.com>
> > > >> > >> > > > > > > > > > > > > > wrote:
> > > >> > >> > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > Hi Carin.
> > > >> > >> > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > That's a good point. All things considered, would your preference be to keep
> > > >> > >> > > > > > > > > > > > > > > the Clojure tests as part of the PR process or in Nightly?
> > > >> > >> > > > > > > > > > > > > > > Some options are having notifications here or in Slack. But if we think
> > > >> > >> > > > > > > > > > > > > > > breakages would go unnoticed, maybe it is not a good idea to fully remove
> > > >> > >> > > > > > > > > > > > > > > bindings from the PR process, and we should just streamline the process instead.
> > > >> > >> > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > Pedro.
> > > >> > >> > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <carinmeier@gmail.com> wrote:
> > > >> > >> > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > Before any binding tests are moved to nightly, I think we need to figure
> > > >> > >> > > > > > > > > > > > > > > > out how the community can get proper notifications of failure and success
> > > >> > >> > > > > > > > > > > > > > > > on those nightly runs. Otherwise, I think that breakages would go
> > > >> > >> > > > > > > > > > > > > > > > unnoticed.
> > > >> > >> > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > -Carin
> > > >> > >> > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <pedro.larroy.lists@gmail.com> wrote:
> > > >> > >> > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > Hi
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > > > It seems we are hitting some problems in CI. I propose the following action
> > > >> > >> > > > > > > > > > > > > > > > > > > items to remedy the situation, accelerate turnaround times in CI, and
> > > >> > >> > > > > > > > > > > > > > > > > > > reduce the cost, complexity and probability of failures blocking PRs and
> > > >> > >> > > > > > > > > > > > > > > > > > > frustrating developers:
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > > > * Upgrade Visual Studio on Windows from VS 2015 to VS 2017. The
> > > >> > >> > > > > > > > > > > > > > > > > > >   build_windows.py infrastructure should easily work with the new version.
> > > >> > >> > > > > > > > > > > > > > > > > > >   Currently some PRs are blocked by this:
> > > >> > >> > > > > > > > > > > > > > > > > > >   https://github.com/apache/incubator-mxnet/issues/13958
> > > >> > >> > > > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > >> > >> > > > > > > > > > > > > > > > > > >   https://github.com/apache/incubator-mxnet/issues/15295
> > > >> > >> > > > > > > > > > > > > > > > > > > * Move non-Python bindings tests to nightly. If a commit is touching other
> > > >> > >> > > > > > > > > > > > > > > > > > >   bindings, the reviewer should ask for a full run, which can be done
> > > >> > >> > > > > > > > > > > > > > > > > > >   locally, use the label bot to trigger a full CI build, or defer to nightly.
> > > >> > >> > > > > > > > > > > > > > > > > > > * Provide a couple of basic sanity performance tests on small models that
> > > >> > >> > > > > > > > > > > > > > > > > > >   are run on CI and can be echoed by the label bot as a comment for PRs.
> > > >> > >> > > > > > > > > > > > > > > > > > > * Address unit tests that take more than 10-20s: streamline them, or move
> > > >> > >> > > > > > > > > > > > > > > > > > >   them to nightly if that can't be done.
> > > >> > >> > > > > > > > > > > > > > > > > > > * Open source the remaining CI infrastructure scripts so the community
> > > >> > >> > > > > > > > > > > > > > > > > > >   can contribute.
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > > > I think our goal should be turnaround under 30min.
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > > > I would also like to raise with the community that some PRs are not being
> > > >> > >> > > > > > > > > > > > > > > > > > > followed up on by the committers who asked for changes. For example, this
> > > >> > >> > > > > > > > > > > > > > > > > > > PR is important and has been hanging for a long time.
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > > > This is another, less important but more trivial to review:
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > > > I think committers requesting changes and not following up in reasonable
> > > >> > >> > > > > > > > > > > > > > > > > > > time is not healthy for the project. I suggest configuring GitHub
> > > >> > >> > > > > > > > > > > > > > > > > > > notifications for a good SNR and following up.
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > Regards.
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > Pedro.
> > > >> > >> > > > > > > > > > > > > > > > >
