mxnet-dev mailing list archives

From Bhavin Thaker <>
Subject Re: [Proposal] Stabilizing Apache MXNet CI build system
Date Wed, 01 Nov 2017 16:41:39 GMT
Few comments/suggestions:

1) Can we have this nice list of todo items on the Apache MXNet wiki page
to track them better?

2) Can we have a set of owners for each set of tests and source code
directory? One problem I have observed is that when a test fails, it is
difficult to find an owner who will take responsibility for fixing the
test OR identifying the culprit code promptly -- this causes master to
continue to fail for many days.

3) Specifically, we need an owner for the Windows setup -- nobody seems to
know much about it -- please feel free to correct me if required.

4) +1 to having a list of all feature requests in Jira or a similar,
commonly and easily accessible system.

5) -1 to the branching model. For around 9 months I was the gatekeeper for
the branching model at Informix, deciding when database kernel code could
be merged to master, alongside my day job as a database kernel engineer.
From that experience, my opinion is that a branching model just shifts the
burden from one place to another, and we don't have a dedicated team to
run one. If we really need a buildable master every day, we could just tag
every successful build on master as last_clean_build -- use this tag to
get a clean master at any time. How many Apache projects are doing
development on separate branches?

6) FYI: Rahul (rahul003@) has fixed various warnings in a PR and has added
a test that fails for any warning found. We can build on top of his work.
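The core of such a check can be sketched like this (the log file name and
the grep pattern are illustrative assumptions, not the PR's actual code):

```shell
# Fail CI if the captured compiler output contains any warning line.
check_no_warnings() {
    # returns non-zero when the given build log contains a compiler warning
    ! grep -qE '(^|[ /])warning[: ]' "$1"
}

# Demo: a log with one warning line should make the check fail.
printf 'foo.cc:3:5: warning: unused variable x\n' > /tmp/build.log
if check_no_warnings /tmp/build.log; then
    echo "clean build"
else
    echo "warnings found -- would fail CI"
fi
```

On the demo log this prints "warnings found -- would fail CI".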

7) FYI: For the unit-tests problems, Meghna identified that some of the
unit-test run times have increased significantly in the recent builds. We
need volunteers to help diagnose the root-cause here:

Unit Test Task     | Build #337 | Build #500 | Build #556
-------------------+------------+------------+-----------
Python 2: GPU Win  |            |            |
Python 3: GPU Win  |            |            |
Python 2: CPU      |            |            |
Python 3: CPU      |            |            |

(the per-build run times did not survive the archive formatting)
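One way to narrow down where the run times regressed is to time each test
command individually and list the slowest first. A generic bash sketch
(the test names and commands are placeholders, not MXNet's actual runner;
`date +%s%N` assumes GNU coreutils):

```shell
# Time a single named command in milliseconds and print "<ms> ms  <name>".
time_one() {
    local name=$1; shift
    local start end
    start=$(date +%s%N)                      # nanoseconds (GNU date)
    "$@" > /dev/null 2>&1
    end=$(date +%s%N)
    printf '%8d ms  %s\n' $(( (end - start) / 1000000 )) "$name"
}

# Demo with placeholder "tests", slowest printed first:
{
    time_one test_sleepy sleep 0.05
    time_one test_fast   true
} | sort -rn
```

Comparing such per-test timings between two builds (e.g. #337 vs #556)
would show exactly which tests got slower.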
8) Ensure that all PRs submitted have corresponding documentation. It may
be fine to have documentation follow the code changes as long as there is
ownership to ensure this task is done in a timely manner. For example, I
have requested the Nvidia team to submit PRs to update documentation for
the Volta changes to MXNet.

9) Ensure that mega-PRs have some level of design or architecture
document(s) shared on the Apache MXNet wiki. A mega-PR must include both
unit tests and nightly/integration tests to demonstrate a high level of
quality.

10) Finally, how do we get ownership for code submitted to MXNet? When
something fails in a code segment that only a small set of folks know
about, what is the expected SLA for a response from them? When users deploy
MXNet in production environments, they will expect some form of SLA for
support and a patch release.

Bhavin Thaker.

On Wed, Nov 1, 2017 at 8:20 AM, Pedro Larroy <> wrote:

> +1  That would be great.
> On Mon, Oct 30, 2017 at 5:35 PM, Hen <> wrote:
> > How about we ask for a new mxnet repo to store all the config in?
> >
> > On Fri, Oct 27, 2017 at 05:30 Pedro Larroy <> wrote:
> >
> >> Just to provide a high level overview of the ideas and proposals
> >> coming from different sources for the requirements for testing and
> >> validation of builds:
> >>
> >> * Have Terraform files for the testing infrastructure: infrastructure
> >> as code (IaC), with "single command" replication of the testing
> >> infrastructure and no manual steps. The exception is embedded
> >> hardware, which is neither emulated nor cloud based.
> >>
> >> * CI software based on Jenkins, unless someone thinks there's a better
> >> alternative.
> >>
> >> * Use autoscaling groups and improve staggered build + test steps to
> >> achieve higher parallelism and shorter feedback times.
> >>
> >> * Switch to a branching model based on stable master + integration
> >> branch. PRs are merged into dev/integration which runs extended
> >> nightly tests, which are
> >> then merged into master, preferably in an automated way after
> >> successful extended testing.
> >> Master is always tested, and always buildable. Release branches or
> >> tags in master as usual for releases.
> >>
> >> * Build + test feedback time targeting less than 15 minutes.
> >> (Currently a build on a 16-core machine takes 7 minutes.) This
> >> involves a lot of refactoring of tests: moving expensive tests / big
> >> smoke tests to nightlies on the integration branch, plus tests on
> >> IoT devices and power and performance regressions...
> >>
> >> * Add code coverage and other quality metrics.
> >>
> >> * Eliminate warnings and treat warnings as errors. We have spent time
> >> tracking down "undefined behaviour" bugs that could have been caught
> >> by compiler warnings.
> >>
> >> Is there something I'm missing or additional things that come to your
> >> mind that you would wish to add?
> >>
> >> Pedro.
> >>
