mxnet-dev mailing list archives

From Marco de Abreu <marco.g.ab...@gmail.com>
Subject Re: CI Pipeline Change Proposal
Date Thu, 26 Mar 2020 19:33:22 GMT
Do you know what's driving the duration for sanity? It used to be 50 sec
execution and 60 sec preparation.

-Marco

Joe Evans <joseph.evans@gmail.com> wrote on Thu, 26 Mar 2020, 20:31:

> Thanks Marco and Aaron for your input.
>
> > Can you show by how much the duration will increase?
>
> The average sanity build time is around 10 min, while the average build time
> for unix-cpu is about 2 hours, so the entire build pipeline would take about
> 2 hours longer if we required both unix-cpu and sanity (run in parallel) to
> complete before starting the rest.
>
> I took a look at the CloudWatch metrics we're saving for Jenkins jobs. Here
> is the failure rate per job, based on builds triggered by PRs in the past
> year. As you can see, the sanity build failure rate is still fairly high, so
> gating on it would save a lot of unneeded build jobs.
>
> Job            Successful  Failed  Failure Rate
> sanity               6900    2729        28.34%
> unix-cpu             4268    4786        52.86%
> unix-gpu             3686    5637        60.46%
> centos-cpu           6777    2809        29.30%
> centos-gpu           6318    3350        34.65%
> clang                7879    1588        16.77%
> edge                 7654    1933        20.16%
> miscellaneous        8090    1510        15.73%
> website              7226    2179        23.17%
> windows-cpu          6084    3621        37.31%
> windows-gpu          5191    4721        47.63%
>
> We can start by requiring only the sanity job to complete before triggering
> the rest, and collect data to decide if it makes sense to change it from
> there. Any objections to this approach?
>
> Thanks.
> Joe
>
>
> On Wed, Mar 25, 2020 at 9:35 AM Marco de Abreu <marco.g.abreu@gmail.com>
> wrote:
>
> > A while back I created a system which exports all Jenkins results to
> > CloudWatch. It does not include individual test results, only stages
> > and jobs. The data for the sanity check should be available there.
> >
> > Something I'd also be curious about is the percentage of failures within
> > one run. That is, if a commit failed, did multiple jobs fail (indicating
> > an error in the code) or only one or two (indicating flakiness)? This
> > should give us a proper understanding of how unnecessary these runs
> > really are.
> >
> > -Marco
> >
> > Aaron Markham <aaron.s.markham@gmail.com> wrote on Wed, 25 Mar 2020, 16:53:
> >
> > > +1 for sanity check - that's fast.
> > > -1 for unix-cpu - that's slow and can just hang.
> > >
> > > So my suggestion would be to look at the data separately - what's the
> > > failure rate on the sanity check and on unix-cpu? Actually, can we get a
> > > table of all of the tests with this data?!
> > > If the sanity check fails... let's say 20% of the time, but only takes
> > > a couple of minutes, then ya, let's stack it and do that one first.
> > >
> > > I think unix-cpu needs to be broken apart. It's too complex and fails
> > > in multiple ways. Isolate the brittle parts. Then we can
> > > restart/disable those as needed, while all of the other parts pass and
> > > don't have to be rerun.
> > >
> > > On Wed, Mar 25, 2020 at 1:32 AM Marco de Abreu <marco.g.abreu@gmail.com>
> > > wrote:
> > > >
> > > > We had this structure in the past and the community was bothered by CI
> > > > taking more time, thus we moved to the current model with everything
> > > > parallelized. We'd basically revert that then.
> > > >
> > > > Can you show by how much the duration will increase?
> > > >
> > > > Also, we have zero test parallelisation; that is, we are running one
> > > > test at a time on 72-core machines (although with multiple workers).
> > > > Wouldn't it be way more efficient to add parallelisation and thus
> > > > heavily reduce the time spent on the tasks instead of staggering?
> > > >
> > > > I am concerned that these cost-saving measures are paid for in the form
> > > > of a worse user experience. I see a big potential to save costs by
> > > > increasing efficiency while actually improving the user experience due
> > > > to CI being faster.
> > > >
> > > > -Marco
> > > >
> > > > Joe Evans <joseph.evans@gmail.com> wrote on Wed, 25 Mar 2020, 04:58:
> > > >
> > > > > Hi,
> > > > >
> > > > >
> > > > > First, I just wanted to introduce myself to the MXNet community. I'm Joe
> > > > > and will be working with Chai and the AWS team to address some issues
> > > > > around MXNet CI. One of our goals is to reduce the costs associated with
> > > > > running MXNet CI. The task I'm working on now is this issue:
> > > > >
> > > > >
> > > > > https://github.com/apache/incubator-mxnet/issues/17802
> > > > >
> > > > >
> > > > > Proposal: Staggered Jenkins CI pipeline
> > > > >
> > > > >
> > > > > Based on data collected from Jenkins, around 55% of the time when the
> > > > > mxnet-validation CI build is triggered by a PR, either the sanity or
> > > > > unix-cpu build fails. When either of these builds fails, it doesn't make
> > > > > sense to run the rest of the pipelines and utilize all those resources if
> > > > > we've already identified a build or unit test failure.
> > > > >
> > > > >
> > > > > We are proposing changing the MXNet Jenkins CI pipeline by requiring the
> > > > > *sanity* and *unix-cpu* builds to complete and pass tests successfully
> > > > > before starting the other build pipelines (centos-cpu/gpu, unix-gpu,
> > > > > windows-cpu/gpu, etc.). Once the sanity builds successfully complete, the
> > > > > remaining build pipelines will be triggered and run in parallel (as they
> > > > > currently do). The purpose of this change is to identify faulty code or
> > > > > compatibility issues early and prevent further execution of CI builds.
> > > > > This will increase the time required to test a PR, but will prevent
> > > > > unnecessary builds from running.
> > > > >
> > > > >
> > > > > Does anyone have any concerns with this change or suggestions?
> > > > >
> > > > >
> > > > > Thanks.
> > > > >
> > > > > Joe Evans
> > > > >
> > > > > joseph.evans@gmail.com
> > > > >
> > >
> >
>
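
The failure rates in the per-job table quoted in this thread can be recomputed from the successful and failed counts as failed / (successful + failed). The following is a minimal sketch of that arithmetic; the counts are copied from the table, while the names `JOB_COUNTS` and `failure_rate` are illustrative and not part of any MXNet CI tooling.

```python
# (successful, failed) build counts per job, copied from the table in the thread.
JOB_COUNTS = {
    "sanity": (6900, 2729),
    "unix-cpu": (4268, 4786),
    "unix-gpu": (3686, 5637),
    "centos-cpu": (6777, 2809),
    "centos-gpu": (6318, 3350),
    "clang": (7879, 1588),
    "edge": (7654, 1933),
    "miscellaneous": (8090, 1510),
    "website": (7226, 2179),
    "windows-cpu": (6084, 3621),
    "windows-gpu": (5191, 4721),
}


def failure_rate(successful, failed):
    """Return the fraction of builds that failed (hypothetical helper)."""
    total = successful + failed
    return failed / total if total else 0.0


if __name__ == "__main__":
    # Print one line per job with the failure rate as a percentage.
    for job, (successful, failed) in JOB_COUNTS.items():
        print(f"{job:<14} {failure_rate(successful, failed):.2%}")
```

Computing the rates from the raw counts, rather than copying the percentage column, makes it easy to re-run the same check on fresh CloudWatch exports.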

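Marco's question about how many jobs fail within a single run (many failures suggesting a code error, one or two suggesting flakiness) could be explored with a simple per-run classification. This is only a sketch under stated assumptions: the run data below is made up for illustration, the threshold of two failed jobs is taken from the "only one or two" wording in the thread, and `classify_run` is a hypothetical helper, not an existing Jenkins or CloudWatch API.

```python
from collections import Counter


def classify_run(job_results, flaky_threshold=2):
    """Classify one CI run from a mapping of job name -> passed (bool).

    Assumption: up to `flaky_threshold` failed jobs hints at flakiness,
    more than that hints at an actual error in the code.
    """
    failed = [job for job, passed in job_results.items() if not passed]
    if not failed:
        return "passed"
    if len(failed) <= flaky_threshold:
        return "possibly flaky"
    return "likely code error"


# Hypothetical sample runs, NOT real MXNet CI data.
SAMPLE_RUNS = [
    {"sanity": True, "unix-cpu": True, "unix-gpu": True, "windows-cpu": True},
    {"sanity": True, "unix-cpu": False, "unix-gpu": True, "windows-cpu": True},
    {"sanity": False, "unix-cpu": False, "unix-gpu": False, "windows-cpu": False},
]

if __name__ == "__main__":
    # Count how many runs fall into each category.
    summary = Counter(classify_run(run) for run in SAMPLE_RUNS)
    for label, count in summary.items():
        print(f"{label}: {count}")
```

A real analysis would need per-run job results exported from Jenkins; the heuristic only separates runs by failure count and cannot by itself prove that a failure was flaky.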