From dev-return-7462-archive-asf-public=cust-asf.ponee.io@mxnet.incubator.apache.org  Thu Mar 26 19:46:13 2020
Return-Path: <dev-return-7462-archive-asf-public=cust-asf.ponee.io@mxnet.incubator.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [207.244.88.153])
	by mx-eu-01.ponee.io (Postfix) with SMTP id 402EE180637
	for <archive-asf-public@cust-asf.ponee.io>; Thu, 26 Mar 2020 20:46:13 +0100 (CET)
Received: (qmail 50809 invoked by uid 500); 26 Mar 2020 19:46:12 -0000
Mailing-List: contact dev-help@mxnet.incubator.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:dev-help@mxnet.incubator.apache.org>
List-Unsubscribe: <mailto:dev-unsubscribe@mxnet.incubator.apache.org>
List-Post: <mailto:dev@mxnet.incubator.apache.org>
List-Id: <dev.mxnet.incubator.apache.org>
Reply-To: dev@mxnet.incubator.apache.org
Delivered-To: mailing list dev@mxnet.incubator.apache.org
Received: (qmail 50791 invoked by uid 99); 26 Mar 2020 19:46:12 -0000
Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142)
    by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 26 Mar 2020 19:46:12 +0000
Received: from localhost (localhost [127.0.0.1])
	by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 8FB0218137A
	for <dev@mxnet.apache.org>; Thu, 26 Mar 2020 19:46:11 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org
X-Spam-Flag: NO
X-Spam-Score: 0
X-Spam-Level:
X-Spam-Status: No, score=0 tagged_above=-999 required=6.31
	tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
	DKIM_VALID_EF=-0.1, HTML_MESSAGE=0.2, RCVD_IN_DNSWL_NONE=-0.0001,
	RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001,
	URIBL_BLOCKED=0.001] autolearn=disabled
Authentication-Results: spamd3-us-west.apache.org (amavisd-new);
	dkim=pass (2048-bit key) header.d=gmail.com
Received: from mx1-ec2-va.apache.org ([10.40.0.8])
	by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024)
	with ESMTP id n4chkWW1Chl9 for <dev@mxnet.apache.org>;
	Thu, 26 Mar 2020 19:46:09 +0000 (UTC)
Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=209.85.166.46; helo=mail-io1-f46.google.com; envelope-from=joseph.evans@gmail.com; receiver=<UNKNOWN> 
Received: from mail-io1-f46.google.com (mail-io1-f46.google.com [209.85.166.46])
	by mx1-ec2-va.apache.org (ASF Mail Server at mx1-ec2-va.apache.org) with ESMTPS id 3D9EFBB818
	for <dev@mxnet.incubator.apache.org>; Thu, 26 Mar 2020 19:46:09 +0000 (UTC)
Received: by mail-io1-f46.google.com with SMTP id y24so7375897ioa.8
        for <dev@mxnet.incubator.apache.org>; Thu, 26 Mar 2020 12:46:09 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to;
        bh=HCgDV691XfyCBftFqK2XSgc89QgiPPTyMkJ9Vk/H1a0=;
        b=LMGHztr9kCfSvGcky7rII4sJXcSWA/a1FJ8X0VkJw/om0T/qzJZqgtwZhz91gXo+5U
         6afGtI3+5xNfngpK34w75i78vjpWjLhvqi7zuIqNzvL8IQmMJho7wgPpmXo/I0EKAkyb
         Co8T9aYJpdauT0P9K5b2HDyV7O/wUZ9qVYUURYhHWgHdyDjgiJhGZ1tLFBO3KqpORe89
         /yUjyaOj/mDA6u0rZSSk+UwOECYq1vntPWaDOXxavazEd4drDNSeoE6ZHj9it5o96/bf
         2b9oBGSMOr4WJ9rkIuKRgbFsQxGjHUyzQAJDN1yPoTKCsjewtTznwAwUdCtiYsmPW5H8
         jjEQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to;
        bh=HCgDV691XfyCBftFqK2XSgc89QgiPPTyMkJ9Vk/H1a0=;
        b=S/tZPUwhEpMwhPis9IKci/2llPE3WvrujXP67C58luteLu+iWbZt2JCQKMmiEcS+Fn
         554WJQrAIC3jbL0s9g0p5Qs3/IDz76VJ4G1Vx5V2bGTG2FhCVSf013qNd/kbdiNefyb/
         0xy1n7pQCLL0zbhj1V+D5VxfuaKq+Qgps6SpH8oEpiIknFFAIUiIEW8DEv8YNYF29FFQ
         GjeOGTqljVcE8yxnZ6cjTRIzGoewFqA+7zpJxGP47/dKXGzfCHOIteXH//PXrQUO5hz1
         rI0peUJUVrJ6wfkpUorELX5PkK0jjQYpbrLghuKziwpIkIsGn/GUEWYDDC+NwEzOYWsz
         PDrA==
X-Gm-Message-State: ANhLgQ1XXGij2l3JT2DpWz6PcjkWcVG89T/VBErDiOly0Gd3AFWwwbgV
	GS+iyza6E1eyNoozbV5SkwcSQ1z3Imho5Y2qEQwiXTxnqO8=
X-Google-Smtp-Source: ADFU+vumWqAi194VvUIkhASzMg28Qggxgrz53E/WIpEt2wtZPKq0ZZq6+vcqnv0uMffSxdmUskA4wDc80yp28euohOM=
X-Received: by 2002:a6b:3c01:: with SMTP id k1mr9294952iob.120.1585251968301;
 Thu, 26 Mar 2020 12:46:08 -0700 (PDT)
MIME-Version: 1.0
References: <CAJEutN-k2ZSPP=17jPPrrFw=L3G00XZvMTfPBAfbHU42krHbDw@mail.gmail.com>
 <CAHTWJDPb=bVgaZ2N1qxWo9KaK_1ZsH+9VnGEdyM0_Ugsog+kgA@mail.gmail.com>
 <CAH1G5Zr5p-u0t-wyLCiUNySDsB792QKMkbpx0etQpE9rj43OJQ@mail.gmail.com>
 <CAHTWJDNcL2SyRMri9jvAc3EzOWX347iH7du25V9FXSwMiHnbvw@mail.gmail.com>
 <CAJEutN-uYNDf1ujbqLGi4a41+8auTFesLkyaw1xARk+vBkab6A@mail.gmail.com> <CAHTWJDOx_1ieKowUjeRz5J8ZNFhEGZJgTf+CZX103cP7otNk6w@mail.gmail.com>
In-Reply-To: <CAHTWJDOx_1ieKowUjeRz5J8ZNFhEGZJgTf+CZX103cP7otNk6w@mail.gmail.com>
From: Joe Evans <joseph.evans@gmail.com>
Date: Thu, 26 Mar 2020 12:45:57 -0700
Message-ID: <CAJEutN8ZDKkzoe0yFvawAN=zCo_8gKX_H8gakdpQcfOk4Oiyfw@mail.gmail.com>
Subject: Re: CI Pipeline Change Proposal
To: dev@mxnet.incubator.apache.org
Content-Type: multipart/alternative; boundary="00000000000073440d05a1c73cfe"

--00000000000073440d05a1c73cfe
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

The sanity-lint check pulls a Docker image cache, builds a new container,
and runs inside it. The Docker setup alone takes roughly 3 minutes, if not
more:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fsanity/detail/master/1764/pipeline/39

We could improve this by not having to build a new container every time.
Also, our CI containers are huge, so it takes a while to pull them down. I'm
sure we could reduce their size by being a bit more careful in building
them, too.

Joe
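
The staggered approach discussed in this thread (run sanity first, and
trigger the remaining jobs in parallel only if it passes) could be sketched
roughly as below. This is only an illustration: run_job is a hypothetical
callable standing in for "trigger a Jenkins job and wait for its result",
and the job names are the ones listed in the table quoted further down. It
is not the actual Jenkins pipeline configuration.

```python
from concurrent.futures import ThreadPoolExecutor

# Job names are taken from the thread; the split into a gate and
# downstream jobs is the proposal under discussion, not existing config.
GATE_JOBS = ["sanity"]
DOWNSTREAM_JOBS = [
    "unix-cpu", "unix-gpu", "centos-cpu", "centos-gpu", "clang",
    "edge", "miscellaneous", "website", "windows-cpu", "windows-gpu",
]


def run_staggered(run_job, gate_jobs=GATE_JOBS, downstream_jobs=DOWNSTREAM_JOBS):
    """Run gate jobs first; start downstream jobs only if every gate passes.

    run_job is a hypothetical callable: it takes a job name and returns
    True on success, False on failure.
    """
    results = {}
    for name in gate_jobs:
        results[name] = run_job(name)
        if not results[name]:
            # A gate failed: stop here, downstream jobs are never triggered.
            return results

    # All gates passed: run the remaining jobs in parallel, as today.
    workers = max(1, len(downstream_jobs))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = pool.map(run_job, downstream_jobs)
        results.update(zip(downstream_jobs, outcomes))
    return results
```

With a failing sanity job, only the "sanity" entry appears in the result,
which is where the saved builds would come from.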

On Thu, Mar 26, 2020 at 12:33 PM Marco de Abreu <marco.g.abreu@gmail.com>
wrote:

> Do you know what's driving the duration for sanity? It used to be 50 sec
> execution and 60 sec preparation.
>
> -Marco
>
> Joe Evans <joseph.evans@gmail.com> wrote on Thu, Mar 26, 2020, 20:31:
>
> > Thanks Marco and Aaron for your input.
> >
> > > Can you show by how much the duration will increase?
> >
> > The average sanity build time is around 10 minutes, while the average
> > build time for unix-cpu is about 2 hours, so the entire build pipeline
> > would take about 2 hours longer if we required both unix-cpu and sanity
> > (run in parallel) to complete before the remaining jobs start.
> >
> > I took a look at the CloudWatch metrics we're saving for Jenkins jobs.
> > Here is the failure rate per job, based on builds triggered by PRs in
> > the past year. As you can see, the sanity failure rate is still fairly
> > high, so gating on sanity alone would save a lot of unneeded build jobs.
> >
> > Job            Successful  Failed  Failure Rate
> > sanity               6900    2729        28.34%
> > unix-cpu             4268    4786        52.86%
> > unix-gpu             3686    5637        60.46%
> > centos-cpu           6777    2809        29.30%
> > centos-gpu           6318    3350        34.65%
> > clang                7879    1588        16.77%
> > edge                 7654    1933        20.16%
> > miscellaneous        8090    1510        15.73%
> > website              7226    2179        23.17%
> > windows-cpu          6084    3621        37.31%
> > windows-gpu          5191    4721        47.63%
> >
> > We can start by requiring only the sanity job to complete before
> > triggering the rest, and collect data to decide if it makes sense to
> > change it from there. Any objections to this approach?
> >
> > Thanks.
> > Joe
> >
> >
> > On Wed, Mar 25, 2020 at 9:35 AM Marco de Abreu <marco.g.abreu@gmail.com>
> > wrote:
> >
> > > Back then I created a system which exports all Jenkins results to
> > > CloudWatch. It does not include individual test results but rather
> > > stages and jobs. The data for the sanity check should be available
> > > there.
> > >
> > > Something I'd also be curious about is the percentage of failing jobs
> > > within one run. That is, if a commit failed, were there multiple jobs
> > > failing (indicating an error in the code) or only one or two
> > > (indicating flakiness)? This should give us a proper understanding of
> > > how unnecessary these runs really are.
> > >
> > > -Marco
> > >
> > > Aaron Markham <aaron.s.markham@gmail.com> wrote on Wed, Mar 25, 2020,
> > > 16:53:
> > >
> > > > +1 for sanity check - that's fast.
> > > > -1 for unix-cpu - that's slow and can just hang.
> > > >
> > > > So my suggestion would be to look at the data separately - what's
> > > > the failure rate on the sanity check and on unix-cpu? Actually, can
> > > > we get a table of all of the tests with this data?!
> > > > If the sanity check fails... let's say 20% of the time, but only
> > > > takes a couple of minutes, then ya, let's stack it and do that one
> > > > first.
> > > >
> > > > I think unix-cpu needs to be broken apart. It's too complex and
> > > > fails in multiple ways. Isolate the brittle parts. Then we can
> > > > restart/disable those as needed, while all of the other parts pass
> > > > and don't have to be rerun.
> > > >
> > > > On Wed, Mar 25, 2020 at 1:32 AM Marco de Abreu
> > > > <marco.g.abreu@gmail.com> wrote:
> > > > >
> > > > > We had this structure in the past and the community was bothered
> > > > > by CI taking more time, thus we moved to the current model with
> > > > > everything parallelized. We'd basically revert that then.
> > > > >
> > > > > Can you show by how much the duration will increase?
> > > > >
> > > > > Also, we have zero test parallelisation; that is, we are running
> > > > > one test at a time on 72-core machines (although with multiple
> > > > > workers). Wouldn't it be way more efficient to add parallelisation
> > > > > and thus heavily reduce the time spent on the tasks instead of
> > > > > staggering?
> > > > >
> > > > > I feel concerned that these measures to save cost are paid for in
> > > > > the form of a worse user experience. I see a big potential to save
> > > > > costs by increasing efficiency while actually improving the user
> > > > > experience due to CI being faster.
> > > > >
> > > > > -Marco
> > > > >
> > > > > Joe Evans <joseph.evans@gmail.com> wrote on Wed, Mar 25, 2020,
> > > > > 04:58:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > >
> > > > > > First, I just wanted to introduce myself to the MXNet community.
> > > > > > I'm Joe and will be working with Chai and the AWS team to improve
> > > > > > some issues around MXNet CI. One of our goals is to reduce the
> > > > > > costs associated with running MXNet CI. The task I'm working on
> > > > > > now is this issue:
> > > > > >
> > > > > >
> > > > > > https://github.com/apache/incubator-mxnet/issues/17802
> > > > > >
> > > > > >
> > > > > > Proposal: Staggered Jenkins CI pipeline
> > > > > >
> > > > > >
> > > > > > Based on data collected from Jenkins, around 55% of the time when
> > > > > > the mxnet-validation CI build is triggered by a PR, either the
> > > > > > sanity or unix-cpu build fails. When either of these builds
> > > > > > fails, it doesn't make sense to run the rest of the pipelines and
> > > > > > use all those resources if we've already identified a build or
> > > > > > unit test failure.
> > > > > >
> > > > > >
> > > > > > We are proposing changing the MXNet Jenkins CI pipeline by
> > > > > > requiring the *sanity* and *unix-cpu* builds to complete and pass
> > > > > > tests successfully before starting the other build pipelines
> > > > > > (centos-cpu/gpu, unix-gpu, windows-cpu/gpu, etc.). Once these
> > > > > > builds complete successfully, the remaining build pipelines will
> > > > > > be triggered and run in parallel (as they currently do). The
> > > > > > purpose of this change is to identify faulty code or
> > > > > > compatibility issues early and prevent further execution of CI
> > > > > > builds. This will increase the time required to test a PR, but
> > > > > > will prevent unnecessary builds from running.
> > > > > >
> > > > > >
> > > > > > Does anyone have any concerns with this change or suggestions?
> > > > > >
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > Joe Evans
> > > > > >
> > > > > > joseph.evans@gmail.com
> > > > > >
> > > >
> > >
> >
>
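
For reference, the failure rates in the table quoted above follow from
failed / (successful + failed). A minimal Python sketch that reproduces
them from the raw counts is below; the counts are copied from the table,
while the function and variable names are just illustrative.

```python
# (successful, failed) build counts per job, copied from the quoted table.
COUNTS = {
    "sanity": (6900, 2729),
    "unix-cpu": (4268, 4786),
    "unix-gpu": (3686, 5637),
    "centos-cpu": (6777, 2809),
    "centos-gpu": (6318, 3350),
    "clang": (7879, 1588),
    "edge": (7654, 1933),
    "miscellaneous": (8090, 1510),
    "website": (7226, 2179),
    "windows-cpu": (6084, 3621),
    "windows-gpu": (5191, 4721),
}


def failure_rate(successful, failed):
    """Fraction of builds that failed; 0.0 if there were no builds."""
    total = successful + failed
    return failed / total if total else 0.0


if __name__ == "__main__":
    # Print one line per job with the rate as a percentage.
    for job, (ok, bad) in COUNTS.items():
        print(f"{job:<14}{failure_rate(ok, bad):7.2%}")
```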

--00000000000073440d05a1c73cfe--