From: Apekshit Sharma
Date: Mon, 16 May 2016 16:38:35 -0700
Subject: Smart Flaky Handler
To: dev@hbase.apache.org

This mail introduces the work to tackle the flaky tests in our build.

Why is it important?
- Our build history sucks: the last 175 post-commit runs failed. We need to make it useful.
- To better understand our code’s testing status and, more importantly, its weak points.
- We know the 2-3 tests which keep failing every now and then, but not the ~10 nasty ones which fail something like 1 out of 50 times and screw up our build.
- This isn’t something that can be done manually on a daily basis. We need automation.

Changes made so far:
Code changes: HBASE-15839 (Umbrella issue)

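Jenkins changes:

[Diagram link: https://issues.apache.org/jira/secure/attachment/12804292/Screen%20Shot%202016-05-16%20at%204.02.46%20PM.png ]
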
(new job) HBase-Find-Flaky-Tests: Gets the test reports of recent builds of the post-commit job (TRUNK_matrix) and the HBase-Flaky-Tests job (see below) to find flaky tests. How often it runs determines how fast we catch test regressions. For example, if we run it every 4 hours, any test which started failing in the post-commit job (TRUNK_matrix) in the last 4 hours will be blacklisted.
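
To make the idea concrete, here is a minimal sketch of the kind of aggregation such a job could do. It is not the actual job's code: the input shape (one dict of test name to pass/fail per build), the function name `find_flaky_tests`, the `min_failures` threshold, and the test names are all assumptions made up for illustration.

```python
from collections import defaultdict


def find_flaky_tests(builds, min_failures=1):
    """Return {test_name: failure_rate} for tests that failed in recent builds.

    `builds` is a list with one dict per build, mapping a test name to
    True (passed) or False (failed). This input shape is hypothetical.
    """
    runs = defaultdict(int)
    failures = defaultdict(int)
    for results in builds:
        for test, passed in results.items():
            runs[test] += 1
            if not passed:
                failures[test] += 1
    # Any test with at least `min_failures` failures goes on the list.
    return {
        test: failures[test] / runs[test]
        for test in runs
        if failures[test] >= min_failures
    }


# Hypothetical results from three recent builds.
recent_builds = [
    {"TestExampleA": True, "TestExampleB": False, "TestExampleC": True},
    {"TestExampleA": True, "TestExampleB": True, "TestExampleC": True},
    {"TestExampleA": True, "TestExampleB": False},
]
flaky = find_flaky_tests(recent_builds)
```

With `min_failures=1` a single failure in the window is enough to list a test, which matches the "any test which started failing ... will be blacklisted" behaviour described above; a higher threshold would trade speed of detection for fewer false alarms.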

(new job) HBase-Flaky-Tests: This job runs only the flaky tests. The aim is to run it back-to-back to collect as many runs as we can. The higher the run rate, the better our system will be at catching flaky tests. We currently run it hourly, so we’ll be able to keep track of flaky tests with a failure rate of ~5% or more.
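
A rough way to see how run rate relates to the failure rates we can track: the mail does not say what window is behind the ~5% figure, so the sketch below only assumes that runs are independent and that a test fails with a fixed probability p on each run. Under that assumption the chance of seeing at least one failure in n runs is 1 - (1 - p)^n. The function names are made up for illustration.

```python
import math


def prob_at_least_one_failure(failure_rate, runs):
    """Chance that a test with the given per-run failure rate fails at least once."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, confidence):
    """Smallest number of runs that shows a failure with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


# With hourly runs, the number of runs equals the number of hours elapsed.
for hours in (24, 48, 72):
    print(hours, round(prob_at_least_one_failure(0.05, hours), 3))
```

Under these assumptions, a test with a 5% failure rate has roughly a 71% chance of failing at least once in 24 hourly runs, and it takes 59 runs to reach 95%. That is the sense in which a higher run rate catches rarer failures sooner.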

Post-commit (TRUNK_matrix) and pre-commit jobs: These now exclude the flaky tests.
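
As an illustration of the split between the regular jobs and the flaky job, here is a small sketch. The helper names, the test names, and the `**/Name.java` pattern format are assumptions for the example, not necessarily what the Jenkins jobs actually use.

```python
def apply_exclusions(all_tests, flaky_tests):
    """Split tests into those the regular jobs run and those left to the flaky job."""
    flaky = set(flaky_tests)
    regular = [t for t in all_tests if t not in flaky]
    flaky_only = [t for t in all_tests if t in flaky]
    return regular, flaky_only


def to_exclude_patterns(flaky_tests):
    """Render the flaky list as a comma-separated string of file patterns.

    The `**/Name.java` format is an assumed convention for this sketch.
    """
    return ",".join(f"**/{name}.java" for name in sorted(flaky_tests))


# Hypothetical test names.
all_tests = ["TestExampleA", "TestExampleB", "TestExampleC"]
regular, flaky_only = apply_exclusions(all_tests, {"TestExampleB"})
patterns = to_exclude_patterns({"TestExampleB", "TestExampleA"})
```

Keeping the list in one place and deriving both the exclusions and the flaky job's test set from it means a test is always run by exactly one of the two kinds of job.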


So what if a bad commit makes a good test bad?
Since the test is not on the flaky list, it’ll run in the next post-commit build and fail. The next run of HBase-Find-Flaky-Tests will pick it up and blacklist it. Blacklisting will help keep the post-commit job and, more importantly, the pre-commit job clean; keeping them clean is a problem we face quite often.

Are we just tucking away our shit?
Nope, this will help us:
- first, maintain a list of bad tests (we lack that today).
- second, make our build greener to the point that a failed/red build is something we worry about seriously.

Once we are confident that the system is working fine, we’ll set up the HBase-Find-Flaky-Tests job to send reports to dev@hbase so that devs know about the bad tests. If the list remains hidden somewhere in a Jenkins job’s archive, it’s unlikely that we’ll actively work on getting them fixed :).

I'll keep posting further updates on this thread.

-- Appy