From: Jacob Barrett
Date: Mon, 9 Jul 2018 07:19:59 -0700
To: dev@geode.apache.org
Subject: Re: [DISCUSS] When is a test not flaky anymore?
Message-Id: <0E02228F-64CE-451A-8192-2DAAC74E3E30@pivotal.io>

+1, and the same should go for @Ignore annotations as well.

> On Jul 6, 2018, at 11:10 AM, Alexander Murmann wrote:
>
> +1 for fixing immediately.
>
> Since Dan is already trying to shake out more brittleness, this seems to be
> the right time to get rid of the flaky label. Let's just treat all tests the
> same and fix them.
>
>> On Fri, Jul 6, 2018 at 9:31 AM, Kirk Lund wrote:
>>
>> I should add that I'm only in favor of deleting the category if we have a
>> new policy that any failure means we have to fix the test and/or product
>> code, even if you think the failure is in a test that you or your team is
>> not responsible for. That's no excuse to ignore a failure in your private
>> precheckin.
>>
>>> On Fri, Jul 6, 2018 at 9:29 AM, Dale Emery wrote:
>>>
>>> The pattern I've seen in lots of other organizations: when a few tests
>>> intermittently give different answers, people attribute the intermittence
>>> to the tests, quickly lose trust in the entire suite, and increasingly
>>> discount failures.
>>>
>>> If we're going to attend to every failure in the larger suite, then we
>>> won't suffer that fate, and I'm in favor of deleting the Flaky tag.
>>>
>>> Dale
>>>
>>>> On Jul 5, 2018, at 8:15 PM, Dan Smith wrote:
>>>>
>>>> Honestly, I've never liked the flaky category. What it means is that at
>>>> some point in the past, we decided to put off tracking down and fixing a
>>>> failure, and now we're left with a bug number and a description, and
>>>> that's it.
>>>>
>>>> I think we will be better off if we just get rid of the flaky category
>>>> entirely. That way no one can label anything else as flaky and push it
>>>> off for later, and if flaky tests fail again we will actually prioritize
>>>> and fix them instead of ignoring them.
>>>>
>>>> I think Patrick was looking at rerunning the flaky tests to see what is
>>>> still failing. How about we just run the whole flaky suite some number
>>>> of times (100?), fix whatever is still failing, and close out and remove
>>>> the category from the rest?
>>>>
>>>> I think we will get more benefit from shaking out and fixing the issues
>>>> we have in the current codebase than we will from carefully explaining
>>>> the flaky failures of the past.
>>>>
>>>> -Dan
>>>>
>>>>> On Thu, Jul 5, 2018 at 7:03 PM, Dale Emery wrote:
>>>>>
>>>>> Hi Alexander and all,
>>>>>
>>>>>> On Jul 5, 2018, at 5:11 PM, Alexander Murmann wrote:
>>>>>>
>>>>>> Hi everyone!
>>>>>>
>>>>>> Dan Smith started a discussion about shaking out more flaky DUnit
>>>>>> tests. That's a great effort and I am happy it's happening.
>>>>>>
>>>>>> As a corollary to that conversation, I wonder what the criteria should
>>>>>> be for a test to no longer be considered flaky and have the category
>>>>>> removed.
>>>>>>
>>>>>> In general the bar should be fairly high. Even if a test only fails ~1
>>>>>> in 500 runs, that's still a problem given how many tests we have.
>>>>>>
>>>>>> I see two ends of the spectrum:
>>>>>> 1. We have a good understanding of why the test was flaky and think we
>>>>>> fixed it.
>>>>>> 2. We have a hard time reproducing the flaky behavior and have no good
>>>>>> theory as to why the test might have shown flaky behavior.
>>>>>>
>>>>>> In the first case I'd suggest running the test ~100 times to get a
>>>>>> little more confidence that we fixed the flaky behavior, and then
>>>>>> removing the category.
>>>>>
>>>>> Here's a test for case 1:
>>>>>
>>>>> If we really understand why it was flaky, we will be able to:
>>>>> - Identify the "faults": the broken places in the code (whether system
>>>>> code or test code).
>>>>> - Identify the exact conditions under which those faults led to the
>>>>> failures we observed.
>>>>> - Explain how those faults, under those conditions, led to those
>>>>> failures.
>>>>> - Run unit tests that exercise the code under those same conditions,
>>>>> and demonstrate that the formerly broken code now does the right thing.
>>>>>
>>>>> If we're lacking any of these things, I'd say we're dealing with case 2.
>>>>>
>>>>>> The second case is a lot more problematic. How often do we want to run
>>>>>> a test like that before we decide that it might have been fixed since
>>>>>> we last saw it happen? Anything else we could/should do to verify that
>>>>>> the test deserves our trust again?
>>>>>
>>>>> I would want a clear, compelling explanation of the failures we
>>>>> observed.
>>>>>
>>>>> Clear and compelling are subjective, of course. For me, clear and
>>>>> compelling would include descriptions of:
>>>>> - The faults in the code: what, specifically, was broken.
>>>>> - The specific conditions under which the code did the wrong thing.
>>>>> - How those faults, under those conditions, led to those failures.
>>>>> - How the fix either prevents those conditions or causes the formerly
>>>>> broken code to now do the right thing.
>>>>>
>>>>> Even if we don't have all of these elements, we may have some of them.
>>>>> That can help us calibrate our confidence. But the elements work
>>>>> together. If we're lacking one, the others are shaky, to some extent.
>>>>>
>>>>> The more elements are missing in our explanation, the more times I'd
>>>>> want to run the test before trusting it.
>>>>>
>>>>> Cheers,
>>>>> Dale
>>>>>
>>>>> --
>>>>> Dale Emery
>>>>> demery@pivotal.io
>>>
>>> --
>>> Dale Emery
>>> demery@pivotal.io
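[Editor's note] The "rerun ~100 times and count what still fails" idea discussed above can be sketched as a generic shell loop. This is a hedged illustration only: `run_one` is a hypothetical stand-in for whatever actually invokes one test run (for Geode that would be something like a single Gradle test execution), not a real project script.

```shell
#!/bin/sh
# Rerun a test N times and count failures, per the thread's suggestion.
# 'run_one' is a hypothetical placeholder for the real test invocation;
# here it simply succeeds so the sketch is self-contained.
run_one() { true; }

runs=100
failures=0
for i in $(seq 1 "$runs"); do
  # Count every non-zero exit as one failure.
  run_one || failures=$((failures + 1))
done
echo "failures: $failures / $runs"
```

Any nonzero failure count over the chosen number of runs would keep the test in the "still flaky" bucket; a clean sweep is the (necessary, not sufficient) evidence for removing the category.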