From: Saikat Kanjilal
To: Armin Braun
CC: Kay Ousterhout, "dev@spark.apache.org"
Subject: Re: File JIRAs for all flaky test failures
Date: Wed, 15 Feb 2017 20:51:39 +0000
Mailing-List: contact dev-help@spark.apache.org; run by ezmlm

I would recommend we just open JIRAs for unit tests by module (core, ml, sql, etc.) and fix them one module at a time; this at least keeps the number of unit tests needing fixes down to a manageable number.

________________________________
From: Armin Braun <me@obrown.io>
Sent: Wednesday, February 15, 2017 12:48 PM
To: Saikat Kanjilal
Cc: Kay Ousterhout; dev@spark.apache.org
Subject: Re: File JIRAs for all flaky test failures

I think one thing that is contributing to this a lot too is the general issue of the tests taking up a lot of file descriptors (10k+ if I run them on a standard Debian machine).
There are a few suites that contribute to this in particular, like `org.apache.spark.ExecutorAllocationManagerSuite`, which, like a few others, appears to consume a lot of fds.

Wouldn't it make sense to open JIRAs about those and actively try to reduce the resource consumption of these tests?
Seems to me these can cause a lot of unpredictable behavior (making the reason for flaky tests hard to identify, especially when there are timeouts etc. involved), and they make it prohibitively expensive for many to test locally, imo.

On Wed, Feb 15, 2017 at 9:24 PM, Saikat Kanjilal <sxk1969@hotmail.com> wrote:

I was working on something to address this a while ago (https://issues.apache.org/jira/browse/SPARK-9487), but the difficulty of testing locally made things a lot more complicated to fix for each of the unit tests. Should we resurface this JIRA? I wholeheartedly agree with the flakiness assessment of the unit tests.

[SPARK-9487] Use the same num. worker threads in Scala ...
issues.apache.org
In Python we use `local[4]` for unit tests, while in Scala/Java we use `local[2]` and `local` for some unit tests in SQL, MLLib, and other components. If the ...

________________________________
From: Kay Ousterhout <kayousterhout@gmail.com>
Sent: Wednesday, February 15, 2017 12:10 PM
To: dev@spark.apache.org
Subject: File JIRAs for all flaky test failures

Hi all,

I've noticed the Spark tests getting increasingly flaky -- it seems more common than not now that the tests need to be re-run at least once on PRs before they pass. This is both annoying and problematic because it makes it harder to tell when a PR is introducing new flakiness.

To try to clean this up, I'd propose filing a JIRA *every time* Jenkins fails on a PR (for a reason unrelated to the PR). Just provide a quick description of the failure -- e.g., "Flaky test: DAGSchedulerSuite" or "Tests failed because 250m timeout expired" -- plus a link to the failed build, and include the "Tests" component. If there's already a JIRA for the issue, just comment with a link to the latest failure. I know folks don't always have time to track down why a test failed, but this is at least helpful to someone else who, later on, is trying to diagnose when the issue started in order to find the problematic code / test.

If this seems like too high overhead, feel free to suggest alternative ways to make the tests less flaky!

-Kay

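Note on the file-descriptor point raised in the thread: the messages do not say how the 10k+ figure was measured. As a minimal sketch, assuming a Linux machine where `/proc/self/fd` is available, the following shows one way to see how many descriptors a block of code leaves open. The names `open_fd_count` and `fd_growth` are hypothetical helpers made up for illustration; they are not part of Spark or its test harness, and Spark's JVM-based suites would need an equivalent measurement on the JVM side.

```python
import os
from contextlib import contextmanager

# Assumption: Linux-style procfs. This path does not exist on every OS.
FD_DIR = "/proc/self/fd"


def open_fd_count():
    """Return the number of file descriptors currently open in this process."""
    # Listing the directory briefly opens one extra descriptor, but it does so
    # on every call, so differences between two calls are unaffected.
    return len(os.listdir(FD_DIR))


@contextmanager
def fd_growth(label, report):
    """Record in report[label] how many descriptors the wrapped block left open."""
    before = open_fd_count()
    try:
        yield
    finally:
        report[label] = open_fd_count() - before


if __name__ == "__main__":
    report = {}
    with fd_growth("three_left_open", report):
        # Deliberately leave three descriptors open inside the block.
        held = [os.open(os.devnull, os.O_RDONLY) for _ in range(3)]
    for fd in held:
        os.close(fd)
    print(report)
```

Wrapping each suite (or each test) in something like `fd_growth` and sorting the resulting report would be one way to decide which suites deserve their own JIRA for resource consumption.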