spark-dev mailing list archives

From: Saikat Kanjilal
Subject: Re: File JIRAs for all flaky test failures
Date: Wed, 15 Feb 2017 20:51:39 GMT

I would recommend we just open JIRAs for unit tests based on module (core/ml/sql etc.) and
fix them one module at a time; this at least keeps the number of unit tests needing fixes
down to a manageable number.

From: Armin Braun
Sent: Wednesday, February 15, 2017 12:48 PM
To: Saikat Kanjilal
Cc: Kay Ousterhout
Subject: Re: File JIRAs for all flaky test failures

I think one thing that contributes a lot to this is the general issue of the tests
taking up a lot of file descriptors (10k+ if I run them on a standard Debian machine).
There are a few suites that contribute to this in particular, like `org.apache.spark.ExecutorAllocationManagerSuite`,
which, like a few others, appears to consume a lot of fds.

Wouldn't it make sense to open JIRAs about those and actively try to reduce the resource consumption
of these tests?
Seems to me these can cause a lot of unpredictable behavior (making the reason for flaky tests
hard to identify, especially when there are timeouts etc. involved), and they make it prohibitively
expensive for many to test locally, imo.
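
As a rough illustration of how the file descriptor usage described above could be measured, here is a minimal sketch in Python. It is not part of Spark's test tooling. The helper names `count_open_fds` and `fds_opened_by` are hypothetical, and the approach assumes a Linux-style `/proc/self/fd` directory (with `/dev/fd` as a fallback on other Unix systems). It only sees descriptors of the current process, so measuring a JVM test suite would need the same idea applied to that process instead.

```python
import os


def count_open_fds():
    """Return the number of file descriptors currently open in this process.

    Assumes /proc/self/fd (Linux) or /dev/fd (other Unix systems) is available.
    Listing the directory briefly opens one extra descriptor, so the absolute
    number may be off by one; differences between two calls are unaffected.
    """
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    return len(os.listdir(fd_dir))


def fds_opened_by(fn):
    """Run fn() and return how many descriptors it left open afterwards."""
    before = count_open_fds()
    fn()
    return count_open_fds() - before
```

Wrapping a test body in `fds_opened_by` and logging the result is one way to spot suites that leave descriptors open. A positive value after the test has finished suggests a leak rather than just high peak usage.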

On Wed, Feb 15, 2017 at 9:24 PM, Saikat Kanjilal wrote:

I was working on something to address this a while ago,
but the difficulty of testing locally made it a lot more complicated to fix each of
the unit tests. Should we resurface this JIRA? I wholeheartedly agree with the
flakiness assessment of the unit tests.

[SPARK-9487] Use the same num. worker threads in Scala ...
In Python we use `local[4]` for unit tests, while in Scala/Java we use `local[2]` and `local`
for some unit tests in SQL, MLLib, and other components. If the ...

From: Kay Ousterhout
Sent: Wednesday, February 15, 2017 12:10 PM
Subject: File JIRAs for all flaky test failures

Hi all,

I've noticed the Spark tests getting increasingly flaky -- it seems more common than not now
that the tests need to be re-run at least once on PRs before they pass.  This is both annoying
and problematic because it makes it harder to tell when a PR is introducing new flakiness.

To try to clean this up, I'd propose filing a JIRA *every time* Jenkins fails on a PR (for
a reason unrelated to the PR).  Just provide a quick description of the failure -- e.g., "Flaky
test: DAGSchedulerSuite" or "Tests failed because 250m timeout expired", a link to the failed
build, and include the "Tests" component.  If there's already a JIRA for the issue, just comment
with a link to the latest failure.  I know folks don't always have time to track down why
a test failed, but this is at least helpful to someone else who, later on, is trying to diagnose
when the issue started to find the problematic code / test.

If this seems like too high overhead, feel free to suggest alternative ways to make the tests
less flaky!

