spark-dev mailing list archives

From Reynold Xin <>
Subject Re: File JIRAs for all flaky test failures
Date Thu, 16 Feb 2017 17:22:00 GMT
Josh's tool should give enough signal there already. I don't think we need
some manual process to document them. If you want to work on those that'd
be great. I bet you will get a lot of love because all developers hate
flaky tests.

On Thu, Feb 16, 2017 at 6:19 PM, Saikat Kanjilal <>

> I am specifically suggesting documenting a list of the flaky tests and
> fixing them, that's all.  To organize the effort I suggested tackling this
> by module.  Your second sentence is what I was trying to gauge from the
> community before putting any more effort into this.
> ------------------------------
> *From:* Sean Owen <>
> *Sent:* Thursday, February 16, 2017 8:45 AM
> *To:* Saikat Kanjilal;
> *Subject:* Re: File JIRAs for all flaky test failures
> I'm not sure what you're specifically suggesting. Of course flaky tests
> are bad and they should be fixed, and people do. Yes, some are pretty hard
> to fix because they are rarely reproducible if at all. If you want to fix,
> fix; there's nothing more to it.
> I don't perceive flaky tests to be a significant problem. It has gone from
> bad to occasional over the past year in my anecdotal experience.
> On Thu, Feb 16, 2017 at 4:26 PM Saikat Kanjilal <>
> wrote:
>> I'd just like to follow up again on this thread: should we devote some
>> energy to fixing unit tests by module? There wasn't much interest in
>> this last time, but given the nature of this thread I'd be willing to deep
>> dive into this again with some help.
>> ------------------------------
>> *From:* Saikat Kanjilal <>
>> *Sent:* Wednesday, February 15, 2017 6:12 PM
>> *To:* Josh Rosen
>> *Cc:* Armin Braun; Kay Ousterhout;
>> *Subject:* Re: File JIRAs for all flaky test failures
>> The issue was not a lack of tooling. I used the URL you describe
>> below to drill down to the exact test failure and stack trace. The
>> problem was that my builds would work like a charm locally but fail with
>> these errors on Jenkins. That was the whole challenge in fixing the unit
>> tests: I was rarely, if ever, able to replicate the test
>> failures locally.
>> On Feb 15, 2017, at 5:40 PM, Josh Rosen <> wrote:
>> A useful tool for investigating test flakiness is my Jenkins Test
>> Explorer service, running at
>> This has some useful timeline views for debugging flaky builds. For
>> instance, at
>> test-maven-hadoop-2.6 (may be slow to load) you can see this chart:
>> Here, each column represents a test run
>> and each row represents a test which failed at least once over the
>> displayed time period.
>> In that linked example screenshot you'll notice that a few columns have
>> grey squares indicating that tests were skipped but lack any red squares to
>> indicate test failures. This usually indicates that the build failed due to
>> a problem other than an individual test failure. For example, I clicked
>> into one of those builds and found that one test suite failed in test setup
>> because the previous suite had not properly cleaned up its SparkContext
>> (I'll file a JIRA for this).
>> You can click through the interface to drill down to reports on
>> individual builds, tests, suites, etc. As an example of an individual
>> test's detail page,
>> suite_name=org.apache.spark.rdd.LocalCheckpointSuite&test_
>> name=missing+checkpoint+block+fails+with+informative+message shows the
>> patterns of flakiness in a streaming checkpoint test.
>> Finally, there's an experimental "interesting new test failures" report
>> which tries to surface tests which have started failing very recently:
>> Specifically, entries
>> in this feed are test failures which a) occurred in the last week, b) were
>> not part of a build which had 20 or more failed tests, c) were not
>> observed to fail during the previous week (i.e. no failures from [2
>> weeks ago, 1 week ago)), and d) represent the first time that the
>> test failed this week (i.e. a test case will appear at most once in the
>> results list). I've also exposed this as an RSS feed at
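
The four criteria above can be sketched as a small filter. This is a hypothetical illustration, not the Jenkins Test Explorer's actual code: the `Failure` record, its field names, and the function name are all assumptions. One point the description leaves open is the order of (b) and (d); the sketch assumes failures from mass-failure builds are dropped first, and the earliest remaining failure per test is kept.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Failure:
    # Hypothetical record; the real service's schema is not shown in the thread.
    test: str       # test name
    build: str      # identifier of the build in which the test failed
    when: datetime  # time of the failing run


def interesting_new_failures(failures, now, mass_failure_threshold=20):
    """Return at most one failure per test, following criteria (a) to (d)."""
    week_ago = now - timedelta(weeks=1)
    two_weeks_ago = now - timedelta(weeks=2)

    # Number of failed tests in each build, used for criterion (b).
    failed_per_build = Counter(f.build for f in failures)

    # Tests that failed in [2 weeks ago, 1 week ago), used for criterion (c).
    failed_previous_week = {
        f.test for f in failures if two_weeks_ago <= f.when < week_ago
    }

    result = []
    already_reported = set()
    for f in sorted(failures, key=lambda f: f.when):
        if not (week_ago <= f.when <= now):
            continue  # (a) must have occurred in the last week
        if failed_per_build[f.build] >= mass_failure_threshold:
            continue  # (b) skip builds with 20 or more failed tests
        if f.test in failed_previous_week:
            continue  # (c) skip tests that also failed the week before
        if f.test in already_reported:
            continue  # (d) report only the first failure this week
        already_reported.add(f.test)
        result.append(f)
    return result
```

The result is ordered by failure time, oldest first. A test that failed more than two weeks ago but not in the previous week still counts as "new" under this reading of criterion (c).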
>> On Wed, Feb 15, 2017 at 12:51 PM Saikat Kanjilal <>
>> wrote:
>> I would recommend we just open JIRAs for unit tests by module
>> (core/ml/sql etc.) and fix them one module at a time. This at least keeps
>> the number of unit tests needing fixes down to a manageable number.
>> ------------------------------
>> *From:* Armin Braun <>
>> *Sent:* Wednesday, February 15, 2017 12:48 PM
>> *To:* Saikat Kanjilal
>> *Cc:* Kay Ousterhout;
>> *Subject:* Re: File JIRAs for all flaky test failures
>> I think one thing that is contributing to this a lot too is the general
>> issue of the tests taking up a lot of file descriptors (10k+ if I run them
>> on a standard Debian machine).
>> There are a few suites that contribute to this in particular, like
>> `org.apache.spark.ExecutorAllocationManagerSuite` which, like a few
>> others, appears to consume a lot of fds.
>> Wouldn't it make sense to open JIRAs about those and actively try to
>> reduce the resource consumption of these tests?
>> Seems to me these can cause a lot of unpredictable behavior (making the
>> reason for flaky tests hard to identify especially when there are timeouts
>> etc. involved) + they make it prohibitively expensive for many to test
>> locally imo.
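
One way to see how many file descriptors a piece of code leaves open is to count them before and after it runs. The sketch below is a generic, hypothetical illustration and not part of Spark's test tooling: the function names are made up, it measures the current Python process only, and it assumes a Unix-like system that exposes `/proc/self/fd` or `/dev/fd`. For a JVM test run, the same idea would have to be applied to the JVM's own process.

```python
import os


def count_open_fds():
    """Count the file descriptors currently open in this process.

    Assumes /proc/self/fd (Linux) or /dev/fd (macOS, BSD) is available.
    """
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            # Listing the directory itself briefly uses a descriptor, so the
            # absolute count may be off by one; the difference between two
            # calls is what matters here.
            return len(os.listdir(fd_dir))
    raise OSError("no per-process fd directory found on this platform")


def fds_left_open_by(action):
    """Run action() and return how many more fds are open afterwards."""
    before = count_open_fds()
    action()
    return count_open_fds() - before
```

Wrapping a suite's setup and teardown in a check like `fds_left_open_by` would show whether descriptors accumulate from one suite to the next.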
>> On Wed, Feb 15, 2017 at 9:24 PM, Saikat Kanjilal <>
>> wrote:
>> I was working on something to address this a while ago
>> but the difficulty in
>> testing locally made things a lot more complicated to fix for each of the
>> unit tests. Should we resurface this JIRA? I wholeheartedly
>> agree with the flakiness assessment of the unit tests.
>> [SPARK-9487] Use the same num. worker threads in Scala ...
>> In Python we use `local[4]` for unit tests, while in Scala/Java we use
>> `local[2]` and `local` for some unit tests in SQL, MLLib, and other
>> components. If the ...
>> ------------------------------
>> *From:* Kay Ousterhout <>
>> *Sent:* Wednesday, February 15, 2017 12:10 PM
>> *To:*
>> *Subject:* File JIRAs for all flaky test failures
>> Hi all,
>> I've noticed the Spark tests getting increasingly flaky -- it seems more
>> common than not now that the tests need to be re-run at least once on PRs
>> before they pass.  This is both annoying and problematic because it makes
>> it harder to tell when a PR is introducing new flakiness.
>> To try to clean this up, I'd propose filing a JIRA *every time* Jenkins
>> fails on a PR (for a reason unrelated to the PR).  Just provide a quick
>> description of the failure -- e.g., "Flaky test: DagSchedulerSuite" or
>> "Tests failed because 250m timeout expired", a link to the failed build,
>> and include the "Tests" component.  If there's already a JIRA for the
>> issue, just comment with a link to the latest failure.  I know folks don't
>> always have time to track down why a test failed, but this is at least
>> helpful to someone else who, later on, is trying to diagnose when the issue
>> started to find the problematic code / test.
>> If this seems like too high overhead, feel free to suggest alternative
>> ways to make the tests less flaky!
>> -Kay
