zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick Hunt <ph...@apache.org>
Subject Re: making CI a more pleasant experience
Date Mon, 04 May 2015 03:53:36 GMT
Hongchao recently identified a long standing issue in sync that originally
looked like it might be a flakey test. While flakeys are commonly a
test/host issue it's also important to review the issues carefully lest we
ignore real problems. I've found that reviewing the jenkins results from
past runs (I used to try and do the analysis manually every quarter or so)
would identify some low hanging fruit. I then would create jiras to
triage/fix. I've found that to be the most successful.

Sounds like we'd need to identify what the current pain points are, then
work up a solution. Introducing changes (such as the hadoop improvements
mentioned) might be for the good, but it might not really solve our core
problem(s).

Looking through a number of the recent precommit failures it seems like
real issues to me, e.g. the author was missing testing for the change. Some
are failed tests that might be flakey, I didn't do a full analysis. Perhaps
a tool to help with the analysis would be useful?

Trunk tests seem pretty stable - where I do see issues recently it's mostly
a single test that seems to be intermittently failing:
ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig
I see someone opened a jira then closed it, perhaps it should be re-opened:
https://issues.apache.org/jira/browse/ZOOKEEPER-2080

Patrick

On Sun, May 3, 2015 at 12:53 PM, Chris Nauroth <cnauroth@hortonworks.com>
wrote:

> Hi Raúl,
>
> Thanks for starting this thread.  Flaky CI is a big challenge for us in
> Hadoop too.  I can speak to some things that we've done to try to improve
> it.  We're still struggling, but some of these have helped improve the
> situation.
>
> We recently rewrote our test-patch.sh with a lot of nice new
> functionality.  (Credit goes to Allen Wittenauer.)
>
> https://issues.apache.org/jira/browse/HADOOP-11746
>
>
> The release notes field in the jira describes the new functionality.
> There is a lot to take in there, so here is a summary of my personal
> favorites:
>
> 1. It can run against any branch of the codebase, not just trunk, by
> following a simple naming convention that includes the branch name when
> uploading your patch.
> 2. It has some smarts to try to minimize execution time.  For example, if
> the patch only changes shell scripts or documentation, then it assumes
> there is no need to run JUnit tests.
> 3. It eliminated race conditions during concurrent test-patch.sh runs
> caused by storing state in shared local files.  The ZooKeeper
> test-patch.sh appears to be a fork of an older version of the Hadoop
> script, so maybe the ZooKeeper script is subject to similar race
> conditions.
> 4. It has some hooks for pluggability, making it easier to add more custom
> checks.
>
> We could explore porting HADOOP-11746 over to ZooKeeper.  It won't be as
> simple as copying it right over to the ZooKeeper repo, because some of the
> logic is specific to the Hadoop repo.
>
> We've also had trouble with flaky tests.  Unfortunately, these often
> require a ton of engineering time to fix.  The typical root causes I've
> seen are:
>
> 1. Tests start servers bound to hard-coded port numbers, so if multiple
> test runs execute concurrently, then one of them will get a bind
> exception.  The solution is always to bind to an ephemeral port in test
> code.
> 2. Tests do not do proper resource cleanup.  This can manifest as file
> descriptor leaks leading to hitting the open file descriptor limit, thread
> leaks, or 2 background threads trying to do the same job and interfering
> with one another.  File descriptor leaks are particularly nasty for test
> runs on Windows, where the default file locking behavior can prevent
> subsequent tests from using a working directory for test data.  The
> solution is to track these down and use try-finally, try-with-resources,
> JUnit @After etc. to ensure clean-up.
> 3. Tests are non-deterministic, such as by hard-coding a sleep time to
> wait for an asynchronous action to complete.  The solutions usually
> involve providing hooks into lower-layer logic, such as to receive a
> callback from the asynchronous action, so that the test can be
> deterministic.
> 4. Tests hard-code a file path separator to '/', which doesn't work
> correctly on Windows.  The code fixes for this usually are obvious once
> you spot the problem.
>
> It can take a long time to track these down and fix them.  To help people
> iterate faster, I've proposed another test-patch.sh enhancement that would
> allow the contributor to request that only specific tests are run on a
> patch.
>
> https://issues.apache.org/jira/browse/HADOOP-11895
>
> This would help engineers get quicker feedback, especially in the event
> that a test failure only repros on the Jenkins hosts.
>
> In my experience (again primarily on Hadoop), it's much more often that we
> see flaky tests rather than bad Jenkins hosts.  When there is a bad host,
> it's usually pretty obvious.  Each Jenkins run reports the host that ran
> the job, so we can identify a trend of a particular problem happening on a
> particular host.
>
> --Chris Nauroth
>
>
>
>
> On 5/3/15, 10:28 AM, "Raúl Gutiérrez Segalés" <rgs@itevenworks.net> wrote:
>
> >Hi all,
> >
> >This has probably come up before but do we have any thoughts on making CI
> >better? Is the problem jenkins? Is it flaky tests? Bad CI workers? All of
> >the above?
> >
> >I see we waste loads of time with trivial (or unrelated to the actual
> >failures) patches triggering failed builds all the time. I'd like to spend
> >some time improving our experience here, but would love some
> >pointers/thoughts.
> >
> >Ideas?
> >
> >
> >Cheers,
> >-rgs
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message