zookeeper-dev mailing list archives

From Chris Nauroth <cnaur...@hortonworks.com>
Subject Re: making CI a more pleasant experience
Date Sun, 03 May 2015 19:53:21 GMT
Hi Raúl,

Thanks for starting this thread.  Flaky CI is a big challenge for us in
Hadoop too.  I can speak to some things that we've done to try to improve
it.  We're still struggling, but some of these have helped improve the
situation.

We recently rewrote our test-patch.sh with a lot of nice new
functionality.  (Credit goes to Allen Wittenauer.)

https://issues.apache.org/jira/browse/HADOOP-11746


The release notes field in the JIRA issue describes the new functionality.
There is a lot to take in there, so here is a summary of my personal
favorites:

1. It can run against any branch of the codebase, not just trunk, by
following a simple naming convention that includes the branch name when
uploading your patch.
2. It has some smarts to try to minimize execution time.  For example, if
the patch only changes shell scripts or documentation, then it assumes
there is no need to run JUnit tests.
3. It eliminated race conditions during concurrent test-patch.sh runs
caused by storing state in shared local files.  The ZooKeeper
test-patch.sh appears to be a fork of an older version of the Hadoop
script, so maybe the ZooKeeper script is subject to similar race
conditions.
4. It has some hooks for pluggability, making it easier to add more custom
checks.

We could explore porting HADOOP-11746 over to ZooKeeper.  It won't be as
simple as copying it right over to the ZooKeeper repo, because some of the
logic is specific to the Hadoop repo.

We've also had trouble with flaky tests.  Unfortunately, these often
require a ton of engineering time to fix.  The typical root causes I've
seen are:

1. Tests start servers bound to hard-coded port numbers, so if multiple
test runs execute concurrently, then one of them will get a bind
exception.  The solution is always to bind to an ephemeral port in test
code.
2. Tests do not do proper resource cleanup.  This can manifest as file
descriptor leaks that eventually hit the open file descriptor limit, thread
leaks, or two background threads trying to do the same job and interfering
with one another.  File descriptor leaks are particularly nasty for test
runs on Windows, where the default file locking behavior can prevent
subsequent tests from using a working directory for test data.  The
solution is to track these down and use try-finally, try-with-resources,
JUnit @After, etc. to ensure cleanup.
3. Tests are non-deterministic, for example because they hard-code a sleep
time to wait for an asynchronous action to complete.  The solutions usually
involve providing hooks into lower-layer logic, such as a callback from the
asynchronous action, so that the test can be deterministic.
4. Tests hard-code a file path separator to '/', which doesn't work
correctly on Windows.  The code fixes for this usually are obvious once
you spot the problem.
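
As a rough illustration of the four fixes above, here is a minimal Java
sketch.  It is not code from Hadoop or ZooKeeper: the class
FlakyTestFixes and its methods startTestServer, startAsyncWork, and
testDataPath are hypothetical names made up for this example, and a real
test would apply the same ideas inside JUnit setup and teardown methods.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class for illustration; not part of Hadoop or ZooKeeper.
class FlakyTestFixes {

    // Cause 1: bind to port 0 so the OS picks a free ephemeral port.
    static ServerSocket startTestServer() throws IOException {
        return new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
    }

    // Cause 3: the asynchronous action reports completion through a
    // callback, so the caller never has to guess a sleep time.
    static Thread startAsyncWork(Runnable onDone) {
        Thread worker = new Thread(() -> {
            // ... simulated background work would go here ...
            onDone.run();
        });
        worker.start();
        return worker;
    }

    // Cause 4: let java.nio join path elements with the platform separator
    // instead of concatenating strings with '/'.
    static Path testDataPath(Path baseDir, String... parts) {
        Path result = baseDir;
        for (String part : parts) {
            result = result.resolve(part);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Cause 2: try-with-resources closes both sockets even if the body throws.
        try (ServerSocket first = startTestServer();
             ServerSocket second = startTestServer()) {
            System.out.println("Bound to ports " + first.getLocalPort()
                + " and " + second.getLocalPort());
        }

        // Wait for the callback with a generous upper bound, not a fixed sleep.
        CountDownLatch done = new CountDownLatch(1);
        Thread worker = startAsyncWork(done::countDown);
        if (!done.await(10, TimeUnit.SECONDS)) {
            throw new AssertionError("async work did not finish in time");
        }
        worker.join(); // do not leak the thread into the next test

        // Use a private temp directory and always clean it up afterwards.
        Path tempDir = Files.createTempDirectory("flaky-test-demo");
        Path dataFile = testDataPath(tempDir, "data", "input.txt");
        try {
            Files.createDirectories(dataFile.getParent());
            Files.write(dataFile, "hello".getBytes(StandardCharsets.UTF_8));
            System.out.println("Wrote " + dataFile);
        } finally {
            Files.deleteIfExists(dataFile);
            Files.deleteIfExists(dataFile.getParent());
            Files.deleteIfExists(tempDir);
        }
    }
}
```

The timeout on the latch is only a safety net for a hung worker.  In the
normal case the test proceeds as soon as the callback fires, which is what
makes it deterministic.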

It can take a long time to track these down and fix them.  To help people
iterate faster, I've proposed another test-patch.sh enhancement that would
allow the contributor to request that only specific tests are run on a
patch.

https://issues.apache.org/jira/browse/HADOOP-11895

This would help engineers get quicker feedback, especially when a test
failure only reproduces on the Jenkins hosts.

In my experience (again primarily on Hadoop), we see flaky tests much more
often than bad Jenkins hosts.  When there is a bad host,
it's usually pretty obvious.  Each Jenkins run reports the host that ran
the job, so we can identify a trend of a particular problem happening on a
particular host.

--Chris Nauroth




On 5/3/15, 10:28 AM, "Raúl Gutiérrez Segalés" <rgs@itevenworks.net> wrote:

>Hi all,
>
>This has probably come up before but do we have any thoughts on making CI
>better? Is the problem jenkins? Is it flaky tests? Bad CI workers? All of
>the above?
>
>I see we waste loads of time with trivial (or unrelated to the actual
>failures) patches triggering failed builds all the time. I'd like to spend
>some time improving our experience here, but would love some
>pointers/thoughts.
>
>Ideas?
>
>
>Cheers,
>-rgs

