cassandra-commits mailing list archives

From "Jim Witschey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10730) periodic timeout errors in dtest
Date Tue, 24 Nov 2015 17:35:11 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024925#comment-15024925
] 

Jim Witschey commented on CASSANDRA-10730:
------------------------------------------

Good idea, thanks. I've added {{netstat}} debugging on this dtest branch:

https://github.com/mambocab/cassandra-dtest/tree/improve-timeout-debugging

Which I'm running on CassCI here:

http://cassci.datastax.com/view/Dev/view/mambocab/job/mambocab-cassandra-3.0-dtest/3/

Will merge if it runs a couple times without breaking anything. I've also scraped and analyzed
data on CassCI runs containing these connection timeout failures here:

http://nbviewer.ipython.org/github/mambocab/scrape_jenkins/blob/master/analysis.ipynb

High-level summary:

* We very rarely see timeout-related failures pre-3.0, so this issue is unlikely to be entirely
environmental. There are 7 failures on {{cassandra-2.2}} and none on 2.1.
* On 3.0+, most jobs (over 50%) contain 0 failures resulting from this timeout.
* On 3.0+, jobs with any connection timeout failures contain at least 20. This indicates that
the problem is at least partly environmental.
* Some tests are more prone to failures than others -- most tests have failed this way 0 times,
while others have done so as many as 7 or 8 times. The modules containing the worst offenders
seem to be {{thrift_tests}}, {{auth_roles_test}}, and {{auth_test}}.
* Looking at a few builds that contain these failures, it looks like the tests that fail this
way are localized to a given ec2 worker. This is another indication that part of the problem
is environmental. I'm not as confident about this conclusion as I am about the others, however.
The way that tests are distributed among the workers makes all the {{auth_roles}} tests get
distributed to the same worker. Those tests may just be particularly prone to timeouts (they
do seem to be), and grouping them together makes it appear that failures are localized to
particular workers. However, there are runs where timeout failures are distributed across
a small number of workers, including failures in tests that are not {{auth_roles}} tests,
so this conclusion does still have some support.
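
The per-job, per-module, and per-worker tallies behind the summary above can be sketched roughly as follows. This is a minimal illustration, not the code in the linked notebook: the record layout (the {{build}}, {{module}}, {{worker}}, and {{error}} keys) and the function name are hypothetical, and the only detail taken from the ticket is the error string used to recognize a connection timeout.

```python
from collections import Counter, defaultdict

# Substring of the error quoted in the ticket; used to spot timeout failures.
TIMEOUT_MARKER = "Unable to connect to any servers"


def summarize_timeouts(records):
    """Tally connection-timeout failures from scraped test results.

    `records` is an iterable of dicts with hypothetical keys:
    'build', 'module', 'worker', and 'error' (the failure message).
    """
    per_build = Counter()                 # timeout failures per CI build
    per_module = Counter()                # timeout failures per test module
    workers_per_build = defaultdict(set)  # workers that saw a timeout, per build

    for rec in records:
        if TIMEOUT_MARKER not in rec["error"]:
            continue  # some other kind of failure; ignore it
        per_build[rec["build"]] += 1
        per_module[rec["module"]] += 1
        workers_per_build[rec["build"]].add(rec["worker"])

    return per_build, per_module, workers_per_build
```

If the failures in a build really are localized, {{workers_per_build}} for that build should hold only a small number of workers even when {{per_build}} is large. As noted above, that signal is confounded when a failure-prone module always lands on the same worker.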

> periodic timeout errors in dtest
> --------------------------------
>
>                 Key: CASSANDRA-10730
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10730
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jim Witschey
>            Assignee: Jim Witschey
>
> Dtests often fail with connection timeout errors. For example:
> http://cassci.datastax.com/job/cassandra-3.1_dtest/lastCompletedBuild/testReport/upgrade_tests.cql_tests/TestCQLNodes3RF3/deletion_test/
> {code}
> ('Unable to connect to any servers', {'127.0.0.1': OperationTimedOut('errors=Timed out creating connection (10 seconds), last_host=None',)})
> {code}
> We've merged a PR to increase timeouts:
> https://github.com/riptano/cassandra-dtest/pull/663
> It doesn't look like this has improved things:
> http://cassci.datastax.com/view/cassandra-3.0/job/cassandra-3.0_dtest/363/testReport/
> Next steps here are
> * to scrape Jenkins history to see if and how the number of tests failing this way has
> increased (it feels like it has). From there we can bisect over the dtests, ccm, or C*, depending
> on what looks like the source of the problem.
> * to better instrument the dtest/ccm/C* startup process to see why the nodes start but
> don't successfully make the CQL port available.
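
One way to approach the instrumentation step in the description is to poll the node's CQL port and record each attempt, so a failing run shows when (or whether) the port opened. The sketch below is an assumption-laden illustration and is not taken from dtest or ccm: the function name and parameters are hypothetical, and a successful TCP connect only shows that something is listening, not that the node can complete a CQL handshake.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.5):
    """Poll host:port until a TCP connect succeeds or `timeout` elapses.

    Returns (opened, attempts), where `attempts` is a list of
    (monotonic_timestamp, outcome) pairs for later debugging output.
    """
    deadline = time.monotonic() + timeout
    attempts = []
    while time.monotonic() < deadline:
        started = time.monotonic()
        try:
            # Closing the socket immediately is fine; we only test reachability.
            with socket.create_connection((host, port), timeout=interval):
                attempts.append((started, "open"))
                return True, attempts
        except OSError as exc:  # refused, timed out, unreachable, ...
            attempts.append((started, repr(exc)))
            time.sleep(interval)
    return False, attempts
```

For example, {{wait_for_port('127.0.0.1', 9042)}} would probe Cassandra's default native transport port, assuming the test cluster has not overridden it. The 10-second default mirrors the timeout in the error quoted above.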



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
