cassandra-commits mailing list archives

From "Jim Witschey (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-10730) periodic timeout errors in dtest
Date Tue, 01 Dec 2015 19:30:10 GMT


Jim Witschey commented on CASSANDRA-10730:

I've run the dtests in a couple of diagnostic ways here. First off: I've run the normal cassandra-3.0
dtest job on m3.2xlarge instances instead of m3.xlarge instances since last Wednesday:

Since then, I haven't seen any connection timeouts on that job. There's no guarantee that
this will continue to hold, but going from 4 vCPUs/15 GiB to 8 vCPUs/30 GiB has prevented
timeouts so far.

I've also got a custom dtest branch that prints debug information when an attempt to connect
times out. The branch is here:

And here's an example of the tests running and producing that debug output:

This is one of the tests that times out:

The output that indicates it timed out is "local variable 'session' referenced before assignment"
rather than the usual timeout output because of a bug in my debugging code. I believe the
output collected is still useful.
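
For context, that message is what Python reports as an UnboundLocalError when a function reads a local name that was never bound. Below is a minimal sketch of how debugging code wrapped around a connection attempt can produce it, and one way to avoid it. The names buggy_connect, fixed_connect, connect, and collect_debug are hypothetical, chosen for illustration; this is not the code in the branch.

```python
def buggy_connect(connect, collect_debug):
    """Hypothetical wrapper: swallows the connection error, then reads `session`."""
    try:
        session = connect()
    except Exception:
        # Debug output is collected, but the original exception is dropped.
        collect_debug()
    # If connect() raised, `session` was never bound, so this line raises
    # UnboundLocalError (the exact wording varies by Python version).
    return session


def fixed_connect(connect, collect_debug):
    """Hypothetical fix: collect debug output, then re-raise the real error."""
    try:
        session = connect()
    except Exception:
        collect_debug()
        raise  # preserve the original timeout error for the test report
    return session
```

With the second form, the test report would show the usual timeout error while the debug output is still collected.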

One pattern I've found is that, after the timed-out connections, there's always a Java process
owned by the automaton user using ~100% CPU in the output of {{top}}. I'm running more builds
to confirm that this is Cassandra, but I'd be surprised if it weren't.

Is this information -- the patterns that the failures follow, and the {{jstack}},
{{netstat}}, and {{top}} output -- helpful? [~aweisberg] Do you have any thoughts? I'm not
sure what to make of it. If C* doesn't make the CQL port available for minutes under certain
circumstances -- like running with 15 GiB of memory -- that seems like a bug to me.
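
One way to quantify "not available for minutes" would be to time how long a node takes to accept TCP connections on its CQL port. The following is a minimal sketch using only the Python standard library. The function name wait_for_port is hypothetical, and the host and port in the usage note are assumptions (9042 is the default native transport port). A successful TCP connect only shows that the port is open, not that the node is ready to serve CQL requests.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Poll host:port until a TCP connection succeeds.

    Returns the elapsed seconds on success, or None if `timeout` seconds
    pass without a successful connection.
    """
    start = time.monotonic()
    deadline = start + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError (refused, timed out, ...) on failure.
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            time.sleep(interval)
    return None
```

Calling wait_for_port("127.0.0.1", 9042, timeout=300.0) right after starting a node would give a rough startup-to-port-open time that could be logged alongside the {{top}} and {{jstack}} output.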

> periodic timeout errors in dtest
> --------------------------------
>                 Key: CASSANDRA-10730
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jim Witschey
>            Assignee: Jim Witschey
> Dtests often fail with connection timeout errors. For example:
> {code}
> ('Unable to connect to any servers', {'': OperationTimedOut('errors=Timed out creating connection (10 seconds), last_host=None',)})
> {code}
> We've merged a PR to increase timeouts:
> It doesn't look like this has improved things:
> Next steps here are:
> * to scrape Jenkins history to see if and how the number of tests failing this way has
> increased (it feels like it has). From there we can bisect over the dtests, ccm, or C*, depending
> on what looks like the source of the problem.
> * to better instrument the dtest/ccm/C* startup process to see why the nodes start but
> don't successfully make the CQL port available.

