cassandra-commits mailing list archives

From "Ariel Weisberg (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-10730) periodic timeout errors in dtest
Date Tue, 01 Dec 2015 19:58:10 GMT


Ariel Weisberg commented on CASSANDRA-10730:

It definitely sounds like a bug in the server.

I looked at all the threads and none of them look like they are actually doing anything
that would explain the CPU utilization. What size heap are the nodes started with during dtests?
The RSS of the Java process is 713 megabytes, which makes me wonder if it's caught spinning
in GC against a 512 megabyte heap.

The next step, then, is to get a heap dump with jmap, collect GC logs, or collect a flight
recording. A flight recording would show what the active threads are if we are wrong about it
being a GC issue. I think it has to be GC, because the socket is bound and the thread is
listening; it probably just can't run because the JVM is wedged.
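If GC logs are collected (e.g. JDK 8's {{-XX:+PrintGCDetails -Xloggc:...}}), a quick way to confirm a GC spin is to sum the pause times. A minimal sketch, assuming a simplified JDK 8-style log format; the helper name and sample lines are illustrative, not real dtest output:

```python
import re

# Matches the trailing pause duration in JDK 8 PrintGCDetails-style lines,
# e.g. "... [Full GC (Ergonomics) 500M->495M(512M), 3.1000000 secs]"
PAUSE_RE = re.compile(r",\s*([0-9.]+)\s*secs\]")

def total_pause_seconds(gc_log_lines):
    """Sum GC pause durations from PrintGCDetails-style log lines (simplified)."""
    return sum(float(m.group(1))
               for line in gc_log_lines
               for m in PAUSE_RE.finditer(line))

# Hypothetical sample lines for illustration:
sample = [
    "2015-12-01T19:00:01.123+0000: [GC (Allocation Failure) 400M->380M(512M), 0.2500000 secs]",
    "2015-12-01T19:00:02.456+0000: [Full GC (Ergonomics) 500M->495M(512M), 3.1000000 secs]",
]
print(total_pause_seconds(sample))  # 3.35
```

If most of the wall-clock time during the timeout window shows up as pause time, that's the GC-spin scenario; if not, the flight recording is the next thing to look at.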

What's strange is that you say it goes away with a bigger instance. Maybe more memory leads to
a bigger default heap size from the JVM, if we aren't specifying one?
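That would be consistent with HotSpot ergonomics: on a JDK 8-era "server-class" machine, the default max heap when {{-Xmx}} is not set is roughly a quarter of physical RAM. A rough sketch of that heuristic (illustrative only; the real ergonomics also apply caps and other flags):

```python
# Very rough sketch of the HotSpot default-heap heuristic (JDK 8 era,
# "server-class" machine): max heap defaults to about 1/4 of physical RAM
# when -Xmx is not given. Illustration only, not the JVM's actual code.
def approx_default_max_heap_mb(physical_ram_mb):
    return physical_ram_mb // 4

print(approx_default_max_heap_mb(2048))  # 2 GB instance -> ~512 MB heap
print(approx_default_max_heap_mb(8192))  # 8 GB instance -> ~2048 MB heap
```

The actual value on a given box can be checked with {{java -XX:+PrintFlagsFinal -version}} and looking at MaxHeapSize. A ~512 MB default on a small instance would line up with the RSS observation above.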

> periodic timeout errors in dtest
> --------------------------------
>                 Key: CASSANDRA-10730
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jim Witschey
>            Assignee: Jim Witschey
> Dtests often fail with connection timeout errors. For example:
> {code}
> ('Unable to connect to any servers', {'': OperationTimedOut('errors=Timed out creating connection (10 seconds), last_host=None',)})
> {code}
> We've merged a PR to increase timeouts:
> It doesn't look like this has improved things:
> Next steps here are:
> * to scrape Jenkins history to see if and how the number of tests failing this way has increased (it feels like it has). From there we can bisect over the dtests, ccm, or C*, depending on what looks like the source of the problem.
> * to better instrument the dtest/ccm/C* startup process to see why the nodes start but don't successfully make the CQL port available.
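On the second point in the quoted description, the startup instrumentation could be as simple as polling the CQL port and logging each failed attempt. A minimal sketch, assuming port 9042 as the default CQL port; the helper name is hypothetical, not the actual dtest/ccm API:

```python
import socket
import time

def wait_for_cql_port(host, port=9042, timeout=10.0, interval=0.5):
    """Poll until a TCP connection to the CQL port succeeds.

    Logs each failed attempt so we can see *when* during startup the port
    becomes (or fails to become) available. Returns the attempt count.
    """
    deadline = time.time() + timeout
    attempt = 0
    while time.time() < deadline:
        attempt += 1
        try:
            with socket.create_connection((host, port), timeout=interval):
                return attempt
        except OSError as exc:
            print("attempt %d: %s:%d not ready (%s)" % (attempt, host, port, exc))
            time.sleep(interval)
    raise TimeoutError("CQL port %s:%d not available after %.1fs" % (host, port, timeout))
```

Timestamped logs from something like this, alongside the node's system.log, would show whether the node is slow to bind the port or whether (as suspected above) the port is bound but the JVM can't service connections.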

This message was sent by Atlassian JIRA
