cassandra-commits mailing list archives

From "Ariel Weisberg (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-10730) periodic timeout errors in dtest
Date Wed, 02 Dec 2015 18:57:11 GMT


Ariel Weisberg commented on CASSANDRA-10730:

It seems like the netstat output is not there anymore? I looked at the server log from one of
the failures, and the server had started listening on the socket.

A flight recording might have more visibility into something the point in time snapshots are missing.

I am starting to wonder whether this is some kind of client library or protocol issue. I want
to dig into what the client library is experiencing when it says it can't connect to the server.

The first quick debug step would be to connect to the port, write some garbage, and see
if you can get a protocol error back from the server. If you connect and get a protocol error
back, it means the server is 90% working. I looked at the code and it tries to do that. I wouldn't
parse the error; I would just consume response data until the socket closes, with a timeout of a few minutes.
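The debug step described above could be sketched roughly as follows. This is a minimal, hypothetical probe (the function name `probe_cql_port`, the garbage payload, and the defaults are my own assumptions, not anything from the dtest or ccm code): it connects to the native protocol port, writes bytes that are not a valid CQL frame, and drains whatever the server sends back until the socket closes or the timeout expires, without trying to parse the error frame.

```python
import socket


def probe_cql_port(host="127.0.0.1", port=9042, timeout=120.0):
    """Connect to the native protocol port, send garbage, and consume
    the response until the server closes the socket or we time out.

    Hypothetical sketch: if the server is up enough to speak the
    protocol, it should send back an error frame and close. We return
    whatever raw bytes we received, without parsing them.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Deliberately invalid as a CQL native protocol frame.
        sock.sendall(b"\xff\xff\xff\xffgarbage\xff")
        sock.settimeout(timeout)
        chunks = []
        try:
            while True:
                data = sock.recv(4096)
                if not data:  # server closed the connection
                    break
                chunks.append(data)
        except socket.timeout:
            pass  # give up after the timeout; keep whatever arrived
        return b"".join(chunks)
```

Getting any bytes back here would tell us the server accepted the connection and responded, which would shift suspicion toward the client library or the environment rather than the server's listener.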

There is still the important clue of the CPU utilization and the fact that this goes away
when you move to a bigger instance. A bigger instance means more CPU and more memory. But
we have visibility into CPU and memory and nothing seems particularly wrong. There should
be a smoking gun here but we aren't seeing it.

I did notice that CPU utilization isn't always reported by top, but top isn't a great way
to monitor it anyway.

> periodic timeout errors in dtest
> --------------------------------
>                 Key: CASSANDRA-10730
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jim Witschey
>            Assignee: Jim Witschey
> Dtests often fail with connection timeout errors. For example:
> {code}
> ('Unable to connect to any servers', {'': OperationTimedOut('errors=Timed out creating connection (10 seconds), last_host=None',)})
> {code}
> We've merged a PR to increase timeouts:
> It doesn't look like this has improved things:
> Next steps here are
> * to scrape Jenkins history to see if and how the number of tests failing this way has increased (it feels like it has). From there we can bisect over the dtests, ccm, or C*, depending on what looks like the source of the problem.
> * to better instrument the dtest/ccm/C* startup process to see why the nodes start but don't successfully make the CQL port available.
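For the second step, the startup instrumentation could look something like the sketch below (the helper name `wait_for_cql_port` and all defaults are assumptions for illustration, not the actual dtest/ccm code): instead of a bare connect-with-timeout, it records the exact exception from each failed attempt, so a failure leaves behind evidence of *why* connections were refused or timing out.

```python
import socket
import time


def wait_for_cql_port(host="127.0.0.1", port=9042, timeout=60.0, interval=0.5):
    """Poll until a TCP connection to the CQL port succeeds.

    Hypothetical instrumentation sketch: each failed attempt's exception
    is recorded, so on timeout we can see whether connects were refused,
    timing out, or failing some other way. Returns the list of recorded
    failures on success (empty if the first attempt worked).
    """
    deadline = time.monotonic() + timeout
    failures = []
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return failures  # port is accepting connections
        except OSError as exc:
            failures.append(repr(exc))  # keep the exact errno for triage
            time.sleep(interval)
    raise TimeoutError(
        "CQL port %s:%d never became available; last failures: %s"
        % (host, port, failures[-3:])
    )
```

A log of `ConnectionRefusedError` entries would point at the server never binding the port, while a log of connect timeouts would point more toward the CPU-starvation or environment theory.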

This message was sent by Atlassian JIRA
