cassandra-user mailing list archives

From "Matthew O'Riordan" <>
Subject Re: ALL range query monitors failing frequently
Date Wed, 28 Jun 2017 14:04:43 GMT
Hi Kurt

Thanks for the response. A few comments inline:

On Wed, Jun 28, 2017 at 1:17 PM, kurt greaves <> wrote:

> You're correct in that the timeout is only driver side. The server will
> have its own timeouts configured in the cassandra.yaml file.
Yup, OK.
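
For anyone else following the thread, I believe these are the server-side knobs kurt means (names and defaults as I read them from a stock cassandra.yaml, so treat the values as assumptions rather than what we actually run):

    # cassandra.yaml (coordinator-side timeouts, stock defaults)
    read_request_timeout_in_ms: 5000     # single-partition reads
    range_request_timeout_in_ms: 10000   # range scans, like our monitor query
    request_timeout_in_ms: 10000         # default for other operations

So, if I understand it correctly, a range query gets up to ~10s on the coordinator before the server gives up, independent of whatever timeout the driver is configured with.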

> I suspect either that you have a node down in your cluster (or 4),
Nope, that’s not what is happening: a) we have monitoring on all nodes,
and b) there is nothing in the logs.

> or your queries are gradually getting slower.
Perhaps, but we have query time metrics that don’t seem to indicate any
obvious issues.  See the attached metrics from the last 12 hours for quorum

> This kind of aligns with the slow query statements in your logs. Are you
> making changes/updates to the partitions that you are querying?

> It could be that the partitions are now spread across multiple SSTables
> and thus slowing things down. You should perform a trace to get a better
> idea of the issue.
If I run the range query at CONSISTENCY QUORUM or ALL, it is visibly slow
in cqlsh and unfortunately results in a trace failure: “Statement trace
did not complete within 10 seconds”.
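
(Side note in case it helps anyone else reproduce this: the 10 second limit looks like cqlsh’s own wait for the trace session rather than a server timeout. My understanding, which I haven’t verified, is that it can be raised via cqlshrc, e.g.:

    ; ~/.cassandra/cqlshrc
    [tracing]
    ; seconds cqlsh waits for the trace session to complete
    max_trace_wait = 60

With that bumped, TRACING ON followed by the slow query should stand a better chance of producing a complete trace.)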

> A hacky workaround would be to increase your read timeouts server side
> (read_timeout_in_ms), however this will mask underlying data model issues.
Yup, I certainly don’t like the idea of that.

I’m interested in what you said about the partitions being spread across
multiple SSTables.  Any pointers on what to look for there?
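
In case it’s useful, my plan for checking that is roughly the following (assuming these nodetool subcommands behave the way I read the docs on our version):

    # percentiles for SSTables touched per read, plus partition sizes
    nodetool tablehistograms <keyspace> <table>

    # partition size / tombstones-per-slice summary for the table
    nodetool tablestats <keyspace>.<table>

If the SSTables column is routinely well above 1-2 at the higher percentiles, I’d take that as supporting the theory that reads are hitting many SSTables.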

I then wondered whether a range query is simply a bad idea, even if only
for monitoring purposes.  I tried querying for just one row with the ID
specified, i.e. something like SELECT * from keyspace.table
where id = 123;  It was still incredibly slow (with CONSISTENCY ALL) and
failed a few times to generate a trace, but finally resulted in a trace
that can be seen at

The worst offender seemed to be, so I ran the same query on
that instance itself to see if it is under load / servicing requests slowly
and it’s not. See

So as far as I can tell, it looks like there may be some issue with nodes
communicating with each other, but the logs don’t reveal much.
Where to now?
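
For completeness, here is what I’m planning to look at next to rule inter-node problems in or out (again, assuming I’m reading the nodetool docs correctly):

    # is any node marked down from each node's point of view?
    nodetool status

    # dropped READ / RANGE_SLICE messages and backed-up thread pools
    nodetool tpstats

    # pending or dropped inter-node messages, plus any streaming in progress
    nodetool netstats

If one replica shows dropped reads or large pending pools, that would at least narrow down which node to dig into.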



Matthew O'Riordan
CEO who codes
Ably - simply better realtime <>

Ably News: Ably push notifications have gone live
