cassandra-user mailing list archives

From Ramzi Rabah <>
Subject Re: TimedOutException
Date Fri, 18 Dec 2009 00:43:58 GMT
OK, I believe the problem came from how I upgraded to a newer build of
Cassandra: I upgraded the servers one at a time, restarting each in turn.
So at one point some nodes were running a build 2 days older than
the others, and that seems to have caused the inter-node messaging to go
wrong.

I stopped all the nodes at the same time and restarted them together,
and the problem seems to be fixed.
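
For anyone who wants to take the same kind of measurement, here is a
minimal sketch of the timing approach described in the quoted message
below. It is not the actual debugging code added to Cassandra: the
ReadTimer class, the timeCall helper, and the stand-in lambdas are all
hypothetical, and nothing here calls Cassandra's API. It only shows how
to wrap a call, measure its wall-clock time, and flag slow calls.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Cassandra.
public class ReadTimer {

    /** The value returned by a timed call plus its elapsed wall-clock time. */
    public static final class Timed<T> {
        public final T value;
        public final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /**
     * Runs the call, measures how long it took, and prints a warning
     * to stderr if it took at least warnMillis.
     */
    public static <T> Timed<T> timeCall(String label, Callable<T> call, long warnMillis)
            throws Exception {
        long start = System.nanoTime(); // monotonic clock, safe for intervals
        T value = call.call();
        long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        if (elapsed >= warnMillis) {
            System.err.println(label + " took " + elapsed + " ms");
        }
        return new Timed<>(value, elapsed);
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for a local read and a remote round trip (assumptions).
        Timed<String> local = timeCall("local read", () -> "row", 5000);
        Timed<String> remote = timeCall("remote read", () -> {
            Thread.sleep(50); // simulated network delay
            return "row";
        }, 5000);
        System.out.println("local=" + local.elapsedMillis
                + " ms, remote=" + remote.elapsedMillis + " ms");
    }
}
```

Timing the local read and the remote round trip separately is what
shows whether the delay is in the read itself or in the messaging
between nodes.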

On Thu, Dec 17, 2009 at 8:55 AM, Ramzi Rabah <> wrote:
> I added some debugging code to capture the time a read takes
> (getColumnFamily) and the time the round trip weakRemoteRead takes.
> The time it takes to read columns is negligible, so it doesn't seem a
> problem with getColumnFamily. The time it takes for weakRemoteRead
> however is > 5 seconds in some cases. So looking at some more
> debugging output,
> the log indicates that the packets are in the process of being sent by
> weakRemoteRead to the correct target node, but for some reason, the
> target node does not have any reference
> in the log that it handled the get at all.
> Couple other things to note:
> 1- I restarted the nodes one after another, while there was traffic
> going to them. Don't know if that will throw off cassandra or that the
> whole thing is a network congestion problem?
> 2- Read stats on the keyspace level indicate NaN value for Read
> latency which seems like a bug?
> Thanks
> Ramzi
> On Wed, Dec 16, 2009 at 12:07 PM, Jonathan Ellis <> wrote:
>> On Wed, Dec 16, 2009 at 12:46 PM, Ramzi Rabah <> wrote:
>>> We are observing increasing number of TimedOutExceptions in cassandra
>>> 0.5 trunk although the load seems fairly low (about 400 reads/writes
>>> per second).
>>> cfstats reports that operations are taking less than 2 ms on average.
>>> 2 Things I have noticed looking at the source code.
>>> 1- TimedOutExceptions are silently swallowed by Cassandra and not
>>> reported in the logs even at debug level
>> It's reported to the client.  Hardly "swallowed" :)
>>> 2- readstats does not account for these long time running queries that
>>> time out.
>> Right.  But the CF-level stats do.
>>> I'm wondering, what could be causing the system to go haywire like
>>> this?
>> Hard to say without more information.  One shot in the dark is that
>> get_key_range is a major offender sometimes, as well as workloads that
>> do lots of deletes + re-inserts for the same keys.
>> -Jonathan
