cassandra-user mailing list archives

From Anton Winter
Subject Re: performance problems on new cluster
Date Fri, 12 Aug 2011 00:48:56 GMT

> Is there a reason you are using the trunk and not one of the tagged 
> releases? Official releases are a lot more stable than the trunk.
Yes. As we are using a combination of EC2 and colo servers, we need to 
use broadcast_address from CASSANDRA-2491.  The patch associated with 
that JIRA does not apply cleanly against 0.8, which is why we are using 
trunk.

>> 1) thrift timeouts & general degraded response times
> For read or writes ? What sort of queries are you running ? Check the 
> local latency on each node using cfstats and cfhistogram, and a bit of 
> iostat. What does nodetool tpstats say, is there a stage backing up?
> If the local latency is OK look at the cross DC situation. What CL are 
> you using? Are nodes timing out waiting for nodes in other DC's ?

iostat doesn't show a request queue bottleneck.  The timeouts we are 
seeing are for reads.  The latency on the nodes I have temporarily used 
for reads is around 2-45ms.  The next token in the ring at an alternate 
DC is showing ~4ms, with everything else around 0.05ms.  tpstats doesn't 
show any active/pending.  Reads are at CL.ONE and writes are at CL.ANY.

>> 2) *lots* of exception errors, such as:
> Repair is trying to run on a response which is a digest response, this 
> should not be happening. Can you provide some more info on the type of 
> query you are running ?
The query being run is  get cf1['user-id']['seg']

>> 3) ring imbalances during a repair (refer to the above nodetool ring 
>> output)
> You may be seeing this
> I think it's a mistake that it is marked as resolved.
What can I do to confirm that this issue is still outstanding and/or 
that we are affected by it?

>> 4) regular failure detection when any node does something only 
>> moderately stressful, such as a repair, or is under light load etc., 
>> but the node itself thinks it is fine.
> What version are you using ?
Version of failure detection?  I've not seen anything on this so I 
suspect this is the default.

