cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: performance problems on new cluster
Date Fri, 12 Aug 2011 02:11:06 GMT
> 
> iostat doesn't show a request queue bottleneck. The timeouts we are seeing are for reads.
> The latency on the nodes I have temporarily used for reads is around 2-45ms. The next token
> in the ring at an alternate DC is showing ~4ms with everything else around 0.05ms. tpstats
> doesn't show any active/pending. Reads are at CL.ONE & writes are using CL.ANY.

OK, node latency is fine and you are using pretty low consistency levels. You said NTS with
RF 2; is that RF 2 for each DC?
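For reference, with NTS the replication factor is set per DC in strategy_options, so RF 2 in
both DCs would look something like the sketch below. This is just an illustration using the
0.8-era CLI syntax; YourKS and the DC names are placeholders and need to match your snitch
and schema:

    create keyspace YourKS
      with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
      and strategy_options = [{DC1:2, DC2:2}];

You can check what you currently have with "describe keyspace YourKS;" in the CLI.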

The steps below may help get an idea of what's going on (there is a rough CLI sketch after the list)…

1) use nodetool getendpoints to determine which replicas hold a key.
2) connect directly to one of those endpoints with the CLI, make sure CL is ONE and run your
test query.
3) connect to another node in the same DC that is not a replica and do the same.
4) connect to another node in a different DC and do the same.
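A rough sketch of steps 1 and 2, assuming a keyspace called YourKS plus the column family and
key from your query below (host names are placeholders; the CLI defaults to CL ONE):

    # 1) find which nodes hold the key (run against any node in the cluster)
    nodetool -h node1.example.com getendpoints YourKS cf1 user-id

    # 2) connect directly to one of the returned endpoints and run the test query
    cassandra-cli -h <replica-ip> -p 9160
    [default@unknown] use YourKS;
    [default@YourKS] get cf1['user-id']['seg'];

Repeat the get against a non-replica in the same DC and against a node in the other DC (steps
3 and 4) and compare the response times.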

Once you can repro it, try turning up the logging on the coordinator to DEBUG; you can do this
via JConsole. Look for these lines…

* Command/ConsistencyLevel is….
* reading data locally... or reading data from…
* reading digest locally… or reading digest for from…
* Read timeout:…

You'll also see some lines about receiving messages from other nodes.  
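If it is easier to grep the log afterwards than to watch it live, something along these lines
should pull out the relevant entries (assuming the default log location; adjust the path for
your install):

    # read-path DEBUG lines plus any read timeouts on the coordinator
    grep -E "ConsistencyLevel is|reading data|reading digest|Read timeout" /var/log/cassandra/system.log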

Hopefully you can get an idea of which nodes are involved in a failing query. Getting a thrift
TimedOutException on a read with CL ONE is pretty odd. 

> What can I do in regards to confirming this issue is still outstanding and/or we are
> affected by it?
It's in 0.8 and will not be fixed. My unscientific approach was to repair a single CF at a
time, hoping that the differences would be smaller and less data would be streamed.
Minor compaction should help squish things down. If you want to get more aggressive, reduce
the min compaction threshold and trigger a minor compaction with nodetool flush, for example:
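Something like this (keyspace and CF names are placeholders, and 2/32 is only an illustration;
the defaults are min 4, max 32):

    # lower the number of SSTables needed to kick off a minor compaction
    nodetool -h node1.example.com setcompactionthreshold YourKS cf1 2 32

    # flush the memtable so the extra SSTable on disk can trip the threshold
    nodetool -h node1.example.com flush YourKS cf1

Remember to put the threshold back afterwards.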

> Version of failure detection? I've not seen anything on this so I suspect this is the
> default.
I was asking so I could see if there were any fixes in Gossip or the FailureDetector that you
were missing. Check the CHANGES.txt file.
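A quick scan of the source tree you built from will show anything relevant, e.g.:

    # look for Gossip / failure detector related entries in the release notes
    grep -n -i -E "gossip|failure" CHANGES.txt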

Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 12 Aug 2011, at 12:48, Anton Winter wrote:

> 
>> Is there a reason you are using the trunk and not one of the tagged releases? Official
>> releases are a lot more stable than the trunk.
>> 
> Yes, as we are using a combination of EC2 and colo servers we need to use broadcast_address
> from CASSANDRA-2491. The patch associated with that JIRA does not apply cleanly against
> 0.8, which is why we are using trunk.
> 
>>> 1) thrift timeouts & general degraded response times
>> For reads or writes? What sort of queries are you running? Check the local latency
>> on each node using cfstats and cfhistograms, and a bit of iostat:
>> http://spyced.blogspot.com/2010/01/linux-performance-basics.html
>> What does nodetool tpstats say, is there a stage backing up?
>> 
>> If the local latency is OK, look at the cross-DC situation. What CL are you using?
>> Are nodes timing out waiting for nodes in other DCs?
> 
> iostat doesn't show a request queue bottleneck. The timeouts we are seeing are for reads.
> The latency on the nodes I have temporarily used for reads is around 2-45ms. The next token
> in the ring at an alternate DC is showing ~4ms with everything else around 0.05ms. tpstats
> doesn't show any active/pending. Reads are at CL.ONE & writes are using CL.ANY.
> 
>> 
>>> 2) *lots* of exception errors, such as:
>> Repair is trying to run on a response which is a digest response; this should not
>> be happening. Can you provide some more info on the type of query you are running?
>> 
> The query being run is get cf1['user-id']['seg']
> 
> 
>>> 3) ring imbalances during a repair (refer to the above nodetool ring output)
>> You may be seeing this
>> https://issues.apache.org/jira/browse/CASSANDRA-2280
>> I think it's a mistake that it is marked as resolved.
>> 
> What can I do in regards to confirming this issue is still outstanding and/or we are
> affected by it?
> 
>>> 4) regular failure detection when any node does something only moderately stressful,
>>> such as a repair, or is under light load etc., but the node itself thinks it is fine.
>> What version are you using?
>> 
> Version of failure detection? I've not seen anything on this so I suspect this is the
> default.
> 
> 
> Thanks,
> Anton
> 

