> Is there a reason you are using the trunk and not one of the tagged
> releases? Official releases are a lot more stable than the trunk.
>
Yes. We are running a mix of EC2 and colo servers, so we need
broadcast_address from CASSANDRA-2491. The patch attached to that JIRA
does not apply cleanly against 0.8, which is why we are using trunk.
>> 1) thrift timeouts & general degraded response times
> For read or writes ? What sort of queries are you running ? Check the
> local latency on each node using cfstats and cfhistogram, and a bit of
> iostat
> http://spyced.blogspot.com/2010/01/linux-performance-basics.html What
> does nodetool tpstats say, is there a stage backing up?
>
> If the local latency is OK look at the cross DC situation. What CL are
> you using? Are nodes timing out waiting for nodes in other DC's ?
iostat doesn't show a request queue bottleneck. The timeouts we are
seeing are for reads. The latency on the nodes I have temporarily used
for reads is around 2-45ms. The next token in the ring, at an alternate
DC, is showing ~4ms, with everything else around 0.05ms. tpstats doesn't
show any active/pending. Reads are at CL.ONE and writes are at CL.ANY.
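
For reference, here is a rough sketch of what those consistency levels
imply for how many replicas the coordinator has to hear back from. The
replication factor of 3 and the helper name required_acks are assumptions
for illustration only; neither comes from this thread or from Cassandra's
code.

```python
# Sketch only: how many replica responses a coordinator waits for
# at a given consistency level. Not Cassandra's actual implementation.

def required_acks(consistency_level: str, rf: int) -> int:
    """Return the number of replicas that must respond.

    rf is the replication factor (assumed here, not stated in the thread).
    """
    levels = {
        # Writes only: a hint stored on any node is enough, so no
        # replica that owns the row has to acknowledge the write.
        "ANY": 0,
        "ONE": 1,
        "QUORUM": rf // 2 + 1,
        "ALL": rf,
    }
    return levels[consistency_level]


if __name__ == "__main__":
    rf = 3  # hypothetical replication factor
    for cl in ("ANY", "ONE", "QUORUM", "ALL"):
        print(cl, required_acks(cl, rf))
```

On this reading, a read at CL.ONE waits for a single replica only, so a
slow remote DC should matter mainly when the coordinator picks a remote
replica to answer the read.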
>
>> 2) *lots* of exception errors, such as:
> Repair is trying to run on a response which is a digest response, this
> should not be happening. Can you provide some more info on the type of
> query you are running ?
>
The query being run is get cf1['user-id']['seg']
>> 3) ring imbalances during a repair (refer to the above nodetool ring
>> output)
> You may be seeing this
> https://issues.apache.org/jira/browse/CASSANDRA-2280
> I think it's a mistake that it is marked as resolved.
>
What can I do to confirm that this issue is still outstanding and/or
that we are affected by it?
>> 4) regular failure detection when any node does something only
>> moderately stressful, such as a repair or are under light load etc.
>> but the node itself thinks it is fine.
> What version are you using ?
>
Do you mean the version of failure detection? I've not seen anything on
this, so I suspect we are on the default.
Thanks,
Anton