cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From daemeon reiydelle <daeme...@gmail.com>
Subject Re: Debugging high tail read latencies (internal timeout)
Date Thu, 07 Jul 2016 19:00:45 GMT
Hmm. Would you mind looking at your network interface (appropriate netstat
commands). if I am right you will be seeing packet errors, drops, retries,
packet out of window receives, etc.

What you may be missing is that you reported zero DROPPED latency. Not mean
LATENCY. Check your netstats. ANY VALUE CHANGE IS BAD (except total
read/write byte counts). If your network guys say otherwise, escalate to
someone who undertands tcp retry and sliding window.



*.......*



*Daemeon C.M. ReiydelleUSA (+1) 415.501.0198London (+44) (0) 20 8144 9872*

On Thu, Jul 7, 2016 at 11:35 AM, Bryan Cheng <bryan@blockcypher.com> wrote:

> Hi Nimi,
>
> My suspicions would probably lie somewhere between GC and large partitions.
>
> The first tool would probably be a trace but if you experience full client
> timeouts from dropped messages you may find it hard to find the issue. You
> can try running the trace with cqlsh's timeouts cranked all the way against
> the local node with CL=ONE to try to force the local machine to answer.
>
> What does nodetool tpstats report for dropped message counts? Are they
> very high? Primarily restricted to READ, or including MUTATION, etc. ?
>
> Are there specific PK's that trigger this behavior, either all the time or
> more consistently? That would finger either very large partition sizes or
> potentially bad hardware on a node. cfhistograms will show you various
> percentile partition sizes and your max as well.
>
> GC should be accessible via JMX and also you should have GCInspector logs
> in cassandra/system.log that should give you per-collection breakdowns.
>
> --Bryan
>
>
> On Wed, Jul 6, 2016 at 6:22 PM, Nimi Wariboko Jr <nimi@channelmeter.com>
> wrote:
>
>> Hi,
>>
>> I've begun experiencing very high tail latencies across my clusters.
>> While Cassandra's internal metrics report <1ms read latencies, measuring
>> responses from within the driver in my applications (roundtrips of
>> query/execute frames), have 90% round trip times of up to a second for very
>> basic queries (SELECT a,b FROM table WHERE pk=x).
>>
>> I've been studying the logs to try and get a handle on what could be
>> going wrong. I don't think there are GC issues, but the logs mention
>> dropped messages due to timeouts while the threadpools are nearly empty -
>>
>> https://gist.github.com/nemothekid/28b2a8e8353b3e60d7bbf390ed17987c
>>
>> Relevant line:
>> REQUEST_RESPONSE messages were dropped in last 5000 ms: 1 for internal
>> timeout and 0 for cross node timeout. Mean internal dropped latency: 54930
>> ms and Mean cross-node dropped latency: 0 ms
>>
>> Are there any tools I can use to start to understand what is causing
>> these issues?
>>
>> Nimi
>>
>>
>

Mime
View raw message