cassandra-user mailing list archives

From Aaron Morton <aa...@thelastpickle.com>
Subject Re: massive spikes in read latency
Date Fri, 10 Jan 2014 01:37:32 GMT
> The spikes in latency don’t seem to be correlated to an increase in reads. The cluster is
> usually handling a maximum workload of 4200 reads/sec per node, with writes being significantly
> less, at ~200/sec per node. Usually it will be fine with this, with read latencies at around
> 3.5-10 ms/read, but once or twice an hour the latencies on the 3 nodes will shoot through the roof.
Could there be errant requests coming in from the app? e.g. something asking for 1,000s
of columns?
Or something that hits a row that has a lot of tombstones? 

Take a look at nodetool cfhistograms to see if you have any outlier wide rows. Also the second
column, sstables, will tell you how many sstables were touched by reads. High numbers, above
4, let you know there are some wide rows out there.  
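For example (the keyspace and column family names below are just placeholders for your own):

    nodetool cfhistograms MyKeyspace MyColumnFamily

The SSTables column shows the distribution of sstables touched per read, and the Row Size and
Column Count columns will show any outlier wide rows. The exact column layout varies a little
between versions.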

In 2.0 and later 1.2 releases nodetool cfstats will also include information about the number
of tombstones touched in a read. 
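For example (again, the keyspace name is a placeholder):

    nodetool cfstats MyKeyspace

then look at the tombstone related lines in the relevant column family section, e.g. the
average tombstones per slice reported in recent releases.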

Hope that helps. 

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 8/01/2014, at 2:15 am, Jason Wee <peichieh@gmail.com> wrote:

>     /**
>      * Verbs it's okay to drop if the request has been queued longer than the request timeout.  These
>      * all correspond to client requests or something triggered by them; we don't want to
>      * drop internal messages like bootstrap or repair notifications.
>      */
>     public static final EnumSet<Verb> DROPPABLE_VERBS = EnumSet.of(Verb.BINARY,
>                                                                    Verb._TRACE,
>                                                                    Verb.MUTATION,
>                                                                    Verb.READ_REPAIR,
>                                                                    Verb.READ,
>                                                                    Verb.RANGE_SLICE,
>                                                                    Verb.PAGED_RANGE,
>                                                                    Verb.REQUEST_RESPONSE);
> 
> 
> The short term solution would probably be to increase the timeout in your yaml file, but I
> suggest you get the monitoring graphs (internode ping, block io) ready so they give a better
> indication of what the exact problem might be.
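> The relevant settings in a 1.2/2.0 cassandra.yaml look roughly like this (the values below
> are only illustrative, not recommendations):
> 
>     read_request_timeout_in_ms: 10000
>     range_request_timeout_in_ms: 10000
>     write_request_timeout_in_ms: 10000
>     request_timeout_in_ms: 10000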
> 
> Jason
> 
> 
> On Tue, Jan 7, 2014 at 2:30 AM, Blake Eggleston <blake@shift.com> wrote:
> That’s a good point. CPU steal time is very low, but I haven’t observed internode ping
> times during one of the peaks, I’ll have to check that out. Another thing I’ve noticed is
> that cassandra starts dropping read messages during the spikes, as reported by tpstats. This
> indicates that there are too many queries for cassandra to handle. However, as I mentioned
> earlier, the spikes aren’t correlated to an increase in reads.
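> (The dropped message counts, including READ, are listed at the end of the nodetool tpstats
> output, so that’s an easy thing to watch on each node during a spike.)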
> 
> On Jan 5, 2014, at 3:28 PM, Blake Eggleston <blake@shift.com> wrote:
> 
> > Hi,
> >
> > I’ve been having a problem with 3 neighboring nodes in our cluster having their read
> > latencies jump up to 9000ms - 18000ms for a few minutes (as reported by opscenter), then
> > come back down.
> >
> > We’re running a 6 node cluster, on AWS hi1.4xlarge instances, with cassandra reading
> > and writing to 2 raided ssds.
> >
> > I’ve added 2 nodes to the struggling part of the cluster, and aside from the latency
> > spikes shifting onto the new nodes, it has had no effect. I suspect that a single key that
> > lives on the first stressed node may be being read from heavily.
> >
> > The spikes in latency don’t seem to be correlated to an increase in reads. The cluster
> > is usually handling a maximum workload of 4200 reads/sec per node, with writes being
> > significantly less, at ~200/sec per node. Usually it will be fine with this, with read
> > latencies at around 3.5-10 ms/read, but once or twice an hour the latencies on the 3 nodes
> > will shoot through the roof.
> >
> > The disks aren’t showing serious use, with read and write rates on the ssd volume at
> > around 1350 kBps and 3218 kBps, respectively. Each cassandra process is maintaining 1000-1100
> > open connections. GC logs aren’t showing any serious gc pauses.
> >
> > Any ideas on what might be causing this?
> >
> > Thanks,
> >
> > Blake
> 
> 

