cassandra-user mailing list archives

From: Alexander Dejanovski <a...@thelastpickle.com>
Subject: Re: Regular dropped READ messages
Date: Tue, 06 Jun 2017 12:46:20 GMT
Hi Vincent,

Dropped messages are indeed common in the case of long GC pauses.
Pauses of 4 to 6 seconds are not normal and are the sign of an unhealthy
cluster. Minor GCs are usually faster, but they can be long as well.
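
A quick way to get a picture of pause frequency and duration is to grep the
GCInspector lines Cassandra already writes to its log. A rough sketch, assuming
the default log location (adjust the path to your install; the exact message
wording varies a bit between versions):

    # List the pause durations reported by GCInspector, sorted and counted.
    grep -h "GCInspector" /var/log/cassandra/system.log* \
      | grep -oE "[0-9]+ms" \
      | sort -n | uniq -c | tail -20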

If you can share your hardware specs along with your current GC settings
(CMS or G1, heap size, young gen size) and a distribution of GC pauses
(rate of minor GCs, average and max duration of GCs), we could try to help
you tune your heap settings.
You can also activate full GC logging, which helps with fine-tuning
MaxTenuringThreshold and survivor space sizing.
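On Java 8 the relevant lines in cassandra-env.sh look roughly like the sketch
below (most of these flags ship commented out in the stock file; the log path
here is an assumption, so point it at your own log directory):

    # Sketch: detailed GC logging for a Java 8 / Cassandra 2.2 node, added to cassandra-env.sh.
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
    JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"      # object ages, useful for MaxTenuringThreshold
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"  # total stop-the-world time, not just GC work
    JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
    JVM_OPTS="$JVM_OPTS -XX:+UseGCLogFileRotation"
    JVM_OPTS="$JVM_OPTS -XX:NumberOfGCLogFiles=10"
    JVM_OPTS="$JVM_OPTS -XX:GCLogFileSize=10M"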

You should also check max partition sizes and the number of SSTables
accessed per read. Run nodetool cfstats/cfhistograms on your tables to get
both. p75 should be less than or equal to 4 SSTables per read, and you
shouldn't have partitions over, let's say, 300 MB. Partitions over 1 GB are
a critical problem to address.
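
For example (the keyspace and table names below are placeholders):

    # Per-table stats; replace my_ks / my_table with your own keyspace and table.
    nodetool cfstats my_ks.my_table        # SSTable count, max partition bytes, tombstones per slice
    nodetool cfhistograms my_ks my_table   # percentiles for SSTables per read, latencies, partition size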

Other things to consider are:
Do you read from a single partition for each query?
Do you use collections that could spread over many SSTables?
Do you use batches for writes (although your problem doesn't seem to be
write related)? A quick log check for this is sketched below.
Can you share the queries from your scheduled selects and the data model?
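
The logs often answer the batch and tombstone questions directly. A rough
check, assuming the default log path (the exact warning wording varies a bit
between versions):

    # Oversized batches and tombstone-heavy reads both get warned about in system.log.
    grep -i "exceeding specified threshold" /var/log/cassandra/system.log*   # large batch warnings
    grep -ci "tombstone cells" /var/log/cassandra/system.log*                # tombstone warnings per log file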

Cheers,


On Tue, Jun 6, 2017 at 2:33 PM Vincent Rischmann <me@vrischmann.me> wrote:

> Hi,
>
> we have a cluster of 11 nodes running Cassandra 2.2.9 where we regularly
> get READ messages dropped:
>
> > READ messages were dropped in last 5000 ms: 974 for internal timeout and
> 0 for cross node timeout
>
> Looking at the logs, some are logged at the same time as Old Gen GCs.
> These GCs all take around 4 to 6s to run. To me, it's "normal" that these
> could cause reads to be dropped.
> However, we also have reads dropped without Old Gen GCs occurring, only
> Young Gen.
>
> I'm wondering if anyone has a good way of determining what the _root_
> cause could be. Up until now, the only way we've managed to decrease load on
> our cluster was essentially by guessing, trying things out and getting lucky.
> I'd love a way to confirm what the problem is before tackling it. Doing
> schema changes is not a problem, but changing stuff blindly is not super
> efficient :)
>
> What I do see in the logs is that these drops happen almost exclusively
> when we do a lot of SELECTs. The times logged almost always correspond to
> times when our scheduled SELECTs are happening. That narrows the scope a
> little, but still.
>
> Anyway, I'd appreciate any information about troubleshooting this scenario.
> Thanks.
>
-- 
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
