it doesn't seem to be the same; it looks like just under 10% of the read traffic. the query i originally posted was one that we captured and used as an example. every time i ran it at local_quorum, all, or quorum it would do a read repair, even though the record hasn't been updated by an actual client write for a long time.

i'm tracking read repair metrics and graphing them over time. looking at the table-level metrics, there are very few client-level writes, but the table-level write metrics follow the exact same pattern as the reads, which in turn matches the read repairs. this is the only table and the only cluster i've ever seen this type of behavior on.

[attachments: two images, both named image.png]



On Wed, Oct 16, 2019 at 5:30 PM Jeff Jirsa <jjirsa@gmail.com> wrote:
The only way you're going to figure this is to run with tracing and find a key that is definitely being repaired multiple times.

Is it always the same instance? Is it random instances? 

You're suggesting blocking RR despite no mismatches, which basically implies something is digesting incorrectly. It's possible that something in the storage engine is wrong here, and it's worth finding out, because it may not impact you but may represent a real bug.

Would be great if you could turn on tracing.
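
Once traces are collected, the two questions above (is one key repaired repeatedly, and is it always the same instance) can be answered by counting mismatch events. The following is only a rough sketch: it assumes the trace events have been exported as (source, activity) pairs, for example from system_traces.events, and that the mismatch message looks like "Digest mismatch: ... Mismatch for key <key> (<digest> vs <digest>)". That message shape and the sample rows are assumptions, so check them against a real trace before relying on the regex.

```python
import re
from collections import Counter

# Assumed shape of the mismatch trace message; verify against a real trace.
MISMATCH_RE = re.compile(
    r"Digest mismatch:.*Mismatch for key (?P<key>.+) "
    r"\((?P<d1>[0-9a-f]+) vs (?P<d2>[0-9a-f]+)\)"
)


def summarize_mismatches(events):
    """Count digest mismatches per key and per reporting node.

    events: iterable of (source, activity) pairs, one per trace event.
    Returns (Counter keyed by partition key, Counter keyed by source).
    """
    by_key = Counter()
    by_source = Counter()
    for source, activity in events:
        match = MISMATCH_RE.search(activity)
        if match:
            by_key[match.group("key")] += 1
            by_source[source] += 1
    return by_key, by_source


# Hypothetical sample rows, made up purely to exercise the function.
SAMPLE_EVENTS = [
    ("10.0.0.1", "Executing single-partition query on table1"),
    ("10.0.0.1", "Digest mismatch: org.apache.cassandra.service."
                 "DigestMismatchException: Mismatch for key "
                 "DecoratedKey(-1, 0001) (aa11 vs bb22)"),
    ("10.0.0.2", "Digest mismatch: org.apache.cassandra.service."
                 "DigestMismatchException: Mismatch for key "
                 "DecoratedKey(-1, 0001) (aa11 vs bb22)"),
]

by_key, by_source = summarize_mismatches(SAMPLE_EVENTS)
for key, count in by_key.most_common():
    print(f"{count} mismatches for {key}")
for source, count in by_source.most_common():
    print(f"{count} mismatches reported by {source}")
```

A key that keeps showing up across repeated traced reads, with the same pair of digests each time, would point at a digest problem rather than real data divergence.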



On Wed, Oct 16, 2019 at 12:17 PM Patrick Lee <patrickclee0207@gmail.com> wrote:
we do have otc_coalescing_strategy disabled; we ran into that a long while back, where we saw better performance with it off.
most recently we also set disk_access_mode to mmap_index_only, as we had a few clusters where we would experience a lot more disk IO causing high load and high cpu, so latencies were crazy high. since setting this to mmap_index_only we've seen a lot better overall performance.

just haven't seen this constant rate of read repairs.



On Wed, Oct 16, 2019 at 12:57 PM ZAIDI, ASAD <az192g@att.com> wrote:

Wondering if you’ve disabled otc_coalescing_strategy (CASSANDRA-12676) since you’ve upgraded from 2.x? Also, have you had any luck increasing native_transport_max_threads to address blocked NTRs (CASSANDRA-11363)?

~Asad

 

 

 

From: Patrick Lee [mailto:patrickclee0207@gmail.com]
Sent: Wednesday, October 16, 2019 12:22 PM
To: user@cassandra.apache.org
Subject: Re: Constant blocking read repair for such a tiny table

 

haven't really figured this out yet. it's not a big problem but it is annoying for sure! the cluster was upgraded from 2.1.16 to 3.11.4. the only thing is i'm not sure if we had this type of behavior before the upgrade. i'm leaning toward no based on my data, but i'm just not 100% sure.

 

just 1 table, out of all the ones on the cluster, has this behavior. repair has been run a few times via reaper. i even did a nodetool compact on the nodes (since this table is only about 1GB per node). i just don't see why there would be any inconsistency that would trigger read repair.

 

any insight you may have would be appreciated! what started this digging into the cluster was that during a stress test the application team complained about high latency (30ms at p98). this cluster is already oversized for this use case with only 14GB of data per node, and there is more than enough ram, so all the data is basically cached in ram. the only thing that stands out is this crazy read repair. so this read repair may not be my root issue, but it definitely shouldn't be happening like this.

 

the vm's:

- 12 cores

- 82GB ram

- 1.2TB local ephemeral ssd's

 

attached the info from 1 of the nodes.

 

On Tue, Oct 15, 2019 at 2:36 PM Alain RODRIGUEZ <arodrime@gmail.com> wrote:

Hello Patrick,

 

Still in trouble with this? I must admit I'm really puzzled by your issue. I have no real idea of what's going on. Would you share with us the output of:

 

- nodetool status <keyspace>

- nodetool describecluster

- nodetool gossipinfo

- nodetool tpstats

 

Also you said the app is running for a long time, with no changes. What about Cassandra? Any recent operations?

 

I hope that with this information we might be able to understand better and finally be able to help.

 

-----------------------

Alain Rodriguez - alain@thelastpickle.com

France / Spain

 

The Last Pickle - Apache Cassandra Consulting

 

On Fri, Oct 4, 2019 at 12:25 AM Patrick Lee <patrickclee0207@gmail.com> wrote:

this table was actually using leveled compaction before; i just changed it to size tiered yesterday while researching this.

 

On Thu, Oct 3, 2019 at 4:31 PM Patrick Lee <patrickclee0207@gmail.com> wrote:

it's not really time series data, and it's not updated very often; it does get some updates, but they're pretty infrequent. this thing should be super fast: on avg it's like 1 to 2ms p99 currently, but if they double or triple the traffic on that table, latencies go up to 20ms to 50ms. the only odd thing i see is that there are constant read repairs that follow the same traffic pattern as the reads, which shows up as constant writes on the table (from the read repairs). after a read repair or a normal full repair (all full repairs through reaper, never ran any incremental repair) i would expect it not to have any mismatches. the other 5 tables they use on the cluster can have the same level of traffic, all very simple selects by partition key that return a single record.

 

On Thu, Oct 3, 2019 at 4:21 PM John Belliveau <belliveau.john@gmail.com> wrote:

Hi Patrick,

 

Is this time series data? If so, I have run into issues with repair on time series data using the SizeTieredCompactionStrategy. I have had better luck using the TimeWindowCompactionStrategy.

 

John

 


 

From: Patrick Lee
Sent: Thursday, October 3, 2019 5:14 PM
To: user@cassandra.apache.org
Subject: Constant blocking read repair for such a tiny table

 

I have a cluster that is running 3.11.4 (it was upgraded a while back from 2.1.16). What I see is a steady rate of read repair, constantly about 10%, on only this 1 table. Repairs have been run (actually several times). The table does not have a lot of writes to it, so after a repair, or even after a read repair, I would expect it to be fine. The reason I'm having to dig into this so much is that under a much larger traffic load than their normal traffic, latencies are higher than the app team wants.

 

I mean this thing is tiny, it's a 12x12 cluster but this 1 table is like 1GB per node on disk.

 

the application team is doing reads at LOCAL_QUORUM and I can simulate this on that cluster by running a query using quorum and/or local_quorum and in the trace can see every time running the query it comes back with a DigestMismatchException no matter how many times I run it. that record hasn't been updated by the application for several months.

 

repairs are scheduled and run every 7 days via reaper, recently in the past week this table has been repaired at least 3 times.  every time there are mismatches and data streams back and forth but yet still a constant rate of read repairs. 

 

curious if anyone has any recommendations on how to look into this further, or has experienced anything like this?

 

this node has been up for 24 hours.. this is the netstats for read repairs

Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 7481
Mismatch (Blocking): 11425375
Mismatch (Background): 17
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0           1232         0
Small messages                  n/a         0      395903678         0
Gossip messages                 n/a         0         603746         0
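
To put the counters above in perspective, here is a small sketch that parses the Read Repair Statistics section and turns the blocking mismatch count into a per-second rate. It assumes "up for 24 hours" means exactly 86400 seconds of uptime and that the counters started at zero when the node came up; the function names are made up for illustration.

```python
import re

# Read repair section copied from the netstats output above.
NETSTATS = """\
Read Repair Statistics:
Attempted: 7481
Mismatch (Blocking): 11425375
Mismatch (Background): 17
"""


def parse_read_repair_stats(text):
    """Return {label: count} for the lines after 'Read Repair Statistics:'."""
    stats = {}
    in_section = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Read Repair Statistics"):
            in_section = True
            continue
        if in_section:
            match = re.fullmatch(r"(.+?):\s*(\d+)", line)
            if not match:
                break  # first line that is not "label: number" ends the section
            stats[match.group(1)] = int(match.group(2))
    return stats


def per_second(count, uptime_seconds):
    """Average events per second over the given uptime."""
    return count / uptime_seconds


stats = parse_read_repair_stats(NETSTATS)
uptime_seconds = 24 * 60 * 60  # assumption: exactly 24 hours of uptime
rate = per_second(stats["Mismatch (Blocking)"], uptime_seconds)
print(f"blocking mismatches per second: {rate:.1f}")
```

With the numbers above this works out to roughly 132 blocking mismatches per second on this one node, averaged over the day.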

 

example of the schema. some modifications have been made to reduce read_repair and speculative_retry while troubleshooting.

CREATE TABLE keyspace.table1 (
    item bigint,
    price int,
    start_date timestamp,
    end_date timestamp,
    created_date timestamp,
    cost decimal,
    list decimal,
    item_id int,
    modified_date timestamp,
    status int,
    PRIMARY KEY ((item, price), start_date, end_date)
) WITH CLUSTERING ORDER BY (start_date ASC, end_date ASC)
    AND read_repair_chance = 0.0
    AND dclocal_read_repair_chance = 0.0
    AND gc_grace_seconds = 864000
    AND bloom_filter_fp_chance = 0.01
    AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
    AND comment = ''
    AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold' : 32, 'min_threshold' : 4 }
    AND compression = { 'chunk_length_in_kb' : 4, 'class' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
    AND default_time_to_live = 0
    AND speculative_retry = 'NONE'
    AND min_index_interval = 128
    AND max_index_interval = 2048
    AND crc_check_chance = 1.0
    AND cdc = false
    AND memtable_flush_period_in_ms = 0;