From: Patrick Lee <patrickclee0207@gmail.com>
Date: Wed, 16 Oct 2019 14:16:21 -0500
Subject: Re: Constant blocking read repair for such a tiny table
To: user@cassandra.apache.org

We do have otc_coalescing_strategy disabled; we ran into that a long while back, where we saw better performance with it off. And most recently we set disk_access_mode to mmap_index_only, as we have a few clusters where we would see a lot more disk IO causing high load, high CPU, and crazy-high latencies. Setting this to mmap_index_only, we've seen a lot better overall performance.

We just haven't seen this constant rate of read repairs.
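(For reference, that maps to cassandra.yaml lines roughly like the ones below. DISABLED is the value that turns coalescing off; disk_access_mode isn't in the stock yaml, but 3.11 recognizes it.)

    # disable outbound TCP message coalescing
    otc_coalescing_strategy: DISABLED
    # mmap only the index files; data files are read with buffered I/O
    disk_access_mode: mmap_index_only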
On Wed, Oct 16, 2019 at 12:57 PM ZAIDI, ASAD <az192g@att.com> wrote:

> Wondering if you've disabled otc_coalescing_strategy (CASSANDRA-12676) since you've upgraded from 2.x? Also, did you have any luck increasing native_transport_max_threads to address blocked NTRs (CASSANDRA-11363)?
>
> ~Asad
>
> From: Patrick Lee [mailto:patrickclee0207@gmail.com]
> Sent: Wednesday, October 16, 2019 12:22 PM
> To: user@cassandra.apache.org
> Subject: Re: Constant blocking read repair for such a tiny table
>
> Haven't really figured this out yet. It's not a big problem, but it is annoying for sure! The cluster was upgraded from 2.1.16 to 3.11.4. Now my only open question is whether it had this type of behavior before the upgrade; I'm leaning toward no based on my data, but I'm just not 100% sure.
>
> Just one table, out of all the ones on the cluster, has this behavior. Repair has been run a few times via Reaper. I even did a nodetool compact on the nodes (since this table is only about 1 GB per node). I just don't see why there would be any inconsistency that would trigger read repair.
>
> Any insight you may have would be appreciated! What really started this digging into the cluster was that during a stress test the application team complained about high latency (30 ms at p98). This cluster is already oversized for the use case, with only 14 GB of data per node; there is more than enough RAM, so all the data is basically cached in RAM. The only thing that stands out is this crazy read repair. So the read repair may not be my root issue, but it definitely shouldn't be happening like this.
>
> The VMs:
> 12 cores
> 82 GB RAM
> 1.2 TB local ephemeral SSDs
>
> Attached the info from one of the nodes.
>
> On Tue, Oct 15, 2019 at 2:36 PM Alain RODRIGUEZ <arodrime@gmail.com> wrote:
>
> Hello Patrick,
>
> Still in trouble with this? I must admit I'm really puzzled by your issue; I have no real idea of what's going on. Would you share with us the output of:
>
> - nodetool status
> - nodetool describecluster
> - nodetool gossipinfo
> - nodetool tpstats
>
> Also, you said the app has been running for a long time, with no changes. What about Cassandra? Any recent operations?
>
> I hope that with this information we might be able to understand better and finally be able to help.
>
> -----------------------
> Alain Rodriguez - alain@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Fri, Oct 4, 2019 at 00:25, Patrick Lee <patrickclee0207@gmail.com> wrote:
>
> This table was actually on leveled compaction before; I just changed it to size-tiered yesterday while researching this.
>
> On Thu, Oct 3, 2019 at 4:31 PM Patrick Lee <patrickclee0207@gmail.com> wrote:
>
> It's not really time series data, and it's not updated very often; it gets some updates, but pretty infrequently. This thing should be super fast: on average it's about 1 to 2 ms at p99 currently, but if they double or triple the traffic on that table, latencies go up to 20 ms to 50 ms. The only odd thing I see is that there are constant read repairs that follow the same traffic pattern as the reads, which shows up as constant writes on the table (from the read repairs). After a read repair, or just normal full repairs (all full repairs through Reaper; we never ran any incremental repair), I would expect it not to have any mismatches. The other 5 tables they use on the cluster see the same level of traffic, all very simple selects by partition key that return a single record.
>
> On Thu, Oct 3, 2019 at 4:21 PM John Belliveau <belliveau.john@gmail.com> wrote:
>
> Hi Patrick,
>
> Is this time series data? If so, I have run into issues with repair on time series data using the SizeTieredCompactionStrategy. I have had better luck using the TimeWindowCompactionStrategy.
>
> John
>
> Sent from Mail for Windows 10
>
> From: Patrick Lee
> Sent: Thursday, October 3, 2019 5:14 PM
> To: user@cassandra.apache.org
> Subject: Constant blocking read repair for such a tiny table
>
> I have a cluster that is running 3.11.4 (it was upgraded a while back from 2.1.16). What I see is a steady rate of read repair, constantly about 10%, on only this one table. Repairs have been run (actually several times). The table does not get a lot of writes, so after a repair, or even after a read repair, I would expect it to be fine. The reason I'm having to dig into this so much is that under a much larger traffic load than their normal traffic, latencies are higher than the app team wants.
>
> I mean, this thing is tiny: it's a 12x12 cluster, but this one table is only about 1 GB per node on disk.
>
> The application team is doing reads at LOCAL_QUORUM, and I can simulate this on that cluster by running a query using QUORUM and/or LOCAL_QUORUM; in the trace I can see that the query comes back with a DigestMismatchException no matter how many times I run it. That record hasn't been updated by the application for several months.
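> (Reproducing it is roughly the following in cqlsh; the key values below are placeholders rather than the real ones, and the digest mismatch shows up in the tracing output that follows the query result.)
>
>     cqlsh> CONSISTENCY LOCAL_QUORUM
>     cqlsh> TRACING ON
>     cqlsh> SELECT * FROM keyspace.table1 WHERE item = 12345 AND price = 100;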
> Repairs are scheduled and run every 7 days via Reaper; in the past week this table has been repaired at least 3 times. Every time there are mismatches and data streams back and forth, but there is still a constant rate of read repairs.
>
> Curious if anyone has any recommendations on what to look into further, or has experienced anything like this?
>
> This node has been up for 24 hours; this is the netstats for read repairs:
>
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 7481
> Mismatch (Blocking): 11425375
> Mismatch (Background): 17
> Pool Name                    Active   Pending      Completed   Dropped
> Large messages                  n/a         0           1232         0
> Small messages                  n/a         0      395903678         0
> Gossip messages                 n/a         0         603746         0
>
> Example of the schema... some modifications have been made to reduce read_repair and speculative_retry while troubleshooting:
>
> CREATE TABLE keyspace.table1 (
>     item bigint,
>     price int,
>     start_date timestamp,
>     end_date timestamp,
>     created_date timestamp,
>     cost decimal,
>     list decimal,
>     item_id int,
>     modified_date timestamp,
>     status int,
>     PRIMARY KEY ((item, price), start_date, end_date)
> ) WITH CLUSTERING ORDER BY (start_date ASC, end_date ASC)
>     AND read_repair_chance = 0.0
>     AND dclocal_read_repair_chance = 0.0
>     AND gc_grace_seconds = 864000
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
>     AND comment = ''
>     AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold' : 32, 'min_threshold' : 4 }
>     AND compression = { 'chunk_length_in_kb' : 4, 'class' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
>     AND default_time_to_live = 0
>     AND speculative_retry = 'NONE'
>     AND min_index_interval = 128
>     AND max_index_interval = 2048
>     AND crc_check_chance = 1.0
>     AND cdc = false
>     AND memtable_flush_period_in_ms = 0;
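> (Those modifications are plain table options, so they would have been applied with an ALTER TABLE along these lines; shown here only as a sketch, using 3.11 CQL syntax, with values matching the schema above.)
>
>     ALTER TABLE keyspace.table1
>     WITH read_repair_chance = 0.0
>     AND dclocal_read_repair_chance = 0.0
>     AND speculative_retry = 'NONE';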