From: Ben Mills <ben@bitbrew.com>
Date: Sat, 26 Oct 2019 13:02:21 -0400
Subject: Re: Repair Issues
To: user@cassandra.apache.org

Thanks Ghiyasi.

On Sat, Oct 26, 2019 at 9:17 AM Hossein Ghiyasi Mehr wrote:

> If the problem still exists and all nodes are up, reboot them one by one.
> Then try to repair one node. After that, repair the other nodes one by one.
>
> On Fri, Oct 25, 2019 at 12:56 AM Ben Mills wrote:
>
>> Thanks Jon!
>>
>> This is very helpful - allow me to follow up and ask a question.
>>
>> (1) Yes, incremental repairs will never be used (unless they become
>> viable in Cassandra 4.x someday).
>> (2) I hear you on the JVM - will look into that.
>> (3) Been looking at Cassandra 3.11.x, though I was unaware that 3.7 is
>> considered non-viable for production use.
>>
>> For (4) - Question/Request:
>>
>> Note that with:
>>
>> -XX:MaxRAMFraction=2
>>
>> the actual amount of memory allocated for heap space is effectively 2Gi
>> (i.e. half of the 4Gi allocated on the machine type). We can definitely
>> increase memory (for heap and non-heap), though can you expand a bit on
>> your heap comment to help my understanding (as this is such a small
>> cluster with such a small amount of data at rest)?
>>
>> Thanks again.
>>
>> On Thu, Oct 24, 2019 at 5:11 PM Jon Haddad wrote:
>>
>>> There are some major warning signs for me with your environment. A 4GB
>>> heap is too low, and Cassandra 3.7 isn't something I would put into
>>> production.
>>>
>>> Your surface area for problems is massive right now. Things I'd do:
>>>
>>> 1. Never use incremental repair. Seems like you've already stopped
>>> doing them, but it's worth mentioning.
>>> 2. Upgrade to the latest JVM; that version's way out of date.
>>> 3. Upgrade to Cassandra 3.11.latest (we're voting on 3.11.5 right now).
>>> 4. Increase memory to 8GB minimum, preferably 12.
>>>
>>> I usually don't like making a bunch of changes without knowing the root
>>> cause of a problem, but in your case there are so many potential
>>> problems that I don't think it's worth digging in, especially since the
>>> problem might be one of the 500 or so bugs that were fixed since this
>>> release.
>>>
>>> Once you've done those things it'll be easier to narrow down the
>>> problem.
>>>
>>> Jon
>>>
>>> On Thu, Oct 24, 2019 at 4:59 PM Ben Mills wrote:
>>>
>>>> Hi Sergio,
>>>>
>>>> No, not at this time.
>>>>
>>>> It was in use with this cluster previously, and while there were no
>>>> reaper-specific issues, it was removed to help simplify investigation
>>>> of the underlying repair issues I've described.
>>>>
>>>> Thanks.
>>>>
>>>> On Thu, Oct 24, 2019 at 4:21 PM Sergio wrote:
>>>>
>>>>> Are you using Cassandra Reaper?
>>>>>
>>>>> On Thu, Oct 24, 2019, 12:31 PM Ben Mills wrote:
>>>>>
>>>>>> Greetings,
>>>>>>
>>>>>> Inherited a small Cassandra cluster with some repair issues and need
>>>>>> some advice on recommended next steps. Apologies in advance for a
>>>>>> long email.
>>>>>>
>>>>>> Issue:
>>>>>>
>>>>>> Intermittent repair failures on two non-system keyspaces:
>>>>>>
>>>>>> - platform_users
>>>>>> - platform_management
>>>>>>
>>>>>> Repair Type:
>>>>>>
>>>>>> Full, parallel repairs are run on each of the three nodes every five
>>>>>> days.
>>>>>>
>>>>>> Repair command output for a typical failure:
>>>>>>
>>>>>> [2019-10-18 00:22:09,109] Starting repair command #46, repairing
>>>>>> keyspace platform_users with repair options (parallelism: parallel,
>>>>>> primary range: false, incremental: false, job threads: 1,
>>>>>> ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 12)
>>>>>> [2019-10-18 00:22:09,242] Repair session
>>>>>> 5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
>>>>>> [(-1890954128429545684,2847510199483651721],
>>>>>> (8249813014782655320,-8746483007209345011],
>>>>>> (4299912178579297893,6811748355903297393],
>>>>>> (-8746483007209345011,-8628999431140554276],
>>>>>> (-5865769407232506956,-4746990901966533744],
>>>>>> (-4470950459111056725,-1890954128429545684],
>>>>>> (4001531392883953257,4299912178579297893],
>>>>>> (6811748355903297393,6878104809564599690],
>>>>>> (6878104809564599690,8249813014782655320],
>>>>>> (-4746990901966533744,-4470950459111056725],
>>>>>> (-8628999431140554276,-5865769407232506956],
>>>>>> (2847510199483651721,4001531392883953257]] failed with error [repair
>>>>>> #5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2,
>>>>>> [(-1890954128429545684,2847510199483651721],
>>>>>> (8249813014782655320,-8746483007209345011],
>>>>>> (4299912178579297893,6811748355903297393],
>>>>>> (-8746483007209345011,-8628999431140554276],
>>>>>> (-5865769407232506956,-4746990901966533744],
>>>>>> (-4470950459111056725,-1890954128429545684],
>>>>>> (4001531392883953257,4299912178579297893],
>>>>>> (6811748355903297393,6878104809564599690],
>>>>>> (6878104809564599690,8249813014782655320],
>>>>>> (-4746990901966533744,-4470950459111056725],
>>>>>> (-8628999431140554276,-5865769407232506956],
>>>>>> (2847510199483651721,4001531392883953257]]] Validation failed in
>>>>>> /10.x.x.x (progress: 26%)
>>>>>> [2019-10-18 00:22:09,246] Some repair failed
>>>>>> [2019-10-18 00:22:09,248] Repair command #46 finished in 0 seconds
>>>>>>
>>>>>> Additional Notes:
>>>>>>
>>>>>> Repairs encounter the above failures more often than not - sometimes
>>>>>> on one node only, though occasionally on two; sometimes on just one of
>>>>>> the two keyspaces, sometimes on both. Apparently the previous repair
>>>>>> schedule for this cluster included incremental repairs (a script
>>>>>> alternated between incremental and full repairs). After reading this
>>>>>> TLP article:
>>>>>>
>>>>>> https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
>>>>>>
>>>>>> the repair script was replaced with cassandra-reaper (v1.4.0), which
>>>>>> was run with its default configs. Reaper worked, but it only obscured
>>>>>> the ongoing issues (it did not resolve them) and complicated the
>>>>>> debugging process, so it was removed. The current repair schedule is
>>>>>> as described above under Repair Type.
>>>>>>
>>>>>> Attempts at Resolution:
>>>>>>
>>>>>> (1) nodetool scrub was attempted on the offending keyspaces/tables, to
>>>>>> no effect.
>>>>>>
>>>>>> (2) sstablescrub has not been attempted due to the current design of
>>>>>> the Docker image that runs Cassandra in each Kubernetes pod - i.e.
>>>>>> there is no way to stop the server to run this offline utility without
>>>>>> killing the only pid running in the container (an untested workaround
>>>>>> is sketched just below).
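>>>>>>
>>>>>> For completeness, here is a rough, untested sketch of how sstablescrub
>>>>>> might still be run in this setup. It assumes (purely as placeholders)
>>>>>> that the pods are named cassandra-0/1/2 and that the container's
>>>>>> command can be temporarily overridden with a no-op such as
>>>>>> "sleep infinity", so that Cassandra itself is not running as pid 1
>>>>>> while the data volume stays mounted:
>>>>>>
>>>>>> # Untested sketch - pod names are placeholders, and the pod is assumed
>>>>>> # to have been restarted with a no-op command so the server is stopped.
>>>>>> kubectl exec -it cassandra-0 -- sstablescrub platform_users access_tokens_v2
>>>>>> kubectl exec -it cassandra-0 -- sstablescrub platform_management device_by_tenant_v2
>>>>>>
>>>>>> Once the scrub completes, the original command/entrypoint would be
>>>>>> restored so the node rejoins the cluster.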
>>>>>>
>>>>>> Related Error:
>>>>>>
>>>>>> Not sure if this is related, though sometimes, when either:
>>>>>>
>>>>>> (a) running nodetool snapshot, or
>>>>>> (b) rolling a pod that runs a Cassandra node, which calls nodetool
>>>>>> drain prior to shutdown,
>>>>>>
>>>>>> the following error is thrown:
>>>>>>
>>>>>> -- StackTrace --
>>>>>> java.lang.RuntimeException: Last written key
>>>>>> DecoratedKey(10df3ba1-6eb2-4c8e-bddd-c0c7af586bda,
>>>>>> 10df3ba16eb24c8ebdddc0c7af586bda) >= current key
>>>>>> DecoratedKey(00000000-0000-0000-0000-000000000000,
>>>>>> 17343121887f480c9ba87c0e32206b74) writing into
>>>>>> /cassandra_data/data/platform_management/device_by_tenant_v2-e91529202ccf11e7ab96d5693708c583/.device_by_tenant_tags_idx/mb-45-big-Data.db
>>>>>>         at org.apache.cassandra.io.sstable.format.big.BigTableWriter.beforeAppend(BigTableWriter.java:114)
>>>>>>         at org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:153)
>>>>>>         at org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
>>>>>>         at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:441)
>>>>>>         at org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:477)
>>>>>>         at org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:363)
>>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>         at java.lang.Thread.run(Thread.java:748)
>>>>>>
>>>>>> Here are some details on the environment and configs in the event
>>>>>> that something is relevant.
>>>>>>
>>>>>> Environment: Kubernetes
>>>>>> Environment Config: StatefulSet of 3 replicas
>>>>>> Storage: Persistent Volumes
>>>>>> Storage Class: SSD
>>>>>> Node OS: Container-Optimized OS
>>>>>> Container OS: Ubuntu 16.04.3 LTS
>>>>>>
>>>>>> Version: Cassandra 3.7
>>>>>> Data Centers: 1
>>>>>> Racks: 3 (one per zone)
>>>>>> Nodes: 3
>>>>>> Tokens: 4
>>>>>> Replication Factor: 3
>>>>>> Replication Strategy: NetworkTopologyStrategy (all keyspaces)
>>>>>> Compaction Strategy: STCS (all tables)
>>>>>> Read/Write Requirements: blend of both
>>>>>> Data Load: <1GB per node
>>>>>> gc_grace_seconds: default (10 days, all tables)
>>>>>>
>>>>>> Memory: 4Gi per node
>>>>>> CPU: 3.5 per node (3500m)
>>>>>>
>>>>>> Java Version: 1.8.0_144
>>>>>>
>>>>>> Heap Settings:
>>>>>>
>>>>>> -XX:+UnlockExperimentalVMOptions
>>>>>> -XX:+UseCGroupMemoryLimitForHeap
>>>>>> -XX:MaxRAMFraction=2
>>>>>>
>>>>>> GC Settings: (CMS)
>>>>>>
>>>>>> -XX:+UseParNewGC
>>>>>> -XX:+UseConcMarkSweepGC
>>>>>> -XX:+CMSParallelRemarkEnabled
>>>>>> -XX:SurvivorRatio=8
>>>>>> -XX:MaxTenuringThreshold=1
>>>>>> -XX:CMSInitiatingOccupancyFraction=75
>>>>>> -XX:+UseCMSInitiatingOccupancyOnly
>>>>>> -XX:CMSWaitDuration=30000
>>>>>> -XX:+CMSParallelInitialMarkEnabled
>>>>>> -XX:+CMSEdenChunksRecordAlways
>>>>>>
>>>>>> Any ideas are much appreciated.
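>>>>>>
>>>>>> For clarity on the heap settings above (illustrative only, not a
>>>>>> config we run): with the pod's 4Gi memory limit,
>>>>>> -XX:+UseCGroupMemoryLimitForHeap plus -XX:MaxRAMFraction=2 resolves to
>>>>>> a maximum heap of roughly half the container limit, i.e. about the
>>>>>> same ceiling as specifying it directly:
>>>>>>
>>>>>> # Illustrative equivalent of the current ceiling - not our settings:
>>>>>> # 4Gi container limit / MaxRAMFraction 2  =>  ~2G max heap
>>>>>> -Xmx2G
>>>>>>
>>>>>> Setting -Xms/-Xmx explicitly (and sizing them up along with the pod's
>>>>>> memory limit) would make the heap independent of the MaxRAMFraction
>>>>>> calculation.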

--
Ben Mills
DevOps Engineer