Subject: Re: Uncaught exception on thread CounterMutationStage
From: David Salz
To: user@cassandra.apache.org
Date: Thu, 27 Jul 2017 16:41:36 +0200

Hi Jeff,

thanks for the pointers! We have upgraded to C* 3.11.0 and the situation has improved a little: the node no longer dies completely, but the WriteTimeoutExceptions persist and still 'freeze' the node for a couple of minutes.

> A single node with 20 cores and 256GB of RAM is probably not going to
> be the best choice - while it's a great machine, the default cassandra
> config really isn't tuned for that # of cores or that much RAM (it'll
> almost all be left for page cache, which is great for reads, and less
> great for write heavy workloads). What sort of heap settings are you
> using?
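(A quick way to confirm what a running node was actually started with, in case it differs from the config files; the jvm.options path below is the usual package location and is an assumption, adjust for your install:)

  # used/total heap as reported by the running node
  nodetool info | grep -i heap

  # heap sizing as configured (path assumed; -Xms/-Xmx may also be set via cassandra-env.sh)
  grep -E '^-Xm[sx]' /etc/cassandra/jvm.options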
-ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
-XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003
-XX:+AlwaysPreTouch -XX:-UseBiasedLocking -XX:+UseTLAB -XX:+ResizeTLAB
-XX:+UseNUMA -XX:+PerfDisableSharedMem -Djava.net.preferIPv4Stack=true
-XX:+UseG1GC -XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=700
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure -XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M
-Xms98304M -Xmx98304M

GC does not seem to be the issue: GC runs roughly every 30 seconds and usually finishes well below the 700 ms limit. We will enable the GC log file, though; we don't have that right now.

> You're getting timeouts on a single node cluster, which usually means
> you're in a GC spin, a thread is deadlocked, or a thread pool is backed
> up, or similar. Seeing 'nodetool tpstats' may be a starting point.
> Knowing whether the node stops processing all data at this time, or
> just some of it, would also help. You'd want to take a look for
> indications of a GC pause (GCInspector log lines, or even better actual
> GC logs), and if that doesn't work, jstack output thrown onto pastebin
> or gist or similar.

Good point. We checked tpstats and found a high number (millions) of all-time blocked Native-Transport-Requests. After some googling we now set -Dcassandra.max_queued_native_transport_requests=4096 and native_transport_max_threads=4096, and we are seeing no more blocked NTRs so far. Do you think this could have contributed to the problem? The default values seemed way too small for our load and our machine at any rate.

Again, thanks for the help so far!
David

--
-----------------------------------
Technical Director / Co-Founder
Sandbox Interactive GmbH
http://albiononline.com
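P.S. For reference, a minimal sketch of where those two settings live (paths assume a standard package install; adjust for your layout):

  # conf/cassandra-env.sh -- queue depth for native transport requests:
  JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=4096"

  # conf/cassandra.yaml -- max threads serving CQL client requests (typically 128 when unset):
  native_transport_max_threads: 4096

  # after a restart, watch the pool for newly blocked requests:
  nodetool tpstats | grep -i native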