Subject: Re: Uncaught exception on thread CounterMutationStage
From: David Salz
To: user@cassandra.apache.org
Date: Thu, 27 Jul 2017 16:41:36 +0200

Hi Jeff,

thanks for the pointers! We have upgraded to C* 3.11.0 and the situation has improved a little: the node no longer dies completely, but the WriteTimeoutExceptions persist and still 'freeze' the node for a couple of minutes.

> A single node with 20 cores and 256GB of RAM is probably not going to
> be the best choice - while it's a great machine, the default cassandra
> config really isn't tuned for that # of cores or that much RAM (it'll
> almost all be left for page cache, which is great for reads, and less
> great for write heavy workloads). What sort of heap settings are you
> using?
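(A quick way to confirm what a running node was actually started with, in case it differs from the config files; the jvm.options path below is the usual package location and is an assumption, adjust for your install:)

  # used/total heap as reported by the running node
  nodetool info | grep -i heap

  # heap sizing as configured (path assumed; -Xms/-Xmx may also be set via cassandra-env.sh)
  grep -E '^-Xm[sx]' /etc/cassandra/jvm.options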
-ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
-XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003
-XX:+AlwaysPreTouch -XX:-UseBiasedLocking -XX:+UseTLAB -XX:+ResizeTLAB
-XX:+UseNUMA -XX:+PerfDisableSharedMem -Djava.net.preferIPv4Stack=true
-XX:+UseG1GC -XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=700
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure -XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M
-Xms98304M -Xmx98304M

GC does not seem to be the issue: GC runs roughly every 30 seconds and usually finishes well below the 700 ms limit. We will enable the GC log file, though; we don't have that right now.

> You're getting timeouts on a single node cluster, which usually means
> you're in a GC spin, a thread is deadlocked, or a thread pool is backed
> up, or similar. Seeing 'nodetool tpstats' may be a starting point.
> Knowing whether the node stops processing all data at this time, or
> just some of it, would also help. You'd want to take a look for
> indications of a GC pause (GCInspector log lines, or even better actual
> GC logs), and if that doesn't work, jstack output thrown onto pastebin
> or gist or similar.

Good point. We checked tpstats and found a high number (millions) of all-time blocked Native-Transport-Requests. After some googling we now set -Dcassandra.max_queued_native_transport_requests=4096 and native_transport_max_threads=4096, and we are seeing no more blocked NTRs so far. Do you think this could have contributed to the problem? The default values seemed way too small for our load and our machine at any rate.

Again, thanks for the help so far!
David

--
-----------------------------------
Technical Director / Co-Founder
Sandbox Interactive GmbH
http://albiononline.com
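P.S. For reference, a minimal sketch of where those two settings live (paths assume a standard package install; adjust for your layout):

  # conf/cassandra-env.sh -- queue depth for native transport requests:
  JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=4096"

  # conf/cassandra.yaml -- max threads serving CQL client requests (typically 128 when unset):
  native_transport_max_threads: 4096

  # after a restart, watch the pool for newly blocked requests:
  nodetool tpstats | grep -i native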