Subject: Re: GC pauses affecting entire cluster
From: graham sanderson
Date: Mon, 1 Jun 2015 16:29:03 -0500
To: user@cassandra.apache.org
Cc: Anuj Wadehra

Yes, native objects (memtable_allocation_type: offheap_objects) is the way to go… you can tell if memtables are your problem because you'll see promotion failures of objects sized 131074 dwords.

If your h/w is fast enough, make your young gen as big as possible - we can always collect an 8G young gen in under a second, and this gives you the best chance of keeping transient objects (especially if you still have thrift clients) from leaking into the old gen. Moving to 2.1.x (and off-heap memtables) from 2.0.x, we have reduced our old gen from 16G to 12G and will keep shrinking it, but have had no promotion failures yet, and it's been several months.

Note we are running a patched 2.1.3, but 2.1.5 has the equivalent important bugs fixed (the ones that might have given you memory issues).
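For anyone who wants to try the same shape of heap, here is a minimal cassandra-env.sh sketch of what Graham describes; the 8G young gen and the roughly 12G old gen are his figures, while the total heap size is just their sum and should be sized to your own hardware:

# cassandra-env.sh (sketch): a large young gen lets transient request and
# memtable garbage die in minor collections instead of being promoted
MAX_HEAP_SIZE="20G"   # illustrative total: ~12G old gen + 8G young gen
HEAP_NEWSIZE="8G"     # Graham: an 8G young gen collects in under a second on fast hardware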

On Jun 1, 2015, at 3:00 PM, Carl Hu <me@carlhu.com> wrote:

Thank you for the suggestion. After analysis of your settings, the basic hypothesis here is very quick promotion to the old gen because of a rapid accumulation of heap usage due to memtables. We happen to be running on 2.1, and I thought a more conservative approach than your (quite aggressive) GC settings is to try the new memtable_allocation_type with offheap_objects and see if the memtable pressure is relieved enough that the standard GC settings can keep up.

The experiment is in progress and I will report back with the results.
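For readers following along, the change Carl is testing is a single cassandra.yaml setting on 2.1; a sketch (the commented-out space caps are optional and, if left unset, default to a quarter of the heap):

# cassandra.yaml (2.1+): memtable cell data allocated off the Java heap
memtable_allocation_type: offheap_objects
# memtable_heap_space_in_mb: 2048      # optional explicit caps; defaults are 1/4 of the heap
# memtable_offheap_space_in_mb: 2048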

On Mon, Jun 1, 2015 at 10:20 AM, Anuj Wadehra <anujw_2003@yahoo.co.in> wrote:
We have a write-heavy workload and used to face promotion failures/long GC pauses with Cassandra 2.0.x. I am not into the code yet, but I think that memtable- and compaction-related objects have a mid-life span, and a write-heavy workload is not well suited to generational collection with the defaults. So we tuned the JVM to make sure that as few objects as possible are promoted to the old gen, and achieved great success with that:
MAX_HEAP_SIZE="12G"
HEAP_NEWSIZE="3G"
-XX:SurvivorRatio=2
-XX:MaxTenuringThreshold=20
-XX:CMSInitiatingOccupancyFraction=70
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=20"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=30000"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=2000"
JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
We also think that the default memtable space of 1/4 of the heap (total_memtable_space_in_mb) is too much for write-heavy loads; by default the young gen is also 1/4 of the heap. We reduced the memtable space to 1000 MB in order to make sure that memtable-related objects don't stay in memory for too long. Combining this with SurvivorRatio=2 and MaxTenuringThreshold=20 did the job well. GC was very consistent; no Full GC observed.
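Concretely, the quarter-of-heap defaults Anuj is overriding look roughly like this (a sketch against stock 2.0.x config files, where the yaml key is spelled memtable_total_space_in_mb):

# cassandra.yaml (2.0.x): cap memtable space well below the 1/4-heap default
memtable_total_space_in_mb: 1000

# cassandra-env.sh: heap and young gen from the settings above
MAX_HEAP_SIZE="12G"
HEAP_NEWSIZE="3G"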

Environment: 3-node cluster, with each node having 24 cores, 64 GB RAM, and SSDs in RAID 5.
We are making around 12k writes/sec across 5 column families (one with 4 secondary indexes) and 2,300 reads/sec on each node of the 3-node cluster. 2 CFs have wide rows with a maximum of around 100 MB of data per row.

Yes, a node being marked down has a cascading effect: within seconds, all nodes in our cluster are marked down.

Thanks
Anuj Wadehra



On Monday, 1 June 2015 7:12 PM, Carl Hu <me@carlhu.com> wrote:

We are running Cassandra version 2.1.5.469 on 15 nodes and are experiencing a problem where the entire cluster slows down for 2.5 minutes when one node experiences a 17-second stop-the-world GC. These GCs happen once every 2 hours. I did find a ticket that seems related to this: https://issues.apache.org/jira/browse/CASSANDRA-3853, but Jonathan Ellis has resolved this ticket.

We are running standard GC settings, but my concern is not so much the 17-second GC on a single node (after all, we have 14 others) as the cascading performance problem.

We are running standard values of dynamic_snitch_badness_threshold (0.1) and phi_convict_threshold (8). (These values are relevant to the dynamic snitch routing requests away from the frozen node, and to the failure detector marking the node as 'down'.)
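For reference, the two defaults Carl mentions as they appear in cassandra.yaml (a sketch; tightening either one is a lever for shedding a paused node faster, at the cost of more spurious 'down' markings):

# cassandra.yaml (stock defaults)
dynamic_snitch_badness_threshold: 0.1   # how much worse a replica must score before requests are routed away from it
phi_convict_threshold: 8                # lower values make the failure detector mark an unresponsive node down sooner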

We use the Python client in default round-robin mode, so all clients hit the coordinators on all nodes in round robin. One theory is that since the coordinator on every node must hit the frozen node at some point during the 17 seconds, each node's request queue fills up and the entire cluster thus freezes up. That would explain a 17-second freeze, but it would not explain the 2.5-minute slowdown (a 10x increase in request latency at P50).
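One way to take the default round-robin coordinator choice out of the picture is token-aware routing in the DataStax Python driver. A minimal sketch (the contact points and keyspace name are hypothetical): each statement is sent to a replica that owns its partition key, so a paused node should only stall the token ranges it owns rather than every coordinator queue in the ring.

from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Route each statement to a replica for its partition key instead of
# round-robining every request through every coordinator.
cluster = Cluster(
    contact_points=['10.0.0.1', '10.0.0.2'],                        # hypothetical seeds
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()),
)
session = cluster.connect('my_keyspace')                            # hypothetical keyspace
rows = session.execute("SELECT release_version FROM system.local")

Note that token awareness only applies to statements whose routing key the driver can determine (prepared statements, or statements with an explicit routing key); anything else falls back to the wrapped round-robin policy.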

I'd love your thoughts. I've provided the GC chart here.

Carl

[attached image: d2c95dce-0848-11e5-91f7-6b223349fc14.png (GC chart)]



