Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 6C3581739B for ; Fri, 24 Oct 2014 19:54:15 +0000 (UTC) Received: (qmail 82206 invoked by uid 500); 24 Oct 2014 19:54:12 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 82172 invoked by uid 500); 24 Oct 2014 19:54:12 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 82162 invoked by uid 99); 24 Oct 2014 19:54:12 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 24 Oct 2014 19:54:12 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of graham@vast.com designates 209.85.212.170 as permitted sender) Received: from [209.85.212.170] (HELO mail-wi0-f170.google.com) (209.85.212.170) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 24 Oct 2014 19:53:46 +0000 Received: by mail-wi0-f170.google.com with SMTP id n3so1998568wiv.3 for ; Fri, 24 Oct 2014 12:52:15 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:from:content-type:message-id:mime-version :subject:date:references:to:in-reply-to; bh=CE5goYQETR7maLDPK0Kx+jmUQyyjmPv5CvMJ0ubsDuY=; b=aveBL5zMysnO9eJKmcE103rq3FabpwR8xiQMtEvQJAMkbaz0uXfg1xOx77h7yiG29j 9p+5OnQ0nMkZfW+BVHPjthd7AVqMjMtZpHUk5J+5/SD8oVlorD/j+l/Uj+2t4IyeNpFC viw3YeOmZe4ucf/3ddvM2XQlDDBhR4i1GkM9z/uxy1Nr/M3h0yJH649nm4e+erl2e/Eo ztqx24vMQspDbcUJjdpAMeY1UespFH/PeuXJEz4QjifqJLLSja99Ppc5EmVYeQ/9IvdV xj9sj2gN8bDznWn7E1xPDEwWWNON7a7oD5sTq8yhQQ/nRxj3XxQgrxXpK0l3ldXDPj4Z gahw== X-Gm-Message-State: ALoCoQn2d5COtarlkuI04oZkCZfBjxyuqq6ehXDXfBqlgFjV+80OVGAAm9yvIc/8PmTncBsFhGYf X-Received: by 10.194.21.106 with SMTP id u10mr5184888wje.109.1414180335160; Fri, 24 Oct 2014 12:52:15 -0700 (PDT) Received: from [192.168.1.112] (cpe-70-113-52-246.austin.res.rr.com. [70.113.52.246]) by mx.google.com with ESMTPSA id gw6sm3053066wib.8.2014.10.24.12.52.13 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Fri, 24 Oct 2014 12:52:14 -0700 (PDT) From: graham sanderson Content-Type: multipart/signed; boundary="Apple-Mail=_B70FAF36-7519-4FFE-906B-C9D42316DF29"; protocol="application/pkcs7-signature"; micalg=sha1 Message-Id: <3461E2FD-C117-4CC9-95D9-5B7E67F1583B@vast.com> Mime-Version: 1.0 (Mac OS X Mail 8.0 \(1990.1\)) Subject: Re: Intermittent long application pauses on nodes Date: Fri, 24 Oct 2014 14:52:11 -0500 References: <6AB17138-2316-408A-B4E5-6D19C0596A67@vast.com> To: user@cassandra.apache.org In-Reply-To: <6AB17138-2316-408A-B4E5-6D19C0596A67@vast.com> X-Mailer: Apple Mail (2.1990.1) X-Virus-Checked: Checked by ClamAV on apache.org --Apple-Mail=_B70FAF36-7519-4FFE-906B-C9D42316DF29 Content-Type: multipart/alternative; boundary="Apple-Mail=_F343178C-0756-4A2B-855B-9643B246D149" --Apple-Mail=_F343178C-0756-4A2B-855B-9643B246D149 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 And -XX:SafepointTimeoutDelay=3Dxxx to set how long before it dumps output (defaults to 10000 I believe)=E2=80= =A6 Note it doesn=E2=80=99t actually timeout by default, it just prints the = problematic threads after that time and keeps on waiting > On Oct 24, 2014, at 2:44 PM, graham sanderson wrote: >=20 > Actually - there is=20 >=20 > -XX:+SafepointTimeout >=20 > which will print out offending threads (assuming you reach a 10 second = pause)=E2=80=A6 >=20 > That is probably your best bet. >=20 >> On Oct 24, 2014, at 2:38 PM, graham sanderson > wrote: >>=20 >> This certainly sounds like a JVM bug. >>=20 >> We are running C* 2.0.9 on pretty high end machines with pretty large = heaps, and don=E2=80=99t seem to have seen this (note we are on 7u67, so = that might be an interesting data point, though since the old thread = predated that probably not) >>=20 >> 1) =46rom the app/java side, I=E2=80=99d obviously see if you can = identify anything which always coincides with this - repair, compaction = etc >> 2) =46rom the VM side (given that this as Benedict mentioned) some = threads are taking a long time to rendezvous at the safe point, and it = is probably not application threads, I=E2=80=99d look what GC threads, = compiler threads etc might be doing. As mentioned it shouldn=E2=80=99t = be anything to do with operations which run at a safe point anyway (e.g. = scavenge) >> a) So look at what CMS is doing at the time and see if you can = correlate >> b) Check Oracle for related bugs - didn=E2=80=99t obviously see = any, but there have been some complaints related to compilation and safe = points >> c) Add any compilation tracing you can >> d) Kind of important here - see if you can figure out via = dtrace, system tap, gdb or whatever, what the threads are doing when = this happens. Sadly it doesn=E2=80=99t look like you can figure out when = this is happening (until afterwards) unless you have access to a debug = JVM build (and can turn on -XX:+TraceSafepoint and look for a safe point = start without a corresponding update within a time period) - if you = don=E2=80=99t have access to that, I guess you could try and get a dump = every 2-3 seconds (you should catch a 9 second pause eventually!) >>=20 >>> On Oct 24, 2014, at 12:35 PM, Dan van Kley > wrote: >>>=20 >>> I'm also curious to know if this was ever resolved or if there's any = other recommended steps to take to continue to track it down. I'm seeing = the same issue in our production cluster, which is running Cassandra = 2.0.10 and JVM 1.7u71, using the CMS collector. Just as described above, = the issue is long "Total time for which application threads were = stopped" pauses that are not a direct result of GC pauses (ParNew, = initial mark or remark). When I enabled the safepoint logging I saw the = same result, long "sync" pause times with short spin and block times, = usually with the "RevokeBias" description. We're seeing pause times = sometimes in excess of 10 seconds, so it's a pretty debilitating issue. = Our machines are not swapping (or even close to it) or having other load = issues when these pauses occur. Any ideas would be very appreciated. = Thanks! >>=20 >=20 --Apple-Mail=_F343178C-0756-4A2B-855B-9643B246D149 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8 And -XX:SafepointTimeoutDelay=3Dxxx

to set how long before it dumps output = (defaults to 10000 I believe)=E2=80=A6

Note it doesn=E2=80=99t actually = timeout by default, it just prints the problematic threads after that = time and keeps on waiting

On = Oct 24, 2014, at 2:44 PM, graham sanderson <graham@vast.com> = wrote:

Actually - = there is 

-XX:+SafepointTimeout

which will print out offending threads = (assuming you reach a 10 second pause)=E2=80=A6

That is probably your best = bet.

On Oct 24, 2014, at 2:38 PM, = graham sanderson <graham@vast.com> wrote:

This certainly = sounds like a JVM bug.

We are running C* 2.0.9 on pretty high = end machines with pretty large heaps, and don=E2=80=99t seem to have = seen this (note we are on 7u67, so that might be an interesting data = point, though since the old thread predated that probably not)

1) =46rom the app/java = side, I=E2=80=99d obviously see if you can identify anything which = always coincides with this - repair, compaction etc
2) =46rom the VM side (given that this as Benedict mentioned) = some threads are taking a long time to rendezvous at the safe point, and = it is probably not application threads, I=E2=80=99d look what GC = threads, compiler threads etc might be doing. As mentioned it = shouldn=E2=80=99t be anything to do with operations which run at a safe = point anyway (e.g. scavenge)
a) So = look at what CMS is doing at the time and see if you can = correlate
b) Check Oracle for related bugs = - didn=E2=80=99t obviously see any, but there have been some complaints = related to compilation and safe points
c) Add = any compilation tracing you can
d) Kind = of important here - see if you can figure out via dtrace, system tap, = gdb or whatever, what the threads are doing when this happens. Sadly it = doesn=E2=80=99t look like you can figure out when this is happening = (until afterwards) unless you have access to a debug JVM build (and can = turn on -XX:+TraceSafepoint and look for a safe point start without a = corresponding update within a time period) - if you don=E2=80=99t have = access to that, I guess you could try and get a dump every 2-3 seconds = (you should catch a 9 second pause eventually!)

On Oct 24, 2014, at 12:35 PM, Dan van Kley <dvankley@salesforce.com> wrote:

I'm also curious to know if this was ever resolved or if = there's any other recommended steps to take to continue to track it = down. I'm seeing the same issue in our production cluster, which is = running Cassandra 2.0.10 and JVM 1.7u71, using the CMS collector. Just = as described above, the issue is long "Total time for which application = threads were stopped" pauses that are not a direct result of GC pauses = (ParNew, initial mark or remark). When I enabled the safepoint logging I = saw the same result, long "sync" pause times with short spin and block = times, usually with the "RevokeBias" description. We're seeing pause = times sometimes in excess of 10 seconds, so it's a pretty debilitating = issue. Our machines are not swapping (or even close to it) or having = other load issues when these pauses occur. Any ideas would be very = appreciated. Thanks!



= --Apple-Mail=_F343178C-0756-4A2B-855B-9643B246D149-- --Apple-Mail=_B70FAF36-7519-4FFE-906B-C9D42316DF29 Content-Disposition: attachment; filename=smime.p7s Content-Type: application/pkcs7-signature; name=smime.p7s Content-Transfer-Encoding: base64 MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIICuzCCArcw ggIgAgIBTDANBgkqhkiG9w0BAQUFADCBojELMAkGA1UEBhMCVVMxCzAJBgNVBAgTAk9SMREwDwYD VQQHEwhQb3J0bGFuZDEWMBQGA1UEChMNT21uaS1FeHBsb3JlcjEWMBQGA1UECxMNSVQgRGVwYXJ0 bWVudDEbMBkGA1UEAxMSd3d3LmNvcm5lcmNhc2UuY29tMSYwJAYJKoZIhvcNAQkBFhdibG9ja291 dEBjb3JuZXJjYXNlLmNvbTAeFw0xMTA0MDYxNjE0MzFaFw0yMTA0MDMxNjE0MzFaMIGjMQswCQYD VQQGEwJVUzETMBEGA1UECBMKQ2FsaWZvcm5pYTEWMBQGA1UEBxMNU2FuIEZyYW5jaXNjbzEWMBQG A1UEChMNVmFzdC5jb20gSW5jLjEUMBIGA1UECxMLRW5naW5lZXJpbmcxGTAXBgNVBAMTEEdyYWhh bSBTYW5kZXJzb24xHjAcBgkqhkiG9w0BCQEWD2dyYWhhbUB2YXN0LmNvbTCBnzANBgkqhkiG9w0B AQEFAAOBjQAwgYkCgYEAm4K/W/0VdaOiS6tC1G8tSCAw989XCsJXxVPiny/hND6T0jVv4vP0JRiO vNzH6uoINoKQfgUKa+GCqILdY7Jdx61/WKqxltFTu5D0H8sFFNIKgf9cd3yU6t2susKrxaDXRCul pmcJ3AFg4xuG3ZUZt+XTYhBebQfjwgGQh3/pkQUCAwEAATANBgkqhkiG9w0BAQUFAAOBgQCKW+hQ JqNkPRht5fl8FHku80BLAH9ezEJtZJ6EU9fcK9jNPkAJgSEgPXQ++jE+4iYI2nIb/h5RILUxd1Ht m/yZkNRUVCg0+0Qj6aMT/hfOT0kdP8/9OnbmIp2T6qvNN2rAGU58tt3cbuT2j3LMTS2VOGykK4He iNYYqr+K6sPDHTGCAy0wggMpAgEBMIGpMIGiMQswCQYDVQQGEwJVUzELMAkGA1UECBMCT1IxETAP BgNVBAcTCFBvcnRsYW5kMRYwFAYDVQQKEw1PbW5pLUV4cGxvcmVyMRYwFAYDVQQLEw1JVCBEZXBh cnRtZW50MRswGQYDVQQDExJ3d3cuY29ybmVyY2FzZS5jb20xJjAkBgkqhkiG9w0BCQEWF2Jsb2Nr b3V0QGNvcm5lcmNhc2UuY29tAgIBTDAJBgUrDgMCGgUAoIIB2TAYBgkqhkiG9w0BCQMxCwYJKoZI hvcNAQcBMBwGCSqGSIb3DQEJBTEPFw0xNDEwMjQxOTUyMTFaMCMGCSqGSIb3DQEJBDEWBBTKOe9F fmckDWRtYaQ0xoqsk8ZZbDCBugYJKwYBBAGCNxAEMYGsMIGpMIGiMQswCQYDVQQGEwJVUzELMAkG A1UECBMCT1IxETAPBgNVBAcTCFBvcnRsYW5kMRYwFAYDVQQKEw1PbW5pLUV4cGxvcmVyMRYwFAYD VQQLEw1JVCBEZXBhcnRtZW50MRswGQYDVQQDExJ3d3cuY29ybmVyY2FzZS5jb20xJjAkBgkqhkiG 9w0BCQEWF2Jsb2Nrb3V0QGNvcm5lcmNhc2UuY29tAgIBTDCBvAYLKoZIhvcNAQkQAgsxgayggakw gaIxCzAJBgNVBAYTAlVTMQswCQYDVQQIEwJPUjERMA8GA1UEBxMIUG9ydGxhbmQxFjAUBgNVBAoT DU9tbmktRXhwbG9yZXIxFjAUBgNVBAsTDUlUIERlcGFydG1lbnQxGzAZBgNVBAMTEnd3dy5jb3Ju ZXJjYXNlLmNvbTEmMCQGCSqGSIb3DQEJARYXYmxvY2tvdXRAY29ybmVyY2FzZS5jb20CAgFMMA0G CSqGSIb3DQEBAQUABIGAeUHI5/sA1ks93SsyI8imTnGliAqE8inLxzwn1/MaOtCMYZLkj19BhbHJ cYTGh9wjgrbHM56OHS6m8EiTW80k8I/tbemU4KUaVqJt//UUA+kS/6P8DaRdV8enrGQ+Tncvx3T7 /MwMINREELVxwwVxSNBC1SndxmlQSvR7ncdCnBgAAAAAAAA= --Apple-Mail=_B70FAF36-7519-4FFE-906B-C9D42316DF29--