Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of graham@vast.com designates
 209.85.212.170 as permitted sender)
From: graham sanderson <graham@vast.com>
Content-Type: multipart/signed;
 boundary="Apple-Mail=_B70FAF36-7519-4FFE-906B-C9D42316DF29";
 protocol="application/pkcs7-signature"; micalg=sha1
Message-Id: <3461E2FD-C117-4CC9-95D9-5B7E67F1583B@vast.com>
Mime-Version: 1.0 (Mac OS X Mail 8.0 \(1990.1\))
Subject: Re: Intermittent long application pauses on nodes
Date: Fri, 24 Oct 2014 14:52:11 -0500
References: 
 <CAETT45gEpp_5MWeQ-xnih3dRHDrg1LQR3hAK4ygPwU_o0pCMxg@mail.gmail.com>
 <DBB5E11C-2F14-4086-905B-896971D61F7E@vast.com>
 <6AB17138-2316-408A-B4E5-6D19C0596A67@vast.com>
To: user@cassandra.apache.org
In-Reply-To: <6AB17138-2316-408A-B4E5-6D19C0596A67@vast.com>


--Apple-Mail=_B70FAF36-7519-4FFE-906B-C9D42316DF29
Content-Type: multipart/alternative;
	boundary="Apple-Mail=_F343178C-0756-4A2B-855B-9643B246D149"


--Apple-Mail=_F343178C-0756-4A2B-855B-9643B246D149
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=utf-8

And -XX:SafepointTimeoutDelay=3Dxxx

to set how long before it dumps output (defaults to 10000 I believe)=E2=80=
=A6

Note it doesn=E2=80=99t actually timeout by default, it just prints the =
problematic threads after that time and keeps on waiting

> On Oct 24, 2014, at 2:44 PM, graham sanderson <graham@vast.com> wrote:
>=20
> Actually - there is=20
>=20
> -XX:+SafepointTimeout
>=20
> which will print out offending threads (assuming you reach a 10 second =
pause)=E2=80=A6
>=20
> That is probably your best bet.
>=20
>> On Oct 24, 2014, at 2:38 PM, graham sanderson <graham@vast.com =
<mailto:graham@vast.com>> wrote:
>>=20
>> This certainly sounds like a JVM bug.
>>=20
>> We are running C* 2.0.9 on pretty high end machines with pretty large =
heaps, and don=E2=80=99t seem to have seen this (note we are on 7u67, so =
that might be an interesting data point, though since the old thread =
predated that probably not)
>>=20
>> 1) =46rom the app/java side, I=E2=80=99d obviously see if you can =
identify anything which always coincides with this - repair, compaction =
etc
>> 2) =46rom the VM side (given that this as Benedict mentioned) some =
threads are taking a long time to rendezvous at the safe point, and it =
is probably not application threads, I=E2=80=99d look what GC threads, =
compiler threads etc might be doing. As mentioned it shouldn=E2=80=99t =
be anything to do with operations which run at a safe point anyway (e.g. =
scavenge)
>> 	a) So look at what CMS is doing at the time and see if you can =
correlate
>> 	b) Check Oracle for related bugs - didn=E2=80=99t obviously see =
any, but there have been some complaints related to compilation and safe =
points
>> 	c) Add any compilation tracing you can
>> 	d) Kind of important here - see if you can figure out via =
dtrace, system tap, gdb or whatever, what the threads are doing when =
this happens. Sadly it doesn=E2=80=99t look like you can figure out when =
this is happening (until afterwards) unless you have access to a debug =
JVM build (and can turn on -XX:+TraceSafepoint and look for a safe point =
start without a corresponding update within a time period) - if you =
don=E2=80=99t have access to that, I guess you could try and get a dump =
every 2-3 seconds (you should catch a 9 second pause eventually!)
>>=20
>>> On Oct 24, 2014, at 12:35 PM, Dan van Kley <dvankley@salesforce.com =
<mailto:dvankley@salesforce.com>> wrote:
>>>=20
>>> I'm also curious to know if this was ever resolved or if there's any =
other recommended steps to take to continue to track it down. I'm seeing =
the same issue in our production cluster, which is running Cassandra =
2.0.10 and JVM 1.7u71, using the CMS collector. Just as described above, =
the issue is long "Total time for which application threads were =
stopped" pauses that are not a direct result of GC pauses (ParNew, =
initial mark or remark). When I enabled the safepoint logging I saw the =
same result, long "sync" pause times with short spin and block times, =
usually with the "RevokeBias" description. We're seeing pause times =
sometimes in excess of 10 seconds, so it's a pretty debilitating issue. =
Our machines are not swapping (or even close to it) or having other load =
issues when these pauses occur. Any ideas would be very appreciated. =
Thanks!
>>=20
>=20


--Apple-Mail=_F343178C-0756-4A2B-855B-9643B246D149
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html;
	charset=utf-8

<html><head><meta http-equiv=3D"Content-Type" content=3D"text/html =
charset=3Dutf-8"></head><body style=3D"word-wrap: break-word; =
-webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" =
class=3D"">And -XX:SafepointTimeoutDelay=3Dxxx<div class=3D""><br =
class=3D""></div><div class=3D"">to set how long before it dumps output =
(defaults to 10000 I believe)=E2=80=A6</div><div class=3D""><br =
class=3D""></div><div class=3D"">Note it doesn=E2=80=99t actually =
timeout by default, it just prints the problematic threads after that =
time and keeps on waiting</div><div class=3D""><br =
class=3D""><div><blockquote type=3D"cite" class=3D""><div class=3D"">On =
Oct 24, 2014, at 2:44 PM, graham sanderson &lt;<a =
href=3D"mailto:graham@vast.com" class=3D"">graham@vast.com</a>&gt; =
wrote:</div><br class=3D"Apple-interchange-newline"><div class=3D""><meta =
http-equiv=3D"Content-Type" content=3D"text/html charset=3Dutf-8" =
class=3D""><div style=3D"word-wrap: break-word; -webkit-nbsp-mode: =
space; -webkit-line-break: after-white-space;" class=3D"">Actually - =
there is&nbsp;<div class=3D""><br class=3D""></div><div =
class=3D"">-XX:+SafepointTimeout</div><div class=3D""><br =
class=3D""></div><div class=3D"">which will print out offending threads =
(assuming you reach a 10 second pause)=E2=80=A6</div><div class=3D""><br =
class=3D""></div><div class=3D"">That is probably your best =
bet.</div><div class=3D""><br class=3D""><div class=3D""><blockquote =
type=3D"cite" class=3D""><div class=3D"">On Oct 24, 2014, at 2:38 PM, =
graham sanderson &lt;<a href=3D"mailto:graham@vast.com" =
class=3D"">graham@vast.com</a>&gt; wrote:</div><br =
class=3D"Apple-interchange-newline"><div class=3D""><meta =
http-equiv=3D"Content-Type" content=3D"text/html charset=3Dutf-8" =
class=3D""><div style=3D"word-wrap: break-word; -webkit-nbsp-mode: =
space; -webkit-line-break: after-white-space;" class=3D"">This certainly =
<i class=3D"">sounds</i> like a JVM bug.<div class=3D""><br =
class=3D""></div><div class=3D"">We are running C* 2.0.9 on pretty high =
end machines with pretty large heaps, and don=E2=80=99t seem to have =
seen this (note we are on 7u67, so that might be an interesting data =
point, though since the old thread predated that probably not)</div><div =
class=3D""><br class=3D""></div><div class=3D"">1) =46rom the app/java =
side, I=E2=80=99d obviously see if you can identify anything which =
always coincides with this - repair, compaction etc</div><div =
class=3D"">2) =46rom the VM side (given that this as Benedict mentioned) =
some threads are taking a long time to rendezvous at the safe point, and =
it is probably not application threads, I=E2=80=99d look what GC =
threads, compiler threads etc might be doing. As mentioned it =
shouldn=E2=80=99t be anything to do with operations which run at a safe =
point anyway (e.g. scavenge)</div><div class=3D""><span =
class=3D"Apple-tab-span" style=3D"white-space:pre">	</span>a) So =
look at what CMS is doing at the time and see if you can =
correlate</div><div class=3D""><span class=3D"Apple-tab-span" =
style=3D"white-space:pre">	</span>b) Check Oracle for related bugs =
- didn=E2=80=99t obviously see any, but there have been some complaints =
related to compilation and safe points</div><div class=3D""><span =
class=3D"Apple-tab-span" style=3D"white-space:pre">	</span>c) Add =
any compilation tracing you can</div><div class=3D""><span =
class=3D"Apple-tab-span" style=3D"white-space:pre">	</span>d) Kind =
of important here - see if you can figure out via dtrace, system tap, =
gdb or whatever, what the threads are doing when this happens. Sadly it =
doesn=E2=80=99t look like you can figure out when this is happening =
(until afterwards) unless you have access to a debug JVM build (and can =
turn on -XX:+TraceSafepoint and look for a safe point start without a =
corresponding update within a time period) - if you don=E2=80=99t have =
access to that, I guess you could try and get a dump every 2-3 seconds =
(you should catch a 9 second pause eventually!)</div><div class=3D""><br =
class=3D""><div class=3D""><blockquote type=3D"cite" class=3D""><div =
class=3D"">On Oct 24, 2014, at 12:35 PM, Dan van Kley &lt;<a =
href=3D"mailto:dvankley@salesforce.com" =
class=3D"">dvankley@salesforce.com</a>&gt; wrote:</div><br =
class=3D"Apple-interchange-newline"><div class=3D""><div dir=3D"ltr" =
class=3D"">I'm also curious to know if this was ever resolved or if =
there's any other recommended steps to take to continue to track it =
down. I'm seeing the same issue in our production cluster, which is =
running Cassandra 2.0.10 and JVM 1.7u71, using the CMS collector. Just =
as described above, the issue is long "Total time for which application =
threads were stopped" pauses that are not a direct result of GC pauses =
(ParNew, initial mark or remark). When I enabled the safepoint logging I =
saw the same result, long "sync" pause times with short spin and block =
times, usually with the "RevokeBias" description. We're seeing pause =
times sometimes in excess of 10 seconds, so it's a pretty debilitating =
issue. Our machines are not swapping (or even close to it) or having =
other load issues when these pauses occur. Any ideas would be very =
appreciated. Thanks!</div>
</div></blockquote></div><br =
class=3D""></div></div></div></blockquote></div><br =
class=3D""></div></div></div></blockquote></div><br =
class=3D""></div></body></html>=

--Apple-Mail=_F343178C-0756-4A2B-855B-9643B246D149--

--Apple-Mail=_B70FAF36-7519-4FFE-906B-C9D42316DF29
Content-Disposition: attachment;
	filename=smime.p7s
Content-Type: application/pkcs7-signature;
	name=smime.p7s
Content-Transfer-Encoding: base64

MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIICuzCCArcw
ggIgAgIBTDANBgkqhkiG9w0BAQUFADCBojELMAkGA1UEBhMCVVMxCzAJBgNVBAgTAk9SMREwDwYD
VQQHEwhQb3J0bGFuZDEWMBQGA1UEChMNT21uaS1FeHBsb3JlcjEWMBQGA1UECxMNSVQgRGVwYXJ0
bWVudDEbMBkGA1UEAxMSd3d3LmNvcm5lcmNhc2UuY29tMSYwJAYJKoZIhvcNAQkBFhdibG9ja291
dEBjb3JuZXJjYXNlLmNvbTAeFw0xMTA0MDYxNjE0MzFaFw0yMTA0MDMxNjE0MzFaMIGjMQswCQYD
VQQGEwJVUzETMBEGA1UECBMKQ2FsaWZvcm5pYTEWMBQGA1UEBxMNU2FuIEZyYW5jaXNjbzEWMBQG
A1UEChMNVmFzdC5jb20gSW5jLjEUMBIGA1UECxMLRW5naW5lZXJpbmcxGTAXBgNVBAMTEEdyYWhh
bSBTYW5kZXJzb24xHjAcBgkqhkiG9w0BCQEWD2dyYWhhbUB2YXN0LmNvbTCBnzANBgkqhkiG9w0B
AQEFAAOBjQAwgYkCgYEAm4K/W/0VdaOiS6tC1G8tSCAw989XCsJXxVPiny/hND6T0jVv4vP0JRiO
vNzH6uoINoKQfgUKa+GCqILdY7Jdx61/WKqxltFTu5D0H8sFFNIKgf9cd3yU6t2susKrxaDXRCul
pmcJ3AFg4xuG3ZUZt+XTYhBebQfjwgGQh3/pkQUCAwEAATANBgkqhkiG9w0BAQUFAAOBgQCKW+hQ
JqNkPRht5fl8FHku80BLAH9ezEJtZJ6EU9fcK9jNPkAJgSEgPXQ++jE+4iYI2nIb/h5RILUxd1Ht
m/yZkNRUVCg0+0Qj6aMT/hfOT0kdP8/9OnbmIp2T6qvNN2rAGU58tt3cbuT2j3LMTS2VOGykK4He
iNYYqr+K6sPDHTGCAy0wggMpAgEBMIGpMIGiMQswCQYDVQQGEwJVUzELMAkGA1UECBMCT1IxETAP
BgNVBAcTCFBvcnRsYW5kMRYwFAYDVQQKEw1PbW5pLUV4cGxvcmVyMRYwFAYDVQQLEw1JVCBEZXBh
cnRtZW50MRswGQYDVQQDExJ3d3cuY29ybmVyY2FzZS5jb20xJjAkBgkqhkiG9w0BCQEWF2Jsb2Nr
b3V0QGNvcm5lcmNhc2UuY29tAgIBTDAJBgUrDgMCGgUAoIIB2TAYBgkqhkiG9w0BCQMxCwYJKoZI
hvcNAQcBMBwGCSqGSIb3DQEJBTEPFw0xNDEwMjQxOTUyMTFaMCMGCSqGSIb3DQEJBDEWBBTKOe9F
fmckDWRtYaQ0xoqsk8ZZbDCBugYJKwYBBAGCNxAEMYGsMIGpMIGiMQswCQYDVQQGEwJVUzELMAkG
A1UECBMCT1IxETAPBgNVBAcTCFBvcnRsYW5kMRYwFAYDVQQKEw1PbW5pLUV4cGxvcmVyMRYwFAYD
VQQLEw1JVCBEZXBhcnRtZW50MRswGQYDVQQDExJ3d3cuY29ybmVyY2FzZS5jb20xJjAkBgkqhkiG
9w0BCQEWF2Jsb2Nrb3V0QGNvcm5lcmNhc2UuY29tAgIBTDCBvAYLKoZIhvcNAQkQAgsxgayggakw
gaIxCzAJBgNVBAYTAlVTMQswCQYDVQQIEwJPUjERMA8GA1UEBxMIUG9ydGxhbmQxFjAUBgNVBAoT
DU9tbmktRXhwbG9yZXIxFjAUBgNVBAsTDUlUIERlcGFydG1lbnQxGzAZBgNVBAMTEnd3dy5jb3Ju
ZXJjYXNlLmNvbTEmMCQGCSqGSIb3DQEJARYXYmxvY2tvdXRAY29ybmVyY2FzZS5jb20CAgFMMA0G
CSqGSIb3DQEBAQUABIGAeUHI5/sA1ks93SsyI8imTnGliAqE8inLxzwn1/MaOtCMYZLkj19BhbHJ
cYTGh9wjgrbHM56OHS6m8EiTW80k8I/tbemU4KUaVqJt//UUA+kS/6P8DaRdV8enrGQ+Tncvx3T7
/MwMINREELVxwwVxSNBC1SndxmlQSvR7ncdCnBgAAAAAAAA=
--Apple-Mail=_B70FAF36-7519-4FFE-906B-C9D42316DF29--