Subject: Re: GC pauses affecting entire cluster
From: graham sanderson
Date: Mon, 1 Jun 2015 16:29:03 -0500
To: user@cassandra.apache.org
Cc: Anuj Wadehra

Yes, native objects (memtable_allocation_type: offheap_objects) is the way to go… you can tell if memtables are your problem because you'll see promotion failures of objects sized 131074 dwords.

If your h/w is fast enough, make your young gen as big as possible - we can always collect an 8G young gen in under a second, and this gives you the best chance of keeping transient objects (especially if you still have thrift clients) from leaking into the old gen. Moving to 2.1.x (and off-heap memtables) from 2.0.x, we have reduced our old gen from 16G to 12G and will keep shrinking it, but have had no promotion failures yet, and it's been several months.

Note we are running a patched 2.1.3, but 2.1.5 has the equivalent important bugs fixed (the ones that might have given you memory issues).
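For anyone who wants to try the same shape of heap, here is a minimal cassandra-env.sh sketch of what Graham describes; the 8G young gen and the roughly 12G old gen are his figures, while the total heap size is just their sum and should be sized to your own hardware:

# cassandra-env.sh (sketch): a large young gen lets transient request and
# memtable garbage die in minor collections instead of being promoted
MAX_HEAP_SIZE="20G"   # illustrative total: ~12G old gen + 8G young gen
HEAP_NEWSIZE="8G"     # Graham: an 8G young gen collects in under a second on fast hardware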

On Jun 1, 2015, at 3:00 PM, Carl Hu <me@carlhu.com> wrote:

Thank you for the suggestion. After analysis of your settings, the basic hypothesis here is very quick promotion to the old gen because of a rapid accumulation of heap usage due to memtables. We happen to be running on 2.1, and I thought a more conservative approach than your (quite aggressive) GC settings is to try the new memtable_allocation_type with offheap_objects and see if the memtable pressure is relieved enough that the standard GC settings can keep up.

The experiment is in progress and I will report back with the results.
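For readers following along, the change Carl is testing is a single cassandra.yaml setting on 2.1; a sketch (the commented-out space caps are optional and, if left unset, default to a quarter of the heap):

# cassandra.yaml (2.1+): memtable cell data allocated off the Java heap
memtable_allocation_type: offheap_objects
# memtable_heap_space_in_mb: 2048      # optional explicit caps; defaults are 1/4 of the heap
# memtable_offheap_space_in_mb: 2048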

On Mon, Jun 1, 2015 at 10:20 AM, Anuj Wadehra <anujw_2003@yahoo.co.in> wrote:
We have a write-heavy workload and used to face promotion failures/long GC pauses with Cassandra 2.0.x. I am not into the code yet, but I think that memtable- and compaction-related objects have a mid-life span, and a write-heavy workload is not well suited to generational collection with the defaults. So we tuned the JVM to make sure that as few objects as possible are promoted to the old gen, and achieved great success with that:
MAX_HEAP_SIZE="12G"
HEAP_NEWSIZE="3G"
-XX:SurvivorRatio=2
-XX:MaxTenuringThreshold=20
-XX:CMSInitiatingOccupancyFraction=70
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=20"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=30000"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=2000"
JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
We also think that the default memtable space of 1/4 of the heap (total_memtable_space_in_mb) is too much for write-heavy loads; by default the young gen is also 1/4 of the heap. We reduced the memtable space to 1000 MB in order to make sure that memtable-related objects don't stay in memory for too long. Combining this with SurvivorRatio=2 and MaxTenuringThreshold=20 did the job well. GC was very consistent; no Full GC observed.
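Concretely, the quarter-of-heap defaults Anuj is overriding look roughly like this (a sketch against stock 2.0.x config files, where the yaml key is spelled memtable_total_space_in_mb):

# cassandra.yaml (2.0.x): cap memtable space well below the 1/4-heap default
memtable_total_space_in_mb: 1000

# cassandra-env.sh: heap and young gen from the settings above
MAX_HEAP_SIZE="12G"
HEAP_NEWSIZE="3G"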

Environment: 3-node cluster, with each node having 24 cores, 64 GB RAM, and SSDs in RAID 5.
We are making around 12k writes/sec across 5 column families (one with 4 secondary indexes) and 2,300 reads/sec on each node of the 3-node cluster. 2 CFs have wide rows with a maximum of around 100 MB of data per row.

Yes, a node being marked down has a cascading effect: within seconds, all nodes in our cluster are marked down.

Thanks
Anuj Wadehra



On Monday, 1 June 2015 7:12 PM, Carl Hu <me@carlhu.com> wrote:

We are running Cassandra version 2.1.5.469 on 15 nodes and are experiencing a problem where the entire cluster slows down for 2.5 minutes when one node experiences a 17-second stop-the-world GC. These GCs happen once every 2 hours. I did find a ticket that seems related to this: https://issues.apache.org/jira/browse/CASSANDRA-3853, but Jonathan Ellis has resolved this ticket.

We are running standard GC settings, but my concern is not so much the 17-second GC on a single node (after all, we have 14 others) as the cascading performance problem.

We are running standard values of dynamic_snitch_badness_threshold (0.1) and phi_convict_threshold (8). (These values are relevant to the dynamic snitch routing requests away from the frozen node, and to the failure detector marking the node as 'down'.)
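For reference, the two defaults Carl mentions as they appear in cassandra.yaml (a sketch; tightening either one is a lever for shedding a paused node faster, at the cost of more spurious 'down' markings):

# cassandra.yaml (stock defaults)
dynamic_snitch_badness_threshold: 0.1   # how much worse a replica must score before requests are routed away from it
phi_convict_threshold: 8                # lower values make the failure detector mark an unresponsive node down sooner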

We use the Python client in default round-robin mode, so all clients hit the coordinators on all nodes in round robin. One theory is that since the coordinator on every node must hit the frozen node at some point during the 17 seconds, each node's request queue fills up and the entire cluster thus freezes up. That would explain a 17-second freeze, but it would not explain the 2.5-minute slowdown (a 10x increase in request latency at P50).
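One way to take the default round-robin coordinator choice out of the picture is token-aware routing in the DataStax Python driver. A minimal sketch (the contact points and keyspace name are hypothetical): each statement is sent to a replica that owns its partition key, so a paused node should only stall the token ranges it owns rather than every coordinator queue in the ring.

from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Route each statement to a replica for its partition key instead of
# round-robining every request through every coordinator.
cluster = Cluster(
    contact_points=['10.0.0.1', '10.0.0.2'],                        # hypothetical seeds
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()),
)
session = cluster.connect('my_keyspace')                            # hypothetical keyspace
rows = session.execute("SELECT release_version FROM system.local")

Note that token awareness only applies to statements whose routing key the driver can determine (prepared statements, or statements with an explicit routing key); anything else falls back to the wrapped round-robin policy.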

I'd love your thoughts. I've provided the GC chart here.

Carl

[attached image: d2c95dce-0848-11e5-91f7-6b223349fc14.png (GC chart)]



