Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (athena.apache.org: 209.85.214.44 is neither permitted
 nor denied by domain of oberman@civicscience.com)
MIME-Version: 1.0
From: William Oberman <oberman@civicscience.com>
Date: Wed, 22 Jun 2011 08:33:58 -0400
Message-ID: <BANLkTimX6a2WFsU-OTpFwnvSu5KiXrYuaA@mail.gmail.com>
Subject: OOM (or, what settings to use on AWS large?)
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=bcaec5540054e6b15f04a64c2fbb

--bcaec5540054e6b15f04a64c2fbb
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

I woke up this morning to find all 4 Cassandra instances in my cluster
reporting that they were down.  I quickly restarted them all, and everything
seems fine.  I'm doing a postmortem now.  It appears they all OOM'd at
roughly the same time.  Nothing was reported in any Cassandra log, but I
found entries in /var/log/kern showing that java died of OOM(*).  On Amazon
I'm using large instances for Cassandra, and they have no swap (as
recommended), so I have ~8GB of RAM.  Should I use a different max memory
setting?  I'm using a stock RPM from Riptano/DataStax.  If I run "ps -aux" I
get:

/usr/bin/java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
-Xms3843M -Xmx3843M -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss128k
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-Djava.net.preferIPv4Stack=true -Djava.rmi.server.hostname=X.X.X.X
-Dcom.sun.management.jmxremote.port=8080
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false -Dmx4jaddress=0.0.0.0
-Dmx4jport=8081 -Dlog4j.configuration=log4j-server.properties
-Dlog4j.defaultInitOverride=true
-Dcassandra-pidfile=/var/run/cassandra/cassandra.pid -cp
:/etc/cassandra/conf:/usr/share/cassandra/lib/antlr-3.1.3.jar:/usr/share/cassandra/lib/apache-cassandra-0.7.4.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-collections-3.2.1.jar:/usr/share/cassandra/lib/commons-lang-2.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.1.jar:/usr/share/cassandra/lib/guava-r05.jar:/usr/share/cassandra/lib/high-scale-lib.jar:/usr/share/cassandra/lib/jackson-core-asl-1.4.0.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.4.0.jar:/usr/share/cassandra/lib/jetty-6.1.21.jar:/usr/share/cassandra/lib/jetty-util-6.1.21.jar:/usr/share/cassandra/lib/jline-0.9.94.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/jug-2.0.0.jar:/usr/share/cassandra/lib/libthrift-0.5.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/mx4j-tools.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.6.1.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.6.1.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar
org.apache.cassandra.thrift.CassandraDaemon
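
For a rough sense of how much memory those flags leave outside the Java heap,
here is a minimal sketch that parses the size flags from the command line
above.  The 7680 MB figure is an assumption (7.5 GiB on an AWS large
instance, the "~8GB" mentioned earlier), and the helper names are
hypothetical, not part of Cassandra.

```python
import re

# Size flags copied from the ps output (everything else omitted).
CMDLINE = "-Xms3843M -Xmx3843M -Xmn200M -Xss128k"

# Multipliers for the JVM's k/m/g size suffixes.
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert a JVM size such as '3843M' or '128k' to bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", text)
    if match is None:
        raise ValueError("not a JVM size: %r" % text)
    number, unit = match.groups()
    return int(number) * UNITS.get(unit.lower(), 1)


def jvm_sizes(cmdline):
    """Return {flag: bytes} for -Xms, -Xmx, -Xmn and -Xss in a command line."""
    sizes = {}
    for token in cmdline.split():
        match = re.fullmatch(r"-(Xms|Xmx|Xmn|Xss)(\d+[kKmMgG]?)", token)
        if match:
            sizes[match.group(1)] = parse_size(match.group(2))
    return sizes


def headroom_mb(cmdline, physical_ram_mb):
    """Physical RAM (MB) left after subtracting the maximum heap (-Xmx)."""
    max_heap_mb = jvm_sizes(cmdline)["Xmx"] // 1024 ** 2
    return physical_ram_mb - max_heap_mb


if __name__ == "__main__":
    # ASSUMPTION: 7680 MB of physical RAM and no swap.
    print("non-heap headroom (MB):", headroom_mb(CMDLINE, 7680))
```

Under that assumption the heap takes 3843 MB and about 3837 MB remain for
everything else on the box (JVM overhead outside the heap, thread stacks,
the OS).  This only bounds the heap; it does not by itself explain what the
kernel OOM killer reacted to.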

(*) Also, why would they all OOM so close to each other?  Bad luck?  Or,
once the first node goes down, is there an increased chance that the rest
follow?

I'm still on 0.7.4, which was the latest release when I put Cassandra into
production.  In addition to (or instead of?) fixing the memory settings, I'm
guessing I should upgrade.

will

--bcaec5540054e6b15f04a64c2fbb--