From: William Oberman
Date: Wed, 22 Jun 2011 08:33:58 -0400
Subject: OOM (or, what settings to use on AWS large?)
To: user@cassandra.apache.org

I woke up this morning to find all 4 of the 4 Cassandra instances in my cluster reporting that they were down. I restarted them all, and everything seems fine. I'm doing a postmortem now. It appears they all OOM'd at roughly the same time. Nothing was reported in any Cassandra log, but I found entries in /var/log/kern showing that java died of OOM (*). On Amazon I'm using large instances for Cassandra, and they have no swap (as recommended), so I have ~8GB of RAM. Should I use a different max memory setting? I'm using a stock RPM from Riptano/DataStax. If I run "ps -aux" I get:

/usr/bin/java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms3843M -Xmx3843M -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss128k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.net.preferIPv4Stack=true -Djava.rmi.server.hostname=X.X.X.X -Dcom.sun.management.jmxremote.port=8080 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dmx4jaddress=0.0.0.0 -Dmx4jport=8081 -Dlog4j.configuration=log4j-server.properties -Dlog4j.defaultInitOverride=true -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid -cp :/etc/cassandra/conf:/usr/share/cassandra/lib/antlr-3.1.3.jar:/usr/share/cassandra/lib/apache-cassandra-0.7.4.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-collections-3.2.1.jar:/usr/share/cassandra/lib/commons-lang-2.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.1.jar:/usr/share/cassandra/lib/guava-r05.jar:/usr/share/cassandra/lib/high-scale-lib.jar:/usr/share/cassandra/lib/jackson-core-asl-1.4.0.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.4.0.jar:/usr/share/cassandra/lib/jetty-6.1.21.jar:/usr/share/cassandra/lib/jetty-util-6.1.21.jar:/usr/share/cassandra/lib/jline-0.9.94.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/jug-2.0.0.jar:/usr/share/cassandra/lib/libthrift-0.5.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/mx4j-tools.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.6.1.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.6.1.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar org.apache.cassandra.thrift.CassandraDaemon

(*) Also, why would they all OOM so close to each other? Bad luck? Or, once the first node went down, is there an increased chance of the rest going down too?

I'm still on 0.7.4; when I released Cassandra to production, that was the latest release. In addition to (or instead of?) fixing the memory settings, I'm guessing I should upgrade.

will
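For anyone repeating this kind of postmortem, here is a rough Python sketch of how the memory flags in the "ps" output can be pulled out and compared with the instance's RAM. It assumes the ~8GB figure quoted in the message, uses an abbreviated copy of the command line (classpath and most flags omitted), and the helper names are made up for illustration. It only handles sizes with a k/m/g suffix, and it says nothing about how much memory the JVM uses outside the heap.

```python
import re

# Abbreviated copy of the flags from the ps output in the message
# (classpath and unrelated options omitted for brevity).
CMDLINE = "/usr/bin/java -ea -Xms3843M -Xmx3843M -Xmn200M -Xss128k -XX:+UseParNewGC"

# Assumption: ~8GB of RAM and no swap, as stated in the message.
TOTAL_RAM_MB = 8 * 1024

# Multipliers that convert a JVM size suffix to megabytes.
_UNITS_MB = {"k": 1 / 1024, "m": 1, "g": 1024}


def parse_size_mb(value):
    """Convert a JVM size such as '3843M' or '128k' to megabytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG])", value)
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS_MB[unit.lower()]


def jvm_memory_flags(cmdline):
    """Return {flag_name: size_in_mb} for -Xms, -Xmx, -Xmn and -Xss."""
    flags = {}
    for token in cmdline.split():
        match = re.fullmatch(r"-(Xms|Xmx|Xmn|Xss)(\d+[kKmMgG])", token)
        if match:
            flags[match.group(1)] = parse_size_mb(match.group(2))
    return flags


flags = jvm_memory_flags(CMDLINE)
heap_mb = flags["Xmx"]
print(
    f"max heap: {heap_mb:.0f} MB of {TOTAL_RAM_MB} MB "
    f"({heap_mb / TOTAL_RAM_MB:.0%}); "
    f"left for everything else: {TOTAL_RAM_MB - heap_mb:.0f} MB"
)
```

The remainder has to cover everything outside the Java heap (thread stacks, JVM overhead, other processes on the box), so the numbers are only a starting point for deciding whether -Xmx is too large for the instance.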