Return-Path:
Delivered-To: apmail-cassandra-user-archive@www.apache.org
Received: (qmail 31723 invoked from network); 17 May 2010 23:02:36 -0000
Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 17 May 2010 23:02:36 -0000
Received: (qmail 99156 invoked by uid 500); 17 May 2010 23:02:35 -0000
Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org
Received: (qmail 99132 invoked by uid 500); 17 May 2010 23:02:35 -0000
Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: user@cassandra.apache.org
Delivered-To: mailing list user@cassandra.apache.org
Received: (qmail 99124 invoked by uid 99); 17 May 2010 23:02:35 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 17 May 2010 23:02:35 +0000
X-ASF-Spam-Status: No, hits=2.9 required=10.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_NEUTRAL
X-Spam-Check-By: apache.org
Received-SPF: neutral (nike.apache.org: local policy)
Received: from [209.85.222.182] (HELO mail-pz0-f182.google.com) (209.85.222.182) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 17 May 2010 23:02:29 +0000
Received: by pzk12 with SMTP id 12so3346200pzk.9 for ; Mon, 17 May 2010 16:02:07 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.140.248.16 with SMTP id v16mr4218871rvh.230.1274137327627; Mon, 17 May 2010 16:02:07 -0700 (PDT)
Received: by 10.141.40.1 with HTTP; Mon, 17 May 2010 16:02:07 -0700 (PDT)
In-Reply-To:
References:
Date: Mon, 17 May 2010 16:02:07 -0700
Message-ID:
Subject: Re: Problems running Cassandra 0.6.1 on large EC2 instances.
From: Curt Bererton
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=000e0cd1a104ceee890486d23614
X-Virus-Checked: Checked by ClamAV on apache.org

--000e0cd1a104ceee890486d23614
Content-Type: text/plain; charset=UTF-8

Here are the current JVM args and Java version:

# Arguments to pass to the JVM
JVM_OPTS=" \
        -ea \
        -Xms128M \
        -Xmx7G \
        -XX:TargetSurvivorRatio=90 \
        -XX:+AggressiveOpts \
        -XX:+UseParNewGC \
        -XX:+UseConcMarkSweepGC \
        -XX:+CMSParallelRemarkEnabled \
        -XX:+HeapDumpOnOutOfMemoryError \
        -XX:SurvivorRatio=128 \
        -XX:MaxTenuringThreshold=0 \
        -Dcom.sun.management.jmxremote.port=8080 \
        -Dcom.sun.management.jmxremote.ssl=false \
        -Dcom.sun.management.jmxremote.authenticate=false"

java -version outputs:

java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

So pretty much the defaults aside from the 7 GB max heap. The CPU is totally
hammered right now, even though the node is receiving 0 ops/sec from me: I have
disconnected it from our application until I can figure out what's going on.

Running top on the machine I get:

top - 18:56:32 up 2 days, 20:57,  2 users,  load average: 14.97, 15.24, 15.13
Tasks:  87 total,   5 running,  82 sleeping,   0 stopped,   0 zombie
Cpu(s): 40.1%us, 33.9%sy,  0.0%ni,  0.1%id,  0.0%wa,  0.0%hi,  1.3%si, 24.6%st
Mem:   7872040k total,  3618764k used,  4253276k free,   387536k buffers
Swap:        0k total,        0k used,        0k free,  1655556k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2566 cassandr  25   0 7906m 639m  10m S  150  8.3   5846:35 java

I have jconsole up and running, and the jconsole VM Summary tab says:
 - Total physical memory: 7,872,040 K
 - Free physical memory: 4,253,036 K
 - Total swap space: 0 K
 - Free swap space: 0 K
 - Committed virtual memory: 8,096,648 K

Is there a specific thread I can look at in jconsole that might give me a
clue? It's weird that it's still at 100% CPU even though it's getting no
traffic from outside right now.
I suppose it might still be talking across the machines, though.

Also, stopping and restarting Cassandra on one of the 4 machines caused the
CPU to go back down to almost normal levels.

Here's the ring:

Address       Status  Load     Range                                     Ring
                               170141183460469231731687303715884105728
10.251.XX.XX  Up      2.15 MB  42535295865117307932921825928971026432    |<--|
10.250.XX.XX  Up      2.42 MB  85070591730234615865843651857942052864    |   |
10.250.XX.XX  Up      2.47 MB  127605887595351923798765477786913079296   |   |
10.250.XX.XX  Up      2.46 MB  170141183460469231731687303715884105728   |-->|

Any thoughts?

Best,
Curt
--
Curt, ZipZapPlay Inc., www.PlayCrafter.com,
http://apps.facebook.com/happyhabitat

On Mon, May 17, 2010 at 3:51 PM, Mark Greene wrote:

> Can you provide us with the current JVM args? Also, what type of work load
> you are giving the ring (op/s)?
>
>
> On Mon, May 17, 2010 at 6:39 PM, Curt Bererton wrote:
>
>> Hello Cassandra users+experts,
>>
>> Hopefully someone will be able to point me in the correct direction. We
>> have cassandra 0.6.1 working on our test servers and we *thought* everything
>> was great and ready to move to production. We are currently running a ring
>> of 4 large instance EC2 (http://aws.amazon.com/ec2/instance-types/)
>> servers on production with a replication factor of 3 and a QUORUM
>> consistency level. We ran a test on 1% of our users, and everything was
>> writing to and reading from cassandra great for the first 3 hours. After
>> that point CPU usage spiked to 100% and stayed there, basically on all 4
>> machines at once. This smells to me like a GC issue, and I'm looking into it
>> with jconsole right now. If anyone can help me debug this and get cassandra
>> all the way up and running without CPU spiking I would be forever in their
>> debt.
>>
>> I suspect that anyone else running cassandra on large EC2 instances might
>> just be able to tell me what JVM args they are successfully using in a
>> production environment and if they upgraded to Cassandra 0.6.2 from 0.6.1,
>> and did they go to batched writes due to bug 1014? (
>> https://issues.apache.org/jira/browse/CASSANDRA-1014) That might answer
>> all my questions.
>>
>> Is there anyone on the list who is using large EC2 instances in
>> production? Would you be kind enough to share your JVM arguments and any
>> other tips?
>>
>> Thanks for any help,
>> Curt
>> --
>> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
>> http://apps.facebook.com/happyhabitat
>>
>
>

--000e0cd1a104ceee890486d23614
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Here are the current jvm args=C2=A0 and java version:

--000e0cd1a104ceee890486d23614--
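
The four tokens in the ring listing look evenly spaced. Here is a minimal sketch that reproduces them as i * 2^127 / 4 for i = 1..4, assuming the ring uses RandomPartitioner with a 2^127 token space (an assumption; the message does not name the partitioner). The function name balanced_tokens is hypothetical and is not part of Cassandra.

```python
def balanced_tokens(node_count, token_space=2**127):
    """Return evenly spaced tokens i * token_space / node_count for i = 1..node_count.

    token_space=2**127 is an assumption (RandomPartitioner-style token range);
    it is not stated in the message itself.
    """
    return [i * token_space // node_count for i in range(1, node_count + 1)]


# Tokens copied from the ring listing in the message.
ring_tokens = [
    42535295865117307932921825928971026432,
    85070591730234615865843651857942052864,
    127605887595351923798765477786913079296,
    170141183460469231731687303715884105728,
]

if __name__ == "__main__":
    for token in balanced_tokens(4):
        print(token)
    # Compare the computed tokens with the ones reported by the ring listing.
    print("matches ring listing:", balanced_tokens(4) == ring_tokens)
```

If the computed tokens match the listing, the ring is balanced by token. Together with the near-identical load figures (2.15 MB to 2.47 MB), that would suggest uneven token assignment is probably not behind the CPU spike.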