Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <AANLkTimiaDx1CC7q-SSDQtatUFqw73y8XsJq6b_cknWg@mail.gmail.com>
References: <AANLkTimIrYQ8ErDd0XQMoiAJ_uOjkwKuW2UDR1b-SApW@mail.gmail.com>
	 <AANLkTimiaDx1CC7q-SSDQtatUFqw73y8XsJq6b_cknWg@mail.gmail.com>
Date: Mon, 17 May 2010 16:02:07 -0700
Message-ID: <AANLkTikUS7o0tsoFYkDa6OkMfdoHX9GfhImNq3dzOIKI@mail.gmail.com>
Subject: Re: Problems running Cassandra 0.6.1 on large EC2 instances.
From: Curt Bererton <curt@zipzapplay.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=000e0cd1a104ceee890486d23614

--000e0cd1a104ceee890486d23614
Content-Type: text/plain; charset=UTF-8

Here are the current JVM args and Java version:

# Arguments to pass to the JVM
JVM_OPTS=" \
        -ea \
        -Xms128M \
        -Xmx7G \
        -XX:TargetSurvivorRatio=90 \
        -XX:+AggressiveOpts \
        -XX:+UseParNewGC \
        -XX:+UseConcMarkSweepGC \
        -XX:+CMSParallelRemarkEnabled \
        -XX:+HeapDumpOnOutOfMemoryError \
        -XX:SurvivorRatio=128 \
        -XX:MaxTenuringThreshold=0 \
        -Dcom.sun.management.jmxremote.port=8080 \
        -Dcom.sun.management.jmxremote.ssl=false \
        -Dcom.sun.management.jmxremote.authenticate=false"

java -version outputs:
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

So these are pretty much the defaults aside from the 7 GB max heap. The CPU is
totally hammered right now even though the node is receiving 0 ops/sec from me:
I have disconnected it from our application until I can figure out what's
going on.
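
To check my earlier guess that this is a GC issue, GC logging could be
appended to the options above. This is only a sketch: the flags are standard
HotSpot options on Java 6, but the log path is an assumption and would need to
point somewhere the cassandra user can write.

```shell
# Append GC logging to the existing JVM_OPTS.
# The log file location below is hypothetical; adjust to a writable path.
JVM_OPTS="$JVM_OPTS \
        -verbose:gc \
        -XX:+PrintGCDetails \
        -XX:+PrintGCTimeStamps \
        -Xloggc:/var/log/cassandra/gc.log"
```

Back-to-back CMS cycles or long pauses in that log would support the GC
theory; a quiet log would suggest the CPU is going somewhere else.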

Running top on the machine, I get:
top - 18:56:32 up 2 days, 20:57,  2 users,  load average: 14.97, 15.24, 15.13
Tasks:  87 total,   5 running,  82 sleeping,   0 stopped,   0 zombie
Cpu(s): 40.1%us, 33.9%sy,  0.0%ni,  0.1%id,  0.0%wa,  0.0%hi,  1.3%si, 24.6%st
Mem:   7872040k total,  3618764k used,  4253276k free,   387536k buffers
Swap:        0k total,        0k used,        0k free,  1655556k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2566 cassandr  25   0 7906m 639m  10m S  150  8.3   5846:35 java
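
One figure worth pulling out of the Cpu(s) line is the last one: %st is steal
time, meaning time the hypervisor gave to something other than this VM. Below
is a small sketch for extracting a single field from that line so it can be
watched over time. The helper name cpu_field is made up for this example, and
it assumes the older top format shown above (values followed by %us, %sy, ...).

```shell
#!/bin/sh
# Extract one percentage from a top "Cpu(s):" line read on stdin.
# $1 = field suffix as printed by top: us, sy, ni, id, wa, hi, si or st
cpu_field() {
    # Put each "NN.N%xx" entry on its own line, then keep only the number
    # from the entry whose suffix matches the requested field.
    tr ',' '\n' | sed -n "s/.*[: ]\([0-9.][0-9.]*\)%$1\$/\1/p"
}

line='Cpu(s): 40.1%us, 33.9%sy,  0.0%ni,  0.1%id,  0.0%wa,  0.0%hi,  1.3%si, 24.6%st'
echo "$line" | cpu_field st   # prints 24.6
```

With 24.6% steal on top of 40.1% user and 33.9% system, part of the load here
may be contention on the EC2 host rather than work done by Cassandra itself,
though that is only one possible reading.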


I have jconsole up and running, and its VM Summary tab says:
 - Total physical memory: 7,872,040 K
 - Free physical memory: 4,253,036 K
 - Total swap space: 0 K
 - Free swap space: 0 K
 - Committed virtual memory: 8,096,648 K
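
For comparison, here is how the -Xmx7G setting relates to the physical memory
reported above. This is just arithmetic on the numbers in this mail, assuming
the k in top's output means KiB; heap_headroom_kb is a name invented for the
example.

```shell
#!/bin/sh
# Memory left over if the heap grows to its configured maximum.
# $1 = max heap in GiB (from -Xmx), $2 = total physical memory in KiB
heap_headroom_kb() {
    echo $(( $2 - $1 * 1024 * 1024 ))
}

heap_headroom_kb 7 7872040   # prints 532008
```

That is roughly 519 MB left for the OS and page cache with a fully grown heap
and no swap. The process is only at 639m resident right now, so the heap is
nowhere near that limit yet.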

Is there a specific thread I can look at in jconsole that might give me a
clue? It's weird that it's still at 100% CPU even though it's getting no
traffic from outside right now. I suppose it might still be talking to the
other machines, though.
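
Outside of jconsole, one common way to find a hot thread is to match the
per-thread CPU shown by top -H against a jstack thread dump. A sketch, assuming
jstack from the same JDK is on the PATH and is run as the user that owns the
process. The helper names and the thread ID 2570 are made up for illustration;
2566 is the java PID from the top output above.

```shell
#!/bin/sh
# Convert a decimal thread ID (the PID column of `top -H`) into the
# hexadecimal nid=0x... form used in HotSpot thread dumps.
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# Print the stack of one thread from a thread dump read on stdin.
# $1 = decimal thread ID
stack_for_tid() {
    # The trailing space keeps 0xa0 from also matching 0xa0a.
    grep -A 15 "nid=$(tid_to_nid "$1") "
}

# Intended use (not run here):
#   top -H -b -n 1 -p 2566 | head -20     # find the busiest thread IDs
#   jstack 2566 | stack_for_tid 2570      # show what that thread is doing
```

Taking a few dumps some seconds apart should show whether the same threads
(GC threads, compaction, gossip, and so on) stay at the top.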

Also, stopping and restarting Cassandra on one of the 4 machines caused the
CPU to go back down to almost normal levels.

Here's the ring:
Address       Status     Load          Range                                      Ring
                                       170141183460469231731687303715884105728
10.251.XX.XX  Up         2.15 MB       42535295865117307932921825928971026432     |<--|
10.250.XX.XX  Up         2.42 MB       85070591730234615865843651857942052864     |   |
10.250.XX.XX  Up         2.47 MB       127605887595351923798765477786913079296    |   |
10.250.XX.XX  Up         2.46 MB       170141183460469231731687303715884105728    |-->|

Any thoughts?

Best,
Curt
--
Curt, ZipZapPlay Inc., www.PlayCrafter.com,
http://apps.facebook.com/happyhabitat


On Mon, May 17, 2010 at 3:51 PM, Mark Greene <greenemj@gmail.com> wrote:

> Can you provide us with the current JVM args? Also, what type of work load
> you are giving the ring (op/s)?
>
>
> On Mon, May 17, 2010 at 6:39 PM, Curt Bererton <curt@zipzapplay.com>wrote:
>
>> Hello Cassandra users+experts,
>>
>> Hopefully someone will be able to point me in the correct direction. We
>> have cassandra 0.6.1 working on our test servers and we *thought* everything
>> was great and ready to move to production. We are currently running a ring
>> of 4 large instance EC2 (http://aws.amazon.com/ec2/instance-types/)
>> servers on production with a replication factor of 3 and a QUORUM
>> consistency level. We ran a test on 1% of our users, and everything was
>> writing to and reading from cassandra great for the first 3 hours. After
>> that point CPU usage spiked to 100% and stayed there, basically on all 4
>> machines at once. This smells to me like a GC issue, and I'm looking into it
>> with jconsole right now. If anyone can help me debug this and get cassandra
>> all the way up and running without CPU spiking I would be forever in their
>> debt.
>>
>> I suspect that anyone else running cassandra on large EC2 instances might
>> just be able to tell me what JVM args they are successfully using in a
>> production environment and if they upgraded to Cassandra 0.6.2 from 0.6.1,
>> and did they go to batched writes due to bug 1014? (
>> https://issues.apache.org/jira/browse/CASSANDRA-1014) That might answer
>> all my questions.
>>
>> Is there anyone on the list who is using large EC2 instances in
>> production? Would you be kind enough to share your JVM arguments and any
>> other tips?
>>
>> Thanks for any help,
>> Curt
>> --
>> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
>> http://apps.facebook.com/happyhabitat
>>
>
>

--000e0cd1a104ceee890486d23614--