From: Curt Bererton <curt@zipzapplay.com>
To: user@cassandra.apache.org
Date: Mon, 17 May 2010 17:31:44 -0700
Subject: Re: Problems running Cassandra 0.6.1 on large EC2 instances.

Agreed, and I just saw in storage-conf that a higher value for
MemtableFlushAfterMinutes is suggested, since otherwise you might get a
"flush storm" of all your memtables flushing at once. I've changed that
as well.
--
Curt, ZipZapPlay Inc., www.PlayCrafter.com,
http://apps.facebook.com/happyhabitat
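For reference, the knob in question lives in the 0.6-era
storage-conf.xml and looks something like the excerpt below; the
1440-minute figure is only an illustration of "a higher value", not a
recommendation tested in this thread.

    <!-- Flush a memtable after this many minutes even if its size or
         operation-count thresholds haven't been reached. Keeping this
         high stops the timer from expiring for every memtable at about
         the same moment -- the "flush storm" described above. -->
    <MemtableFlushAfterMinutes>1440</MemtableFlushAfterMinutes>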
On Mon, May 17, 2010 at 5:27 PM, Mark Greene <greenemj@gmail.com> wrote:

> Since you only have 7.5 GB of memory, it's a really bad idea to set
> your heap space to a max of 7 GB. Remember, the java process heap will
> be larger than what Xmx is allowed to grow to. If you reach this level,
> you can start swapping, which is very, very bad. As Brandon pointed
> out, you haven't exhausted your physical memory yet, but you still want
> to lower Xmx to something like 5, maybe 6 GB.
>
> On Mon, May 17, 2010 at 7:02 PM, Curt Bererton <curt@zipzapplay.com> wrote:
>
>> Here are the current JVM args and java version:
>>
>> # Arguments to pass to the JVM
>> JVM_OPTS=" \
>>         -ea \
>>         -Xms128M \
>>         -Xmx7G \
>>         -XX:TargetSurvivorRatio=90 \
>>         -XX:+AggressiveOpts \
>>         -XX:+UseParNewGC \
>>         -XX:+UseConcMarkSweepGC \
>>         -XX:+CMSParallelRemarkEnabled \
>>         -XX:+HeapDumpOnOutOfMemoryError \
>>         -XX:SurvivorRatio=128 \
>>         -XX:MaxTenuringThreshold=0 \
>>         -Dcom.sun.management.jmxremote.port=8080 \
>>         -Dcom.sun.management.jmxremote.ssl=false \
>>         -Dcom.sun.management.jmxremote.authenticate=false"
>>
>> java -version outputs:
>>
>> java version "1.6.0_20"
>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>>
>> So pretty much the defaults aside from the 7 GB max heap.
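Applying Mark's advice to those flags gives something like the sketch
below. The 5 GB cap is his ballpark for a 7.5 GB machine, not a value
tested here, and pinning -Xms to -Xmx (to avoid heap-resize churn) is
an added assumption, not something from this thread; the remaining
flags are unchanged.

    # Sketch only: same flags, heap capped well below physical RAM.
    JVM_OPTS=" \
            -ea \
            -Xms5G \
            -Xmx5G \
            -XX:TargetSurvivorRatio=90 \
            -XX:+AggressiveOpts \
            -XX:+UseParNewGC \
            -XX:+UseConcMarkSweepGC \
            -XX:+CMSParallelRemarkEnabled \
            -XX:+HeapDumpOnOutOfMemoryError \
            -XX:SurvivorRatio=128 \
            -XX:MaxTenuringThreshold=0 \
            -Dcom.sun.management.jmxremote.port=8080 \
            -Dcom.sun.management.jmxremote.ssl=false \
            -Dcom.sun.management.jmxremote.authenticate=false"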
>> CPU is totally hammered right now, and it is receiving 0 ops/sec from
>> me; I've disconnected it from our application until I can figure out
>> what's going on.
>>
>> Running top on the machine I get:
>>
>> top - 18:56:32 up 2 days, 20:57,  2 users,  load average: 14.97, 15.24, 15.13
>> Tasks:  87 total,   5 running,  82 sleeping,   0 stopped,   0 zombie
>> Cpu(s): 40.1%us, 33.9%sy,  0.0%ni,  0.1%id,  0.0%wa,  0.0%hi,  1.3%si, 24.6%st
>> Mem:   7872040k total,  3618764k used,  4253276k free,   387536k buffers
>> Swap:        0k total,        0k used,        0k free,  1655556k cached
>>
>>   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  2566 cassandr 25   0 7906m 639m  10m S  150  8.3  5846:35 java
>>
>> I have jconsole up and running, and the jconsole VM Summary tab says:
>>  - Total physical memory:    7,872,040 K
>>  - Free physical memory:     4,253,036 K
>>  - Total swap space:         0 K
>>  - Free swap space:          0 K
>>  - Committed virtual memory: 8,096,648 K
>>
>> Is there a specific thread I can look at in jconsole that might give
>> me a clue? It's weird that it's still at 100% CPU even though it's
>> getting no traffic from outside right now. I suppose it might still
>> be talking across the machines, though.
>>
>> Also, stopping and starting Cassandra on one of the 4 machines
>> brought the CPU back down to almost normal levels.
>>
>> Here's the ring:
>>
>> Address       Status  Load     Range                                     Ring
>>                                170141183460469231731687303715884105728
>> 10.251.XX.XX  Up      2.15 MB  42535295865117307932921825928971026432   |<--|
>> 10.250.XX.XX  Up      2.42 MB  85070591730234615865843651857942052864   |   |
>> 10.250.XX.XX  Up      2.47 MB  127605887595351923798765477786913079296  |   |
>> 10.250.XX.XX  Up      2.46 MB  170141183460469231731687303715884105728  |-->|
>>
>> Any thoughts?
>>
>> Best,
>>
>> Curt
>> --
>> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
>> http://apps.facebook.com/happyhabitat
>>
>> On Mon, May 17, 2010 at 3:51 PM, Mark Greene <greenemj@gmail.com> wrote:
>>
>>> Can you provide us with the current JVM args? Also, what type of
>>> workload are you giving the ring (op/s)?
>>>
>>> On Mon, May 17, 2010 at 6:39 PM, Curt Bererton <curt@zipzapplay.com> wrote:
>>>
>>>> Hello Cassandra users+experts,
>>>>
>>>> Hopefully someone will be able to point me in the correct direction.
>>>> We have Cassandra 0.6.1 working on our test servers and we *thought*
>>>> everything was great and ready to move to production. We are
>>>> currently running a ring of 4 large EC2 instances
>>>> (http://aws.amazon.com/ec2/instance-types/) in production with a
>>>> replication factor of 3 and QUORUM consistency level. We ran a test
>>>> on 1% of our users, and everything was writing to and reading from
>>>> Cassandra just fine for the first 3 hours. After that point, CPU
>>>> usage spiked to 100% and stayed there on all 4 machines at once.
>>>> This smells to me like a GC issue, and I'm looking into it with
>>>> jconsole right now. If anyone can help me debug this and get
>>>> Cassandra all the way up and running without the CPU spiking, I
>>>> would be forever in their debt.
>>>>
>>>> I suspect that anyone else running Cassandra on large EC2 instances
>>>> might just be able to tell me what JVM args they are successfully
>>>> using in a production environment, whether they upgraded from 0.6.1
>>>> to 0.6.2, and whether they went to batched writes due to bug 1014
>>>> (https://issues.apache.org/jira/browse/CASSANDRA-1014). That might
>>>> answer all my questions.
>>>>
>>>> Is there anyone on the list who is using large EC2 instances in
>>>> production? Would you be kind enough to share your JVM arguments
>>>> and any other tips?
>>>>
>>>> Thanks for any help,
>>>> Curt
>>>> --
>>>> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
>>>> http://apps.facebook.com/happyhabitat
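On the GC suspicion raised in the original message above: with the Sun
JDK already in use here, collector activity can be confirmed or ruled
out from the shell without jconsole. A minimal check, using the
Cassandra PID from the top output quoted above:

    # Print heap-generation occupancy and cumulative GC time every 5s.
    # If GCT climbs nearly as fast as wall-clock time on an idle node,
    # the CPU is going to garbage collection; if it barely moves, look
    # elsewhere (e.g. the 24.6%st steal time visible in top).
    jstat -gcutil 2566 5000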
