Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: local policy)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=thelastpickle.com; h=from
	:mime-version:content-type:subject:date:in-reply-to:to
	:references:message-id; q=dns; s=thelastpickle.com; b=TmFqQxnImp
	S7V6iYnIXepnuNwTAHUqCKMGzNkr+QMIv+ME8mQkWAbO8MwHtVjvMj8wbhss8aq1
	06N/gBiPMSE0HEFrcr+YjyhYGS6G/HM8q21vj2DdL3CdZ43B1g58STEXC8ezIDvk
	UQDh5lojLfJ26pZf4KBpU4cBOX2mX+GTI=
From: aaron morton <aaron@thelastpickle.com>
Mime-Version: 1.0 (Apple Message framework v1278)
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_CAB8607F-8B68-4874-9E8D-EFB8343A535F"
Subject: Re: frequent node up/downs
Date: Sat, 7 Jul 2012 07:09:18 +1200
In-Reply-To: 
 <CAGNi5tknFXDkbXAupNJ5wkry5VSxQ1zPQmVXxPuuFQ+RYPeoFQ@mail.gmail.com>
To: user@cassandra.apache.org
References: 
 <CAGNi5tmLWgMjZmEY_WbsPb0DUXH3zWCVfNCL9AjHOPEAuc=ukg@mail.gmail.com>
 <20120702121753.2A63.C3984673@terra.com.br>
 <CAGNi5tn_AMKBWiEjESjPh_Cv9aGsj9LvnwG6vYK1ts7Aj_Ab=w@mail.gmail.com>
 <CAGNi5tnS+5qpyWJ7E-aMcBh-DbJ=vazd3vo-Gv+gkXUtom6Jyw@mail.gmail.com>
 <09277B7E-4ADE-4FDE-99E7-ED386CCFADC4@thelastpickle.com>
 <CAGNi5tknFXDkbXAupNJ5wkry5VSxQ1zPQmVXxPuuFQ+RYPeoFQ@mail.gmail.com>
Message-Id: <102464DA-2C4B-4E58-84E2-C48572C9F2DF@thelastpickle.com>


--Apple-Mail=_CAB8607F-8B68-4874-9E8D-EFB8343A535F
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=iso-8859-1

> It looks like this happens when there is a promotion failure.=20

Java Heap is full.=20
Memory is fragmented.=20
Use C for web scale.=20

> Also is it normal to see the "Heap is xx full.  You may need to reduce =
memtable and/or cache sizes" message quite often? I haven't turned on =
row caches or changed any default memtable size settings so I am =
wondering why the old gen fills up.

It's odd to get that out of the box with an 8GB heap on a 1.1.X install.=20=


What sort of workload is it? Is the node under heavy inserts?
Do you have a lot of CFs? A lot of secondary indexes?
After the messages are logged, is the node able to reduce heap usage?
Does it seem to correlate with compactions?
Is the node able to get back to a healthy state?
If this is a test, are you able to pull back to a workload where the =
issues do not appear?=20
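
If it helps with the correlation question, here is a rough sketch (plain
Python, not something that ships with Cassandra) for pulling the long
pauses out of the GC log so they can be lined up by hand against the
compaction entries in system.log. It assumes the log was written with
-XX:+PrintGCDetails, that each collection is reported on a single line
ending in "[Times: ... real=N.NN secs]", and that a promotion failure
shows up as the text "promotion failed" on that same line. The function
name and the 1 second default threshold are made up for illustration.

```python
import re

# Assumption: HotSpot -XX:+PrintGCDetails output, where each collection
# line ends with "[Times: user=... sys=..., real=N.NN secs]".
REAL_TIME = re.compile(r"real=(\d+\.\d+) secs")


def find_long_pauses(lines, threshold_secs=1.0):
    """Return (line_number, real_secs, promotion_failed) for slow GCs."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = REAL_TIME.search(line)
        if match is None:
            continue  # not a line that reports wall-clock pause time
        real_secs = float(match.group(1))
        if real_secs >= threshold_secs:
            # Flag collections where ParNew could not promote into old gen.
            hits.append((number, real_secs, "promotion failed" in line))
    return hits
```

To use it, open the GC log and pass the file object in, for example
find_long_pauses(open("gc.log"), threshold_secs=5.0), where "gc.log" is
a placeholder for wherever your -Xloggc setting points. If concurrent
CMS messages are interleaved into the middle of a collection line, this
simple per-line match may miss the event.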


Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 7/07/2012, at 4:33 AM, feedly team wrote:

> I reduced the load and the problem hasn't been happening as much. =
After enabling gc logging, I see messages mentioning promotion failed =
when the pauses happen. It looks like this happens when there is a =
promotion failure. =46rom reading on the web it looks like I could try =
reducing the CMSInitiatingOccupancyFraction value and/or decreasing the =
young gen size to try to avoid this scenario.
>=20
> Also is it normal to see the "Heap is xx full.  You may need to reduce =
memtable and/or cache sizes" message quite often? I haven't turned on =
row caches or changed any default memtable size settings so I am =
wondering why the old gen fills up.
>=20
>=20
> On Wed, Jul 4, 2012 at 6:28 AM, aaron morton <aaron@thelastpickle.com> =
wrote:
>> What accounts for the much larger virtual number? some kind of =
off-heap memory?=20
> http://wiki.apache.org/cassandra/FAQ#mmap
>=20
>> I'm a little puzzled as to why I would get such long pauses without =
swapping.=20
> The two are not related. On startup the JVM memory is locked so it =
will not swap; from then on memory management is pretty much up to =
the JVM.=20
>=20
> Getting a lot of ParNew activity does not mean the JVM is low on =
memory; it means there is a lot of activity in the new generation.=20
>=20
> If you have a lot of insert activity (typically in a load test) you =
can generate a lot of GC activity. Try reducing the load to a point =
where it does not hit GC and then increase it to find the cause. Also, =
if you can connect JConsole to the JVM you may get a better view of the =
heap usage.
>=20
> Hope that helps.=20
>=20
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>=20
> On 3/07/2012, at 3:41 PM, feedly team wrote:
>=20
>> Couple more details. I confirmed that swap space is not being used =
(free -m shows 0 swap) and cassandra.log has a message like "JNA =
mlockall successful". top shows the process having 9g in resident memory =
but 21.6g in virtual...What accounts for the much larger virtual number? =
some kind of off-heap memory?=20
>>=20
>> I'm a little puzzled as to why I would get such long pauses without =
swapping. I uncommented all the gc logging options in cassandra-env.sh =
to try to see what is going on when the node freezes.
>>=20
>> Thanks
>> Kireet
>>=20
>> On Mon, Jul 2, 2012 at 9:51 PM, feedly team <feedlydev@gmail.com> =
wrote:
>> Yeah I noticed the leap second problem and ran the suggested fix, but =
I had been facing these problems before Saturday and still see the =
occasional failures after running the fix.=20
>>=20
>> Thanks.
>>=20
>>=20
>> On Mon, Jul 2, 2012 at 11:17 AM, Marcus Both <mboth@terra.com.br> =
wrote:
>> Yeah! Look at that.
>> =
http://arstechnica.com/business/2012/07/one-day-later-the-leap-second-v-th=
e-internet-scorecard/
>> I had the same problem. The solution was rebooting.
>>=20
>> On Mon, 2 Jul 2012 11:08:57 -0400
>> feedly team <feedlydev@gmail.com> wrote:
>>=20
>> > Hello,
>> >    I recently set up a 2 node cassandra cluster on dedicated =
hardware. In
>> > the logs there have been a lot of "InetAddress xxx is now dead" or =
UP
>> > messages. Comparing the log messages between the 2 nodes, they seem =
to
>> > coincide with extremely long ParNew collections. I have seen some =
of up to
>> > 50 seconds. The installation is pretty vanilla, I didn't change any
>> > settings and the machines don't seem particularly busy - cassandra =
is the
>> > only thing running on the machine with an 8GB heap. The machine has =
64GB of
>> > RAM and CPU/IO usage looks pretty light. I do see a lot of 'Heap is =
xxx
>> > full. You may need to reduce memtable and/or cache sizes' messages. =
Would
>> > this help with the long ParNew collections? That message seems to =
be
>> > triggered on a full collection.
>>=20
>> --
>> Marcus Both
>>=20
>>=20
>>=20
>=20
>=20


--Apple-Mail=_CAB8607F-8B68-4874-9E8D-EFB8343A535F--