From: aaron morton <aaron@thelastpickle.com>
Subject: Re: ReplicateOnWriteStage exception causes a backlog in MutationStage that never clears
Date: Thu, 22 Mar 2012 06:24:31 +1300
To: user@cassandra.apache.org

The node is overloaded with hints.

I'll just grab the comments from the code…

            // avoid OOMing due to excess hints.  we need to do this check even for "live" nodes, since we can
            // still generate hints for those if it's overloaded or simply dead but not yet known-to-be-dead.
            // The idea is that if we have over maxHintsInProgress hints in flight, this is probably due to
            // a small number of nodes causing problems, so we should avoid shutting down writes completely to
            // healthy nodes.  Any node with no hintsInProgress is considered healthy.
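To make that concrete, the check those comments sit above behaves roughly like the sketch below. This is a simplified paraphrase, not the actual 1.0.7 StorageProxy source: the class name, the example cap of 1024, and the helper fields are assumptions for illustration only. The real check is what ends up throwing the TimeoutException you see at StorageProxy.sendToHintedEndpoints.

import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: names and the cap value are assumptions,
// not the exact fields in org.apache.cassandra.service.StorageProxy.
class HintThrottleSketch
{
    // node-wide cap on hints allowed to be in flight at once (example value)
    static final int maxHintsInProgress = 1024;
    // how many hints are currently being written, in total and per target node
    static final AtomicInteger totalHintsInProgress = new AtomicInteger();
    static final ConcurrentHashMap<InetAddress, AtomicInteger> hintsInProgress =
            new ConcurrentHashMap<InetAddress, AtomicInteger>();

    // Called on the write path before hinting a mutation for 'target'.
    static void checkHintOverload(InetAddress target) throws TimeoutException
    {
        AtomicInteger targetHints = hintsInProgress.get(target);
        // Over the global cap AND this target already has hints queued:
        // drop the write (TimeoutException) instead of piling up more hints.
        // A target with no hints in progress is considered healthy, so writes
        // to it keep flowing while the node sheds load for the problem nodes.
        if (totalHintsInProgress.get() > maxHintsInProgress
                && targetHints != null && targetHints.get() > 0)
        {
            throw new TimeoutException("too many in-flight hints, dropping write for " + target);
        }
    }
}

In other words, when a node is in this state the write path is deliberately shedding load, which is why the ReplicateOnWriteStage task dies with a TimeoutException rather than queueing forever.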
Are the nodes going up and down a lot? Are they under GC pressure? The other possibility is that you have overloaded the cluster.
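If it helps, the per-stage backlog that nodetool tpstats reports is also available over JMX, so you can watch it while the load builds. A minimal poller might look like the sketch below; the MBean name and the default JMX port 7199 are what 1.0.x-era nodes normally expose, but treat them as assumptions and check against your build:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Assumption: MBean names mirror what nodetool tpstats reads on 1.0.x
// ("org.apache.cassandra.request:type=<Stage>", attribute "PendingTasks"),
// and the node exposes JMX on the default port 7199. Verify on your build.
public class StageBacklog
{
    public static void main(String[] args) throws Exception
    {
        String host = args.length > 0 ? args[0] : "localhost";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            for (String stage : new String[] { "MutationStage", "ReplicateOnWriteStage" })
            {
                // Pending tasks per request stage; a number that climbs and never
                // drains is the backlog you are describing.
                ObjectName name = new ObjectName("org.apache.cassandra.request:type=" + stage);
                System.out.println(stage + " pending: " + conn.getAttribute(name, "PendingTasks"));
            }
        }
        finally
        {
            jmxc.close();
        }
    }
}

GC pressure usually shows up as GCInspector lines in the Cassandra log, and nodetool ring will tell you whether the nodes see each other flapping.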

I'll just = grab the comments from code=85

  =           // avoid OOMing due to excess hints. =  we need to do this check even for "live" nodes, since we = can
            // still = generate hints for those if it's overloaded or simply dead but not yet = known-to-be-dead.
            // = The idea is that if we have over maxHintsInProgress hints in flight, = this is probably due to
          =   // a small number of nodes causing problems, so we should avoid = shutting down writes completely to
        =     // healthy nodes.  Any node with no hintsInProgress = is considered healthy.

Are the nodes going up = and down a lot ? Are they under GC pressure. The other possibility is = that you have overloaded the = cluster. 

Cheers


http://www.thelastpickle.com

On 22/03/2012, at 3:20 AM, Thomas van Neerijnen = wrote:

Hi all

I'm running into a weird error on Cassandra = 1.0.7.
As my clusters load gets heavier many of the nodes seem to hit = the same error around the same time, resulting in MutationStage backing = up and never clearing down. The only way to recover the cluster is to = kill all the nodes and start them up again. The error is as below and is = repeated continuously until I kill the Cassandra process.

ERROR [ReplicateOnWriteStage:57] 2012-03-21 14:02:05,099 = AbstractCassandraDaemon.java (line 139) Fatal exception in thread = Thread[ReplicateOnWriteStage:57,5,main]
java.lang.RuntimeException: = java.util.concurrent.TimeoutException
        at = org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StoragePro= xy.java:1227)
        at = java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.= java:886)
        at = java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java= :908)
        at = java.lang.Thread.run(Thread.java:662)
Caused by: = java.util.concurrent.TimeoutException
     &nb= sp;  at = org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StoragePro= xy.java:301)
        at = org.apache.cassandra.service.StorageProxy$7$1.runMayThrow(StorageProxy.jav= a:544)
        at = org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StoragePro= xy.java:1223)
        ... 3 = more


= --Apple-Mail=_C05074D3-6E55-44EB-BF32-8441B88B8480--