incubator-cassandra-user mailing list archives

From Thomas van Neerijnen <...@bossastudios.com>
Subject Re: ReplicateOnWriteStage exception causes a backlog in MutationStage that never clears
Date Thu, 22 Mar 2012 01:55:06 GMT
Hi

I'm going with yes to all three of your questions.

I found a very heavily hit index, which we have since reworked to remove the
secondary index entirely.
That fixed a large portion of the problem, but during the panic of the
overloaded cluster we did the simple scale-out trick of doubling the
cluster. In the rush, two of the 7 new nodes accidentally ended
up on EC2 EBS volumes instead of the usual ephemeral RAID10.
So, same error, but this time all nodes report only the two EBS-backed
nodes as down instead of the whole cluster getting weird.
I'm rsyncing the data off the EBS volumes onto ephemeral RAID10 arrays as
I type, so in the next hour or so I'll know whether this fixes the issue.
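For anyone hitting the same wall: the back-pressure behaviour Aaron describes below (quoting the comment from StorageProxy) can be sketched roughly like this. This is a hypothetical simplification, not the actual Cassandra code; the class and method names are made up for illustration. The idea is that a write is only rejected when the total number of in-flight hints is over a cap AND the target node already has hints queued, so a couple of sick nodes don't shut down writes to healthy ones.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the hint back-pressure check: reject a write to a
// destination only when (a) total in-flight hints exceed the cap and
// (b) that destination already has hints outstanding. Nodes with no
// hints in progress are considered healthy and keep accepting writes.
public class HintThrottle {
    private final int maxHintsInProgress;
    private final Map<String, AtomicInteger> hintsInProgress = new ConcurrentHashMap<>();
    private final AtomicInteger totalHints = new AtomicInteger();

    public HintThrottle(int maxHintsInProgress) {
        this.maxHintsInProgress = maxHintsInProgress;
    }

    // Returns false when the write should be dropped (the TimeoutException case).
    public boolean tryBeginHint(String destination) {
        AtomicInteger perNode =
            hintsInProgress.computeIfAbsent(destination, d -> new AtomicInteger());
        if (totalHints.get() > maxHintsInProgress && perNode.get() > 0) {
            return false; // overloaded, and this node is one of the stragglers
        }
        perNode.incrementAndGet();
        totalHints.incrementAndGet();
        return true;
    }

    // Called when a hint is delivered or expires, releasing its slot.
    public void endHint(String destination) {
        hintsInProgress.get(destination).decrementAndGet();
        totalHints.decrementAndGet();
    }
}
```

With a cap of 1, a third write to a backed-up node is refused while a first write to a healthy node still goes through, which matches the "avoid shutting down writes completely to healthy nodes" intent in the quoted comment.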

On Wed, Mar 21, 2012 at 5:24 PM, aaron morton <aaron@thelastpickle.com> wrote:

> The node is overloaded with hints.
>
> I'll just grab the comments from code…
>
>             // avoid OOMing due to excess hints.  we need to do this check
> even for "live" nodes, since we can
>             // still generate hints for those if it's overloaded or simply
> dead but not yet known-to-be-dead.
>             // The idea is that if we have over maxHintsInProgress hints
> in flight, this is probably due to
>             // a small number of nodes causing problems, so we should
> avoid shutting down writes completely to
>             // healthy nodes.  Any node with no hintsInProgress is
> considered healthy.
>
> Are the nodes going up and down a lot? Are they under GC pressure? The
> other possibility is that you have overloaded the cluster.
>
> Cheers
>
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 22/03/2012, at 3:20 AM, Thomas van Neerijnen wrote:
>
> Hi all
>
> I'm running into a weird error on Cassandra 1.0.7.
> As my cluster's load gets heavier, many of the nodes seem to hit the same
> error around the same time, resulting in MutationStage backing up and never
> clearing. The only way to recover the cluster is to kill all the nodes
> and start them up again. The error is as below and is repeated continuously
> until I kill the Cassandra process.
>
> ERROR [ReplicateOnWriteStage:57] 2012-03-21 14:02:05,099
> AbstractCassandraDaemon.java (line 139) Fatal exception in thread
> Thread[ReplicateOnWriteStage:57,5,main]
> java.lang.RuntimeException: java.util.concurrent.TimeoutException
>         at
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1227)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.util.concurrent.TimeoutException
>         at
> org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:301)
>         at
> org.apache.cassandra.service.StorageProxy$7$1.runMayThrow(StorageProxy.java:544)
>         at
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1223)
>         ... 3 more
>
>
>
