From: aaron morton <aaron@thelastpickle.com>
Subject: Re: ReplicateOnWriteStage exception causes a backlog in MutationStage that never clears
Date: Thu, 22 Mar 2012 06:24:31 +1300
To: user@cassandra.apache.org

The node is overloaded with hints.

I'll just grab the comments from the code…

            // avoid OOMing due to excess hints.  we need to do this check even for "live" nodes, since we can
            // still generate hints for those if it's overloaded or simply dead but not yet known-to-be-dead.
            // The idea is that if we have over maxHintsInProgress hints in flight, this is probably due to
            // a small number of nodes causing problems, so we should avoid shutting down writes completely to
            // healthy nodes.  Any node with no hintsInProgress is considered healthy.
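To make that concrete, the check those comments sit above behaves roughly like the sketch below. This is a simplified paraphrase, not the actual 1.0.7 StorageProxy source: the class name, the example cap of 1024, and the helper fields are assumptions for illustration only. The real check is what ends up throwing the TimeoutException you see at StorageProxy.sendToHintedEndpoints.

import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: names and the cap value are assumptions,
// not the exact fields in org.apache.cassandra.service.StorageProxy.
class HintThrottleSketch
{
    // node-wide cap on hints allowed to be in flight at once (example value)
    static final int maxHintsInProgress = 1024;
    // how many hints are currently being written, in total and per target node
    static final AtomicInteger totalHintsInProgress = new AtomicInteger();
    static final ConcurrentHashMap<InetAddress, AtomicInteger> hintsInProgress =
            new ConcurrentHashMap<InetAddress, AtomicInteger>();

    // Called on the write path before hinting a mutation for 'target'.
    static void checkHintOverload(InetAddress target) throws TimeoutException
    {
        AtomicInteger targetHints = hintsInProgress.get(target);
        // Over the global cap AND this target already has hints queued:
        // drop the write (TimeoutException) instead of piling up more hints.
        // A target with no hints in progress is considered healthy, so writes
        // to it keep flowing while the node sheds load for the problem nodes.
        if (totalHintsInProgress.get() > maxHintsInProgress
                && targetHints != null && targetHints.get() > 0)
        {
            throw new TimeoutException("too many in-flight hints, dropping write for " + target);
        }
    }
}

In other words, when a node is in this state the write path is deliberately shedding load, which is why the ReplicateOnWriteStage task dies with a TimeoutException rather than queueing forever.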
Are the nodes going up and down a lot? Are they under GC pressure? The other possibility is that you have overloaded the cluster.
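If it helps, the per-stage backlog that nodetool tpstats reports is also available over JMX, so you can watch it while the load builds. A minimal poller might look like the sketch below; the MBean name and the default JMX port 7199 are what 1.0.x-era nodes normally expose, but treat them as assumptions and check against your build:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Assumption: MBean names mirror what nodetool tpstats reads on 1.0.x
// ("org.apache.cassandra.request:type=<Stage>", attribute "PendingTasks"),
// and the node exposes JMX on the default port 7199. Verify on your build.
public class StageBacklog
{
    public static void main(String[] args) throws Exception
    {
        String host = args.length > 0 ? args[0] : "localhost";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            for (String stage : new String[] { "MutationStage", "ReplicateOnWriteStage" })
            {
                // Pending tasks per request stage; a number that climbs and never
                // drains is the backlog you are describing.
                ObjectName name = new ObjectName("org.apache.cassandra.request:type=" + stage);
                System.out.println(stage + " pending: " + conn.getAttribute(name, "PendingTasks"));
            }
        }
        finally
        {
            jmxc.close();
        }
    }
}

GC pressure usually shows up as GCInspector lines in the Cassandra log, and nodetool ring will tell you whether the nodes see each other flapping.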

I'll just = grab the comments from code=85

  =           // avoid OOMing due to excess hints. =  we need to do this check even for "live" nodes, since we = can
            // still = generate hints for those if it's overloaded or simply dead but not yet = known-to-be-dead.
            // = The idea is that if we have over maxHintsInProgress hints in flight, = this is probably due to
          =   // a small number of nodes causing problems, so we should avoid = shutting down writes completely to
        =     // healthy nodes.  Any node with no hintsInProgress = is considered healthy.

Are the nodes going up = and down a lot ? Are they under GC pressure. The other possibility is = that you have overloaded the = cluster. 

Cheers


http://www.thelastpickle.com

On 22/03/2012, at 3:20 AM, Thomas van Neerijnen = wrote:

Hi all

I'm running into a weird error on Cassandra = 1.0.7.
As my clusters load gets heavier many of the nodes seem to hit = the same error around the same time, resulting in MutationStage backing = up and never clearing down. The only way to recover the cluster is to = kill all the nodes and start them up again. The error is as below and is = repeated continuously until I kill the Cassandra process.

ERROR [ReplicateOnWriteStage:57] 2012-03-21 14:02:05,099 = AbstractCassandraDaemon.java (line 139) Fatal exception in thread = Thread[ReplicateOnWriteStage:57,5,main]
java.lang.RuntimeException: = java.util.concurrent.TimeoutException
        at = org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StoragePro= xy.java:1227)
        at = java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.= java:886)
        at = java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java= :908)
        at = java.lang.Thread.run(Thread.java:662)
Caused by: = java.util.concurrent.TimeoutException
     &nb= sp;  at = org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StoragePro= xy.java:301)
        at = org.apache.cassandra.service.StorageProxy$7$1.runMayThrow(StorageProxy.jav= a:544)
        at = org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StoragePro= xy.java:1223)
        ... 3 = more


= --Apple-Mail=_C05074D3-6E55-44EB-BF32-8441B88B8480--