Return-Path:
X-Original-To: apmail-cassandra-user-archive@www.apache.org
Delivered-To: apmail-cassandra-user-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
    by minotaur.apache.org (Postfix) with SMTP id 80EF911A8C
    for ; Tue, 1 Jul 2014 18:21:20 +0000 (UTC)
Received: (qmail 95069 invoked by uid 500); 1 Jul 2014 18:21:17 -0000
Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org
Received: (qmail 95035 invoked by uid 500); 1 Jul 2014 18:21:17 -0000
Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: user@cassandra.apache.org
Delivered-To: mailing list user@cassandra.apache.org
Received: (qmail 95025 invoked by uid 99); 1 Jul 2014 18:21:17 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230)
    by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 01 Jul 2014 18:21:17 +0000
X-ASF-Spam-Status: No, hits=3.2 required=5.0 tests=HTML_MESSAGE,SPF_SOFTFAIL
X-Spam-Check-By: apache.org
Received-SPF: softfail (nike.apache.org: transitioning domain of tarbox@cabotresearch.com does not designate 100.0.119.58 as permitted sender)
Received: from [100.0.119.58] (HELO scmgateway.cabotresearch.com) (100.0.119.58)
    by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 01 Jul 2014 18:21:14 +0000
Received: from mail-oa0-f47.google.com (unknown [209.85.219.47])
    by scmgateway.cabotresearch.com with smtp (TLS: TLSv1/SSLv3,128bits,RC4-SHA)
    id 5dec_06ac_7daec816_70c4_4e37_97ee_19af66a43106; Tue, 01 Jul 2014 14:10:50 -0400
Received: by mail-oa0-f47.google.com with SMTP id n16so10927005oag.20
    for ; Tue, 01 Jul 2014 11:20:46 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
    d=1e100.net; s=20130820;
    h=x-gm-message-state:mime-version:in-reply-to:references:date
    :message-id:subject:from:to:content-type;
    bh=zqVzRrYRCIL9b3FRj7i9N38dsBg0Zgig18QnwXA5wFM=;
    b=erZFq7ze//4iwYiqXZqMvFwPcaZSvA2dSLbdIYz/+GTO2AutAeiKVN7LwRRE4KvYah
    byYrKITT11s45vUDrVBrbPd+t6y7/A2et0lnZwFZpP92pept33bdEi3y0K3wEC7Fg/5s
    asdAbDa8HJW3RJ0xc5f46a6tPsFoy+wVkB6oKxb4Vl2Wgyq4beBuxealff0NfBQSQ1kM
    Wb3jRU/r8UN9hB9xnvVMob+r/c2w/rbW/C/LyO+cICF1wmHM6e4eQLWhHDeL0YTBYCB1
    c9l670tDoS6jSyJRENlVHLfaQu2mubNkaU+PZVaYunC06zkxL0t2WDXgdZQ0DNjZgT7X
    nqxQ==
X-Gm-Message-State: ALoCoQmTEqdbSsQTtXD8sC9hFsyMjVa6Xr6O0ipQvQwj8b/h6W4e1LD3x1FCA2rqxtuaLaO/OQbP75ktJDAH0S69rCFJof1cuD3ic9p5iwhMY2xpMDtQBCebZUHQm/wcMmrmzIjXb3ZB
X-Received: by 10.60.59.4 with SMTP id v4mr18251813oeq.63.1404238846651;
    Tue, 01 Jul 2014 11:20:46 -0700 (PDT)
MIME-Version: 1.0
X-Received: by 10.60.59.4 with SMTP id v4mr18251801oeq.63.1404238846567;
    Tue, 01 Jul 2014 11:20:46 -0700 (PDT)
Received: by 10.202.7.82 with HTTP; Tue, 1 Jul 2014 11:20:46 -0700 (PDT)
In-Reply-To:
References:
Date: Tue, 1 Jul 2014 14:20:46 -0400
Message-ID:
Subject: Re: nodetool repair saying "starting" and then nothing, and nothing in any of the server logs either
From: Brian Tarbox
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=089e0129458ca13dbc04fd25d5dc
X-Virus-Checked: Checked by ClamAV on apache.org

--089e0129458ca13dbc04fd25d5dc
Content-Type: text/plain; charset=UTF-8

Does this output from jstack indicate a problem?

"ReadRepairStage:12170" daemon prio=10 tid=0x00007f9dcc018800 nid=0x7361 waiting on condition [0x00007f9db540c000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000613e049d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

"ReadRepairStage:12169" daemon prio=10 tid=0x00007f9dd4009000 nid=0x7340 waiting on condition [0x00007f9db53cb000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000613e049d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

"ReadRepairStage:12168" daemon prio=10 tid=0x00007f9dd001d000 nid=0x733f waiting on condition [0x00007f9db51a6000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000613e049d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)


On Tue, Jul 1, 2014 at 2:09 PM, Brian Tarbox <tarbox@cabotresearch.com> wrote:

> We're running 1.2.13.
>
> Any chance that doing a rolling-restart would help?
>
> Would running without the "-pr" improve the odds?
>
> Thanks.
>
>
> On Tue, Jul 1, 2014 at 1:40 PM, Robert Coli <rcoli@eventbrite.com> wrote:
>
>> On Tue, Jul 1, 2014 at 9:24 AM, Brian Tarbox <tarbox@cabotresearch.com>
>> wrote:
>>
>>> I have a six node cluster in AWS (repl:3) and recently noticed that
>>> repair was hanging. I've run with the "-pr" switch.
>>>
>>
>> It'll do that.
>>
>> What version of Cassandra?
>>
>> =Rob
>>
>>
>
>

--089e0129458ca13dbc04fd25d5dc
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

--089e0129458ca13dbc04fd25d5dc--