incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: nodetool repair does not return...
Date Thu, 25 Aug 2011 22:07:32 GMT
That's a thread waiting for other threads / activities to complete. Nothing unusual there.


Work out how far the repair gets. Is there a validation compaction listed in nodetool
compactionstats? Are there any streams running in nodetool netstats?
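
A rough sketch of running both checks against every node in one go. The host names passed in, and the assumption that nodetool is on the PATH and takes the host via -h, are illustrative and not taken from the original report:

```python
import subprocess

# The two nodetool subcommands mentioned above.
CHECKS = ("compactionstats", "netstats")


def nodetool_cmd(host, command, nodetool="nodetool"):
    """Build the argument list for one nodetool call against `host`.

    Assumes nodetool is on the PATH and accepts the host via -h.
    """
    return [nodetool, "-h", host, command]


def run_nodetool(args):
    """Run one nodetool command and return whatever it printed."""
    result = subprocess.run(args, capture_output=True, text=True)
    return result.stdout


def repair_progress(hosts, run=run_nodetool):
    """Collect compactionstats and netstats output from each host.

    `run` is injectable so the collection logic can be exercised
    without a live cluster.
    """
    report = {}
    for host in hosts:
        report[host] = {cmd: run(nodetool_cmd(host, cmd)) for cmd in CHECKS}
    return report
```

Calling repair_progress(["node1", "node2"]) (hypothetical host names) returns the raw output per node, so you can see at a glance whether a validation compaction or a stream is still active anywhere.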


Look through the logs on the machine you started the repair on and follow the messages from
the AntiEntropyService. They say when it sends requests to other nodes to build the Merkle
tree and when it gets the responses back. You can then check whether the other nodes respond.
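
To make that easier, here is a minimal sketch that pulls only the AntiEntropyService lines out of a log file. It assumes nothing about the log line format beyond the class name appearing in the line, and the path you pass to scan_log is whatever your install uses for system.log:

```python
def anti_entropy_lines(log_lines, needle="AntiEntropyService"):
    """Yield (line_number, line) for each log line that mentions `needle`."""
    for number, line in enumerate(log_lines, start=1):
        if needle in line:
            yield number, line.rstrip("\n")


def scan_log(path, needle="AntiEntropyService"):
    """Read the log at `path` and return the matching lines as a list."""
    with open(path, errors="replace") as handle:
        return list(anti_entropy_lines(handle, needle))
```

Run scan_log on the system.log of each node and compare the results side by side: a request logged on the initiating node with no matching activity on the other node would point at the message never arriving.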

Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 25/08/2011, at 7:02 PM, Boris Yen wrote:

> We dumped the thread stack traces and noticed this:
> 
> "manual-repair-d08349af-189f-47cb-9cc3-452538ce04d1" daemon prio=10 tid=0x00000000406a3000 nid=0x1890 waiting on condition [0x00007f5c97be8000]
>    java.lang.Thread.State: WAITING (parking)
> 	at sun.misc.Unsafe.park(Native Method)
> 	- parking to wait for  <0x00007f5d4acf0f38> (a java.util.concurrent.CountDownLatch$Sync)
> 	at java.util.concurrent.locks.LockSupport.park(Unknown Source)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(Unknown Source)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(Unknown Source)
> 	at java.util.concurrent.CountDownLatch.await(Unknown Source)
> 	at org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:665)
> 
> This seems to be the thread which causes the repair to hang.
> 
> We also noticed another odd thing: sometimes we can see lots of [WRITE-/...] threads.
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)
> 
> On Thu, Aug 25, 2011 at 11:10 AM, Boris Yen <yulinyen@gmail.com> wrote:
> Would Cassandra-2433 cause this?
> 
> 
> On Wed, Aug 24, 2011 at 7:23 PM, Boris Yen <yulinyen@gmail.com> wrote:
> Hi,
> 
> In our testing environment, we have two nodes with RF=2 running 0.8.4. We tried to test
> the repair functions of Cassandra; however, every once in a while, "nodetool repair" never
> returns. We have checked system.log and nothing seems out of the ordinary: no errors, no
> exceptions. The data is only 50 MB, and it is consistently updated.
> 
> Shutting down one node during the repair process can cause a similar symptom. So our
> original thought was that maybe one of the TreeRequests is not sent to the other node correctly,
> which might cause the repair to run forever. However, I did not see any relevant log messages
> to support that. I am running out of ideas about this... Has anyone else had this problem?
> 
> Regards
> Boris
> 
> 

