cassandra-user mailing list archives

From Boris Yen <yulin...@gmail.com>
Subject Re: nodetool repair does not return...
Date Fri, 26 Aug 2011 02:00:37 GMT
No pending tasks for compactionstats and netstats.

On Fri, Aug 26, 2011 at 6:07 AM, aaron morton <aaron@thelastpickle.com> wrote:

> That's a thread waiting for other threads / activities to complete. Nothing
> unusual there.
>
> Work out how far the repair gets. Is there a validation compaction listed
> in nodetool compactionstats? Are there any streams running in nodetool
> netstats?
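One way to follow this suggestion is to poll both commands while the repair is running. The following is a minimal Java sketch, assuming `nodetool` is on the PATH and accepts `-h <host>` as the 0.8 tools do; the class name `NodetoolPoller` and the 10-second interval are arbitrary choices for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class NodetoolPoller {
    // Builds "nodetool -h <host> <subcommand>" (assumes nodetool is on the PATH).
    static List<String> buildCommand(String host, String subcommand) {
        return Arrays.asList("nodetool", "-h", host, subcommand);
    }

    // Runs one nodetool subcommand and returns its combined stdout/stderr.
    static String run(String host, String subcommand) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(host, subcommand));
        pb.redirectErrorStream(true);
        Process p = pb.start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        // Poll every 10 seconds until the process is interrupted (Ctrl-C).
        while (true) {
            for (String sub : new String[] {"compactionstats", "netstats"}) {
                System.out.println("=== nodetool " + sub + " on " + host + " ===");
                System.out.print(run(host, sub));
            }
            Thread.sleep(10_000);
        }
    }
}
```

If a validation compaction or a stream stays listed across many polls, that narrows down which phase of the repair is stuck.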
>
>
> Look through the logs on the machine you started the repair on and follow
> the messages from the AntiEntropyService. They will say when it sends requests
> to the other nodes to build the merkle tree and when it gets the responses back.
> You can then check whether the other nodes respond.
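A minimal sketch of that log check, equivalent to a simple grep. It assumes the default log location `/var/log/cassandra/system.log` and a log layout that includes the source file name, so that lines from the AntiEntropyService contain that string; `RepairLogFilter` is a hypothetical helper name.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class RepairLogFilter {
    // Keeps only the lines that mention AntiEntropyService (the repair coordinator).
    static List<String> antiEntropyLines(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (line.contains("AntiEntropyService")) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass the real path as the first argument.
        String path = args.length > 0 ? args[0] : "/var/log/cassandra/system.log";
        List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
        for (String line : antiEntropyLines(lines)) {
            System.out.println(line);
        }
    }
}
```

Running it on each node and lining up the timestamps shows whether every request sent by the coordinator has a matching response.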
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 25/08/2011, at 7:02 PM, Boris Yen wrote:
>
> We dumped the stack traces of all threads and noticed this one:
>
> "manual-repair-d08349af-189f-47cb-9cc3-452538ce04d1" daemon prio=10
> tid=0x00000000406a3000 nid=0x1890 waiting on condition [0x00007f5c97be8000]
>
>    java.lang.Thread.State: WAITING (parking)
> 	at sun.misc.Unsafe.park(Native Method)
> 	- parking to wait for  <0x00007f5d4acf0f38> (a java.util.concurrent.CountDownLatch$Sync)
> 	at java.util.concurrent.locks.LockSupport.park(Unknown Source)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(Unknown Source)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(Unknown Source)
> 	at java.util.concurrent.CountDownLatch.await(Unknown Source)
> 	at org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:665)
>
>
> This seems to be the thread which causes the repair to hang.
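The parked thread is blocked in `CountDownLatch.await()`. The sketch below is a hypothetical simplification, not Cassandra's `RepairSession` code: a coordinator waits for one response per replica, and if any response is lost, an untimed `await()` never returns. The demo uses the timed `await(long, TimeUnit)` overload so it can report the lost response instead of hanging; `LatchHangDemo` and `awaitResponses` are illustrative names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchHangDemo {
    // Simulates a coordinator waiting for one "tree response" per replica.
    // Returns true if all responses arrived before the timeout expired.
    static boolean awaitResponses(int replicas, int responsesDelivered, long timeoutMillis)
            throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(replicas);
        for (int i = 0; i < responsesDelivered; i++) {
            // Each delivered response counts the latch down exactly once.
            new Thread(latch::countDown, "response-" + i).start();
        }
        // An untimed latch.await() would park forever if a response is lost,
        // matching the WAITING (parking) state in the thread dump above.
        return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("all responses: " + awaitResponses(2, 2, 1000));
        System.out.println("one lost:      " + awaitResponses(2, 1, 1000));
    }
}
```

Running `main` prints `all responses: true` followed by `one lost:      false`. A thread dump taken during an untimed wait would show the same `CountDownLatch$Sync` parking frames as the trace above.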
>
> We also noticed another odd thing: sometimes we see lots of [WRITE-/...]
> threads.
>
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)	
> Thread [WRITE-/10.2.0.87] (Running)
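To see how many threads share a name, a small helper can group thread names and count them. This sketch uses `Thread.getAllStackTraces()`, which only covers the JVM it runs in; for a Cassandra node you would instead feed it the names taken from a `jstack` dump. `ThreadNameCounter` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ThreadNameCounter {
    // Counts threads per distinct name, sorted by name.
    static Map<String, Integer> countByName(List<String> names) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : names) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Inspects the current JVM only; for a Cassandra node, use names from a jstack dump.
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            names.add(t.getName());
        }
        // Print only names shared by more than one thread.
        countByName(names).forEach((name, n) -> {
            if (n > 1) {
                System.out.println(n + "  " + name);
            }
        });
    }
}
```

Fed the nine names listed above, it would report a count of 9 for `WRITE-/10.2.0.87`, which makes it easy to track whether that number keeps growing between dumps.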
>
>
> On Thu, Aug 25, 2011 at 11:10 AM, Boris Yen <yulinyen@gmail.com> wrote:
>
>> Would CASSANDRA-2433 cause this?
>>
>>
>> On Wed, Aug 24, 2011 at 7:23 PM, Boris Yen <yulinyen@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> In our testing environment, we have two nodes with RF=2 running 0.8.4. We
>>> tried to test the repair function of Cassandra; however, every once in a
>>> while, "nodetool repair" never returns. We have checked the system.log and
>>> nothing seems to be out of the ordinary: no errors, no exceptions. The data is
>>> only 50 MB, and it is continuously updated.
>>>
>>> Shutting down one node during the repair process can cause a similar
>>> symptom. So our original thought was that maybe one of the TreeRequests is
>>> not sent to the other node correctly, which might cause the repair to run
>>> forever. However, I did not see any relevant log messages to support that. I am
>>> kind of running out of ideas about this... Has anyone else had this problem?
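For context, repair asks each replica to build a Merkle tree (a hash tree) over its data, then compares the trees and streams only the ranges whose hashes differ, so a tree that never arrives would plausibly leave the session waiting. The Java sketch below is a simplified model, not Cassandra's `MerkleTree` class: each leaf stands for one token range, both nodes are assumed to split the ring into the same ranges, and SHA-256 is an arbitrary hash choice. `MerkleCompare` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MerkleCompare {
    // SHA-256 over the concatenation of the given byte arrays.
    static byte[] sha256(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (byte[] p : parts) md.update(p);
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // levels.get(0) holds leaf hashes; the last level holds the single root.
    // An odd node at the end of a level is paired with itself.
    static List<byte[][]> build(String[] leafData) {
        List<byte[][]> levels = new ArrayList<>();
        byte[][] level = new byte[leafData.length][];
        for (int i = 0; i < leafData.length; i++) {
            level[i] = sha256(leafData[i].getBytes(StandardCharsets.UTF_8));
        }
        levels.add(level);
        while (level.length > 1) {
            byte[][] next = new byte[(level.length + 1) / 2][];
            for (int i = 0; i < next.length; i++) {
                int l = 2 * i, r = Math.min(2 * i + 1, level.length - 1);
                next[i] = sha256(level[l], level[r]);
            }
            levels.add(next);
            level = next;
        }
        return levels;
    }

    // Walks both trees top-down, descending only into subtrees whose hashes differ.
    static void diff(List<byte[][]> a, List<byte[][]> b, int depth, int idx, List<Integer> out) {
        if (Arrays.equals(a.get(depth)[idx], b.get(depth)[idx])) return;
        if (depth == 0) { out.add(idx); return; }
        int width = a.get(depth - 1).length;
        for (int child = 2 * idx; child <= 2 * idx + 1 && child < width; child++) {
            diff(a, b, depth - 1, child, out);
        }
    }

    // Returns the indices of leaf ranges that differ between the two replicas.
    static List<Integer> differingRanges(String[] mine, String[] theirs) {
        List<byte[][]> a = build(mine), b = build(theirs);
        List<Integer> out = new ArrayList<>();
        diff(a, b, a.size() - 1, 0, out);
        return out;
    }

    public static void main(String[] args) {
        String[] node1 = {"r0:v1", "r1:v1", "r2:v1", "r3:v1", "r4:v1"};
        String[] node2 = {"r0:v1", "r1:v2", "r2:v1", "r3:v1", "r4:v9"};
        System.out.println("ranges to stream: " + differingRanges(node1, node2));
    }
}
```

With the sample data, ranges 1 and 4 differ, so `main` prints `ranges to stream: [1, 4]`. The comparison can only start once both trees are available, which is why a lost tree request or response is a reasonable suspect for a repair that never returns.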
>>>
>>> Regards
>>> Boris
>>>
>>
>>
>
>
