cassandra-user mailing list archives

From Vijay <vijay2...@gmail.com>
Subject Re: getting status of long running repair
Date Wed, 09 May 2012 16:10:57 GMT
Are you using Broadcast Address? If yes, then you might be affected by
https://issues.apache.org/jira/browse/CASSANDRA-3503

>>> Nodes are all up while repairing is running.
I should have been clearer: are you seeing messages like the following
(UP/DOWN) in the logs during the period of the repair?
 INFO [GossipStage:1] 2012-05-01 19:52:00,515 Gossiper.java (line 804) InetAddress /xx.xx.xx.xx is now UP
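
If the log is large, counting these transitions per node is quicker than
reading them by eye. A minimal sketch, assuming the message format shown
above; the function name is hypothetical, the log path is whatever your
install uses, and the pattern also accepts "dead" in case a given version
words the down transition that way:

```shell
# Count Gossiper state transitions per node address in a Cassandra log.
# Assumption: each message is on one line in the real log and looks like
#   "... InetAddress /<addr> is now UP"
# "DOWN" and "dead" are both accepted for the down transition (assumption).
count_gossip_flaps() {
  # $1: path to a log file
  grep -Eo 'InetAddress /[^ ]+ is now (UP|DOWN|dead)' "$1" \
    | awk '{ count[$2 " " $5]++ } END { for (k in count) print k, count[k] }' \
    | sort
}

# Usage (path is a placeholder):
#   count_gossip_flaps /path/to/system.log
```

Each output line is "<address> <state> <count>". A node that shows several
transitions inside the repair window would be worth a closer look.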


Regards,
</VJ>



On Wed, May 9, 2012 at 5:49 AM, Bill Au <bill.w.au@gmail.com> wrote:

> I am running 1.0.8.  Two data centers with 8 machines in each DC.  Nodes
> are all up while repairing is running.  No dropped Mutations/Messages.  I
> do see HintedHandoff messages.
>
> Bill
>
>
> On Tue, May 8, 2012 at 11:15 PM, Vijay <vijay2win@gmail.com> wrote:
>
>> What version are you using? Is it a multi-DC setup? Are you seeing a
>> lot of dropped Mutations/Messages? Are the nodes going up and down all the
>> time while the repair is running?
>>
>> Regards,
>> </VJ>
>>
>>
>>
>>
>> On Tue, May 8, 2012 at 2:05 PM, Bill Au <bill.w.au@gmail.com> wrote:
>>
>>> There are no error messages in my log.
>>>
>>> I ended up restarting all the nodes in my cluster.  After that I was
>>> able to run repair successfully on one of the nodes.  It took about 40
>>> minutes.  Feeling lucky I ran repair on another node and it is stuck again.
>>>
>>> tpstats shows 1 active and 1 pending AntiEntropySessions.  netstats and
>>> compactionstats show no activity.  I took a close look at the log file; it
>>> shows that the node requested Merkle trees from 4 nodes (including itself)
>>> and actually received 3 of them.  It looks like it is stuck waiting for
>>> the last one.  I checked the node that request was sent to, and there
>>> isn't anything about repair in its log.  So it looks like the Merkle tree
>>> request got lost somehow.  It has been 8 hours since the repair was
>>> issued and it is still stuck.  I am going to let it run a bit longer to
>>> see if it will eventually finish.
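
The tpstats check described above can be scripted so it is easy to repeat
on every node. A rough sketch, assuming the AntiEntropySessions row lists
the pool name, then the active count, then the pending count; the function
name is hypothetical:

```shell
# Read `nodetool tpstats` output on stdin and print the active and pending
# counts for the AntiEntropySessions pool.
# Assumption: the row is "<pool name> <active> <pending> ...", in that order.
# Returns non-zero if the pool is not listed at all.
repair_sessions() {
  awk '$1 == "AntiEntropySessions" { print "active=" $2, "pending=" $3; found = 1 }
       END { exit !found }'
}

# Usage (assumes nodetool is on PATH):
#   nodetool -h localhost tpstats | repair_sessions
```

Sessions that stay active while netstats and compactionstats stay idle
match the stuck state described in this message.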
>>>
>>> I have observed that if I restart all the nodes, I would be able to run
>>> repair successfully on a single node.  I have done that twice already.  But
>>> after that all repairs will hang.  Since we are supposed to run repair
>>> periodically, having to restart all nodes before running repair on each
>>> node isn't really viable for us.
>>>
>>> Bill
>>>
>>>
>>> On Tue, May 8, 2012 at 6:04 AM, aaron morton <aaron@thelastpickle.com> wrote:
>>>
>>>> When you look in the logs please let me know if you see this error…
>>>> https://issues.apache.org/jira/browse/CASSANDRA-4223
>>>>
>>>> I look at nodetool compactionstats (for the Merkle tree phase),
>>>> nodetool netstats for the streaming, and this to check for streaming
>>>> progress:
>>>>
>>>> while true; do date; diff <(nodetool -h localhost netstats) <(sleep 5 && nodetool -h localhost netstats); done
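
The one-liner above relies on bash process substitution. For a shell that
lacks it, roughly the same check can be written with temporary files. This
is a sketch, not a drop-in replacement: the function name is hypothetical,
and it takes the command to run as arguments so it is not tied to netstats.

```shell
# Run a command twice, INTERVAL seconds apart, then print a timestamp and
# the diff between the two outputs. Returns diff's exit status:
# 0 = no change, 1 = output changed.
snapshot_diff() {
  local interval=$1 before after status
  shift
  before=$(mktemp) || return 1
  after=$(mktemp) || { rm -f "$before"; return 1; }
  "$@" > "$before" 2>&1      # first snapshot
  sleep "$interval"
  "$@" > "$after" 2>&1       # second snapshot
  date
  diff "$before" "$after"
  status=$?
  rm -f "$before" "$after"
  return "$status"
}

# Usage, mirroring the loop above (assumes nodetool is on PATH):
#   while true; do snapshot_diff 5 nodetool -h localhost netstats; done
```

If only the timestamp prints on every pass, nothing changed between
snapshots, which is what a stalled stream would look like.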
>>>>
>>>> Or use DataStax OpsCenter where possible
>>>> http://www.datastax.com/products/opscenter
>>>>
>>>> Cheers
>>>>
>>>>
>>>>   -----------------
>>>> Aaron Morton
>>>> Freelance Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 8/05/2012, at 2:15 PM, Ben Coverston wrote:
>>>>
>>>> Check the log files for warnings or errors. They may indicate why your
>>>> repair failed.
>>>>
>>>> On Mon, May 7, 2012 at 10:09 AM, Bill Au <bill.w.au@gmail.com> wrote:
>>>>
>>>>> I restarted the nodes and then restarted the repair.  It is still
>>>>> hanging like before.  Do I keep repeating until the repair actually finishes?
>>>>>
>>>>> Bill
>>>>>
>>>>>
>>>>> On Fri, May 4, 2012 at 2:18 PM, Rob Coli <rcoli@palominodb.com> wrote:
>>>>>
>>>>>> On Fri, May 4, 2012 at 10:30 AM, Bill Au <bill.w.au@gmail.com> wrote:
>>>>>> > I know repair may take a long time to run.  I am running repair on
>>>>>> > a node with about 15 GB of data and it is taking more than 24 hours.
>>>>>> > Is that normal?  Is there any way to get status of the repair?
>>>>>> > tpstats does show 2 active and 2 pending AntiEntropySessions.  But
>>>>>> > netstats and compactionstats show no activity.
>>>>>>
>>>>>> As indicated by various recent threads to this effect, many versions
>>>>>> of Cassandra (including the current 1.0.x release) contain bugs which
>>>>>> sometimes prevent repair from completing. The other threads suggest
>>>>>> that some of these bugs result in the state you are in now, where you
>>>>>> do not see anything that looks like appropriate activity.
>>>>>> Unfortunately the only solution offered on these other threads is the
>>>>>> one I will now offer, which is to restart the participating nodes and
>>>>>> re-start the repair. I am unaware of any JIRA tickets tracking these
>>>>>> bugs (which doesn't mean they don't exist, of course) so you might
>>>>>> want to file one. :)
>>>>>>
>>>>>> =Rob
>>>>>>
>>>>>> --
>>>>>> =Robert Coli
>>>>>> AIM&GTALK - rcoli@palominodb.com
>>>>>> YAHOO - rcoli.palominob
>>>>>> SKYPE - rcoli_palominodb
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Ben Coverston
>>>> DataStax -- The Apache Cassandra Company
>>>>
>>>>
>>>>
>>>
>>
>
