I am running 1.0.8.  Two data centers with 8 machines in each DC.  Nodes are all up while the repair is running.  No dropped Mutations/Messages.  I do see HintedHandoff messages.


On Tue, May 8, 2012 at 11:15 PM, Vijay <vijay2win@gmail.com> wrote:
What is the version you are using? Is it a multi-DC setup? Are you seeing a lot of dropped Mutations/Messages? Are the nodes going up and down all the time while the repair is running?


On Tue, May 8, 2012 at 2:05 PM, Bill Au <bill.w.au@gmail.com> wrote:
There are no error messages in my log.

I ended up restarting all the nodes in my cluster.  After that I was able to run repair successfully on one of the nodes.  It took about 40 minutes.  Feeling lucky, I ran repair on another node and it is stuck again.

tpstats shows 1 active and 1 pending AntiEntropySessions.  netstats and compactionstats show no activity.  I took a close look at the log file; it shows that the node requested Merkle trees from 4 nodes (including itself).  It actually received 3 of those Merkle trees.  It looks like it is stuck waiting for the last one.  I checked the node the request was sent to, and there isn't anything in its log about repair.  So it looks like the Merkle tree request has gotten lost somehow.  It has been 8 hours since the repair was issued and it is still stuck.  I am going to let it run a bit longer to see if it will eventually finish.
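For anyone else trying to diagnose the same thing: one rough way to see which replicas have responded is to filter the Cassandra log for repair-related lines. This is only a sketch; the log path is the common default, and "repair"/"merkle" are illustrative search terms, since the exact message wording varies by version.

```shell
# Hypothetical default log location; adjust for your install.
LOG=${CASSANDRA_LOG:-/var/log/cassandra/system.log}

# Show the most recent repair / Merkle-tree related lines so you can
# see which replicas have sent their trees back.  The search terms are
# illustrative; exact log wording varies by Cassandra version.
repair_lines() {
    grep -iE 'repair|merkle|AntiEntropy' "$1" | tail -n 20
}

[ -r "$LOG" ] && repair_lines "$LOG" || echo "no readable log at $LOG"
```

Running the same grep on the node that never answered should show whether it ever logged the tree request at all.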

I have observed that after I restart all the nodes, I am able to run repair successfully on a single node.  I have done that twice already.  But after that, all repairs hang.  Since we are supposed to run repair periodically, having to restart all nodes before running repair on each node isn't really viable for us.
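For the record, the restart-then-repair workaround described above can be scripted as a sequential loop. The host names, ssh access, and service name below are all assumptions; this sketch only prints the commands (a dry run) rather than executing anything.

```shell
# Hypothetical host list; replace with your own node names.
HOSTS="cass1 cass2 cass3"

# Dry run: print the restart commands, one node at a time.
for h in $HOSTS; do
    echo "ssh $h 'sudo service cassandra restart'"
done

# Then print the repairs, run sequentially so only one node
# repairs at a time.
for h in $HOSTS; do
    echo "nodetool -h $h repair"
done
```

Removing the `echo`s would make it live, but given that repairs hang after the first node, running them one at a time and checking tpstats in between seems safer.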


On Tue, May 8, 2012 at 6:04 AM, aaron morton <aaron@thelastpickle.com> wrote:
When you look in the logs please let me know if you see this error…

I look at nodetool compactionstats (for the Merkle tree phase) and nodetool netstats (for the streaming phase), and use this to check for streaming progress:

while true; do date; diff <(nodetool -h localhost netstats) <(sleep 5 && nodetool -h localhost netstats); done

Or use DataStax OpsCenter where possible http://www.datastax.com/products/opscenter
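To watch the repair sessions themselves, the AntiEntropySessions line of nodetool tpstats can be polled the same way. A small sketch; the column positions in the awk filter are an assumption based on the 1.0.x tpstats layout (pool name, then Active, then Pending).

```shell
# Pull the Active and Pending counts for AntiEntropySessions out of
# tpstats output.  Field numbers assume the 1.0.x column layout:
# "Pool Name  Active  Pending  Completed ...".
ae_sessions() {
    awk '/^AntiEntropySessions/ {print "active=" $2, "pending=" $3}'
}

# Poll every 30s while a repair runs (needs a live node), e.g.:
# while true; do date; nodetool -h localhost tpstats | ae_sessions; sleep 30; done
```

If active/pending stay nonzero for hours while netstats and compactionstats show nothing, that matches the stuck state described earlier in this thread.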


Aaron Morton
Freelance Developer

On 8/05/2012, at 2:15 PM, Ben Coverston wrote:

Check the log files for warnings or errors. They may indicate why your repair failed.

On Mon, May 7, 2012 at 10:09 AM, Bill Au <bill.w.au@gmail.com> wrote:
I restarted the nodes and then restarted the repair.  It is still hanging like before.  Do I keep repeating this until the repair actually finishes?


On Fri, May 4, 2012 at 2:18 PM, Rob Coli <rcoli@palominodb.com> wrote:
On Fri, May 4, 2012 at 10:30 AM, Bill Au <bill.w.au@gmail.com> wrote:
> I know repair may take a long time to run.  I am running repair on a node
> with about 15 GB of data and it is taking more than 24 hours.  Is that
> normal?  Is there any way to get status of the repair?  tpstats does show 2
> active and 2 pending AntiEntropySessions.  But netstats and compactionstats
> show no activity.

As indicated by various recent threads to this effect, many versions
of Cassandra (including the current 1.0.x release) contain bugs which
sometimes prevent repair from completing. The other threads suggest
that some of these bugs result in the state you are in now, where you
do not see anything that looks like appropriate activity.
Unfortunately the only solution offered on those other threads is the
one I will now offer, which is to restart the participating nodes and
restart the repair. I am unaware of any JIRA tickets tracking these
bugs (which doesn't mean they don't exist, of course), so you might
want to file one. :)


=Robert Coli
AIM&GTALK - rcoli@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb

Ben Coverston
DataStax -- The Apache Cassandra Company