We're testing expanding a 4-node cluster into an 8-node cluster, and we keep running into issues with the repair process near the end.
We're bringing nodes into the cluster one at a time, retokening the nodes for an 8-node configuration, running nodetool cleanup on the nodes after each retokening, and then increasing the replication factor to 5. This all works without issue, and the cluster appears to be healthy in that 8-node configuration with a replication factor of 5.
However, when we then run nodetool repair on the nodes, it stalls at some point, even when run on one of the new nodes.
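To illustrate the retokening step, here is a minimal sketch of how evenly spaced tokens for a 4-node and an 8-node ring can be computed. It assumes RandomPartitioner (token space 0 to 2**127); the partitioner in use is not stated above, and the function name `balanced_tokens` is made up for this example.

```python
# Assumption: RandomPartitioner, whose token space is 0 .. 2**127.
# A cluster on a different partitioner needs a different range.
RING_SIZE = 2 ** 127


def balanced_tokens(node_count, ring_size=RING_SIZE):
    """Return node_count evenly spaced tokens around the ring."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return [i * ring_size // node_count for i in range(node_count)]


if __name__ == "__main__":
    old_ring = set(balanced_tokens(4))
    for slot, token in enumerate(balanced_tokens(8)):
        # Tokens already present in the 4-node layout need no move;
        # the others would be assigned with `nodetool move <token>`.
        status = "existing" if token in old_ring else "new"
        print(slot, token, status)
```

With evenly spaced tokens, every token of the 4-node layout is also a token of the 8-node layout, so under this assumption only the new nodes need new positions.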
It doesn't appear to stall while it's performing a compaction or transferring CF data. We've monitored compactionstats and netstats closely, and things always stall when a repair command is started, i.e.:
[2013-10-02 23:19:39,254] Starting repair command #9, repairing 5 ranges for keyspace ourkeyspace
The last message from AntiEntropyService is usually something to the effect of:
<190>Oct 3 00:01:02 myhost.com 1970947950 [AntiEntropySessions:24] INFO org.apache.cassandra.service.AntiEntropyService - [repair #9b17d310-2bbd-11e3-0000-e06ec6c436ff] session completed successfully
... and then the next repair session never starts. Nothing else in the logs looks related.
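To pin down which session finished last before a stall, the completion messages can be pulled out of the log. This is a rough sketch that matches only the "session completed successfully" wording shown above; the helper name `completed_sessions` is hypothetical, and other AntiEntropyService message formats are not covered.

```python
import re

# Matches the completion message quoted above and captures the session id.
COMPLETED = re.compile(
    r"\[repair #([0-9a-f-]+)\] session completed successfully"
)


def completed_sessions(lines):
    """Return repair session ids, in log order, that reported completion."""
    ids = []
    for line in lines:
        match = COMPLETED.search(line)
        if match:
            ids.append(match.group(1))
    return ids


if __name__ == "__main__":
    sample = [
        "<190>Oct 3 00:01:02 myhost.com 1970947950 [AntiEntropySessions:24] "
        "INFO org.apache.cassandra.service.AntiEntropyService - "
        "[repair #9b17d310-2bbd-11e3-0000-e06ec6c436ff] "
        "session completed successfully",
    ]
    done = completed_sessions(sample)
    # The last entry is the final session that completed before the stall.
    print(done[-1] if done else "no completed sessions found")
```

Comparing the last completed session id across runs would show whether the stall always follows the same range or moves around.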
Where the stall occurs appears to be arbitrary. If I run repair on individual CFs within ourkeyspace, some succeed and some fail, and if we start over and redo the 4-node to 8-node expansion, it fails at a different place.
Any advice on what to look at next?