cassandra-user mailing list archives

From aaron morton <>
Subject Re: Is it safe to stop a read repair and any suggestion on speeding up repairs
Date Thu, 21 Jul 2011 22:27:00 GMT
Nit pick: what nodetool repair runs is just called repair (or the Anti-Entropy Service). Read Repair
is something different, which happens during a read request.

Short answer, yes it's safe to kill cassandra during a repair. It's one of the nice things
about never mutating data. 

Longer answer: if nodetool compactionstats says there are no Validation compactions running
(and the compaction queue is empty), and netstats says nothing is streaming, there is
a good chance the repair is finished or dead. If a neighbour dies during a repair, the node
it was started on will wait for 48 hours(?) until it times out. Check the logs on the machines
for errors, particularly from the AntiEntropyService, and see what compactionstats is saying
on all the nodes involved in the repair.
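As a minimal sketch of the log check above: scan a system log for AntiEntropyService errors. The log path and the exact log line format are assumptions (defaults vary by install); this runs against an inline sample rather than a real node's log.

```python
import re

# Sample log lines standing in for /var/log/cassandra/system.log
# (path and format are assumptions; adjust for your install).
sample_log = """\
 INFO [AntiEntropyStage:1] 2011-07-21 10:00:00,000 AntiEntropyService.java Sending merkle tree
ERROR [AntiEntropyStage:1] 2011-07-21 10:05:00,000 AntiEntropyService.java Repair session failed
 INFO [CompactionExecutor:1] 2011-07-21 10:06:00,000 CompactionManager.java Compacting sstables
"""

# Flag any ERROR line mentioning the AntiEntropyService.
pattern = re.compile(r"ERROR.*AntiEntropyService")
errors = [line for line in sample_log.splitlines() if pattern.search(line)]

for line in errors:
    print(line)
```

On a real node you would read the actual log file instead of the inline sample; an empty result on every node, together with idle compactionstats and netstats, is a reasonable sign the repair is no longer doing anything.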

Even longer: um, 3 TB of data is *way* too much data per node; generally happy people have
up to about 200 to 300 GB per node. The reason for this recommendation is so that things like
repair, compaction, and node moves stay manageable, and because the loss of a single node has
less of an impact. I would not recommend running a live system with that much data per node.
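To put rough numbers on that recommendation: here is the arithmetic for the cluster in this thread (4 nodes at 3 TB each) scaled to the suggested per-node load. The figures come from the thread; this is an illustrative back-of-the-envelope calculation, not a capacity plan.

```python
# Cluster described in the thread.
nodes = 4
per_node_tb = 3.0
total_tb = nodes * per_node_tb            # 12 TB of on-disk data in total

# Upper end of the suggested 200-300 GB per-node range.
target_per_node_gb = 300

# Nodes needed to carry the same data at the recommended density.
needed_nodes = (total_tb * 1024) / target_per_node_gb

print(f"{total_tb} TB total -> ~{needed_nodes:.0f} nodes at {target_per_node_gb} GB each")
```

In other words, the same data set at the recommended density would want on the order of forty nodes, which is why operations like repair take so long at 3 TB per node.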

Hope that helps. 

Aaron Morton
Freelance Cassandra Developer

On 22 Jul 2011, at 03:51, Adi wrote:

> We have a 4 node 0.7.6 cluster, RF=2, 3 TB data per node.
> A read repair was kicked off on node 4 last week and is still in progress.
> Later I kicked off a read repair on node 2 a few days back.
> We were writing (read/write/updates/NO deletes) data while the repair was in progress,
> but no data has been written for the past 3-4 days.
> I was hoping the repair would get done in that time-frame before proceeding further.
> Would it be safe to stop it and kick it off per column family, or do a full scan of all
> keys as suggested in an earlier discussion? Any other suggestions on hastening this repair?
> On both nodes the repair thread has been waiting at this stage for a long time (~60+ hours):
>  java.lang.Thread.State: WAITING
> 	at java.lang.Object.wait(Native Method)
> 	- waiting on <580857f3> (a org.apache.cassandra.utils.SimpleCondition)
> 	at java.lang.Object.wait(
> 	at org.apache.cassandra.utils.SimpleCondition.await(
> 	at org.apache.cassandra.service.AntiEntropyService$
>    Locked ownable synchronizers:
> 	- None
> A CPU sampling for a few minutes shows these methods as hot spots (mostly the top two):
> org.apache.cassandra.db.ColumnFamilyStore.isKeyInRemainingSSTables( )
> org.apache.cassandra.utils.BloomFilter.getHashBuckets( ) 
> netstats does not show anything streaming to/from any of the nodes.
> -Adi Pandit
