Any comments on the repair -pr scenarios below? Please share how you deal with them.


Subject: Handle Node Failure with Repair -pr

I would like comments on my understanding of repair -pr. If you are using repair -pr in your cluster, I believe the following statements hold true:

1. If a node goes down for a long time and you are not sure when it will return, you must ensure that a subrange repair covering the failed node's primary range is run from some other node within gc_grace_seconds?

 I think the mandatory requirement for repair should be restated to make this explicit. When we say that each node must run repair -pr within gc_grace_seconds, we should clearly state that what matters is that each node's primary range gets repaired. If a node is down and gc_grace_seconds is about to expire, care must be taken to run a subrange repair for its range from a different node. Otherwise, no repair -pr job on the live nodes will cover that range, even though every live node followed the rule of running repair -pr within gc_grace_seconds.

2. If you forget to run repair -pr within gc_grace_seconds on one of the nodes, deleting its data folder and auto-bootstrapping it will not help, because that node's primary range was never repaired and any replica that missed a delete can bring the data back. You can only reduce the chance of deleted data reappearing; you cannot prevent it completely.
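
To make the two points above concrete, here is a rough Python sketch of how one might work out which token ranges a down node would have covered with repair -pr, and build the matching subrange repair commands to run from a live replica. The ring layout, node names, and keyspace name are made up for illustration, and the helper functions are hypothetical, not part of any Cassandra tool. The only real pieces assumed are the -st / -et (start token / end token) options of nodetool repair and the default gc_grace_seconds of 864000 (10 days).

```python
from typing import Dict, List, Tuple


def primary_ranges(ring: Dict[int, str], node: str) -> List[Tuple[int, int]]:
    """Return the (start, end] token ranges for which `node` is the primary owner.

    `ring` maps each token to the node that owns it (hypothetical input, e.g.
    collected by hand from `nodetool ring`). The primary range for a token t is
    (previous token on the ring, t]; the smallest token wraps around to the
    largest one.
    """
    tokens = sorted(ring)
    ranges = []
    for i, token in enumerate(tokens):
        if ring[token] == node:
            # For i == 0, tokens[-1] is the largest token, giving the wrap-around range.
            ranges.append((tokens[i - 1], token))
    return ranges


def subrange_repair_commands(ring: Dict[int, str], down_node: str, keyspace: str) -> List[str]:
    """Build one subrange repair command per primary range of `down_node`.

    The commands are meant to be run on a live replica of those ranges.
    A wrap-around range (start > end) may need to be split or handled
    differently depending on the Cassandra version; verify before using it.
    """
    return [
        f"nodetool repair -st {start} -et {end} {keyspace}"
        for start, end in primary_ranges(ring, down_node)
    ]


def overdue(last_repaired_at: float, now: float, gc_grace_seconds: int = 864000) -> bool:
    """True if a range has gone longer than gc_grace_seconds without a repair."""
    return now - last_repaired_at > gc_grace_seconds


if __name__ == "__main__":
    # Made-up three-node ring with one token per node.
    ring = {0: "nodeA", 100: "nodeB", 200: "nodeC"}
    for cmd in subrange_repair_commands(ring, "nodeB", "my_keyspace"):
        print(cmd)
```

With vnodes a down node owns many tokens, so the list of commands grows accordingly; the idea is the same, only the number of ranges changes. The overdue check is there only to illustrate that the deadline applies per range, not per node.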