cassandra-user mailing list archives

From Anuj Wadehra
Subject Run Repairs when a Node is Down
Date Sun, 17 Jan 2016 04:33:26 GMT
We are on 2.0.14 with RF=3 in a 3-node cluster, and we use repair -pr. Recently, we observed that
repair -pr fails on every node if any node is down. Then I found the JIRA
where an intentional decision was taken to abort the repair if a replica is down.
I need to understand the reasoning behind aborting the repair instead of proceeding with the available
replicas.

I have the following concerns with this approach:
We say that we have a fault-tolerant Cassandra system that can afford a single node
failure because RF=3 and we read/write at QUORUM. But when a node goes down and we are not
sure how long it will take to restore it, the health of the entire system is in question:
the gc grace period (gc_grace_seconds) is running out and we cannot run repair -pr on any of the nodes.
Then there is a dilemma:

1. Remove the faulty node well before the gc grace period expires, so that we get enough time to
   save data by repairing the other two nodes? This may cause massive streaming, which may be
   unnecessary if we are able to bring the faulty node back up before the gc grace period expires.
2. Wait and hope that the issue will be resolved before the gc grace period expires, leaving us some
   buffer to run repair -pr on all nodes?
3. Increase the gc grace period temporarily? Then our capacity planning must accommodate
   the extra storage needed for the longer gc grace period in node failure scenarios.
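
To make the timing pressure behind these three options concrete, here is a minimal Python sketch of the arithmetic. It is only an illustration under stated assumptions: gc_grace_seconds is taken to be the Cassandra default of 864000 (10 days), repairs are assumed to run one node at a time, and the function name, the last-repair date, and the 12-hour repair duration are all made up for the example.

```python
from datetime import datetime, timedelta

# Cassandra's default gc_grace_seconds (10 days); a table may override it.
DEFAULT_GC_GRACE_SECONDS = 864000


def repair_window(last_repair, now, gc_grace_seconds, repair_duration, nodes_to_repair):
    """Hypothetical helper: estimate how long we can wait before repairing.

    Assumes every node must finish a repair within gc_grace_seconds of the
    last completed repair, and that nodes are repaired sequentially.
    Returns (deadline, latest_start, slack).
    """
    deadline = last_repair + timedelta(seconds=gc_grace_seconds)
    # Sequential repairs: total time is per-node duration times node count.
    latest_start = deadline - repair_duration * nodes_to_repair
    slack = latest_start - now
    return deadline, latest_start, slack


# Illustrative numbers only (not taken from the cluster in question).
deadline, latest_start, slack = repair_window(
    last_repair=datetime(2016, 1, 10),
    now=datetime(2016, 1, 17, 4, 33, 26),
    gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS,
    repair_duration=timedelta(hours=12),
    nodes_to_repair=2,  # the two surviving nodes after removing the faulty one
)
print("deadline:", deadline)
print("latest start:", latest_start)
print("slack:", slack)
```

A negative slack would mean waiting (option 2) is no longer viable, leaving only removal of the node (option 1) or a larger gc_grace_seconds (option 3).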

Besides knowing the reasoning behind the decision taken in CASSANDRA-2290, I need to understand
the recommended approach for maintaining a fault-tolerant system that can handle node failures,
such that repair can be run smoothly and system health is maintained at all times.

Thanks
Anuj