On Fri, Aug 2, 2013 at 3:28 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:

We currently run automated repairs sequentially on all the nodes. However, as we grow the cluster, we now need to run repair on multiple nodes in parallel to be able to finish within gc_grace_seconds.

Or you could just increase gc_grace_seconds from the arbitrary and IMO unreasonably low default of 10 days.

Before I write the script, I was wondering if somebody already has a tool or a script that figures out which nodes we can safely run repairs on in parallel. For instance, we wouldn't run repair in parallel on nodes that are replicas of each other.

This will only really work without virtual nodes, and only if you repair a whole hardware node at a time. With 256 virtual nodes per node, your repair overhead will be evenly distributed across the cluster anyway.

Someone has probably written the script, but if I were you I would consider whether you really want to monitor N/RF fragile and independent repair sessions simultaneously before using such a script.
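
For what it's worth, the grouping itself is not the hard part on a non-vnode ring. Below is a minimal sketch, not a tested tool. It assumes one token per node, SimpleStrategy-style placement (replicas are the next RF-1 nodes clockwise), and that each session is a primary-range repair (nodetool repair -pr), so two sessions collide only when the nodes are fewer than RF positions apart on the ring. The function name and node names are made up for illustration, and you would have to build the ring-ordered list yourself.

```python
def parallel_repair_batches(ring, rf):
    """Split nodes (given in ring/token order) into batches.

    Assumption: two nodes conflict when their circular distance on the
    ring is less than rf, i.e. their replica sets overlap.
    """
    if rf < 1:
        raise ValueError("rf must be at least 1")
    n = len(ring)
    batches = []  # each batch is a list of ring positions
    for i in range(n):
        for batch in batches:
            # Node i may join a batch only if it is at least rf
            # positions away (in either direction) from every member.
            if all(min((i - j) % n, (j - i) % n) >= rf for j in batch):
                batch.append(i)
                break
        else:
            # No existing batch is safe for this node; start a new one.
            batches.append([i])
    return [[ring[i] for i in batch] for batch in batches]


if __name__ == "__main__":
    nodes = ["n0", "n1", "n2", "n3", "n4", "n5"]  # hypothetical, in token order
    for batch in parallel_repair_batches(nodes, rf=3):
        print(batch)
    # prints ['n0', 'n3'], then ['n1', 'n4'], then ['n2', 'n5']
```

The greedy pass is not guaranteed to produce the fewest possible batches when N is not a multiple of RF, and it knows nothing about racks or multiple datacenters. You would run one batch at a time and wait for every session in it to finish (or fail) before starting the next, which is exactly the monitoring burden I mean above.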

=Rob