cassandra-commits mailing list archives

From "Marcus Olsson (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-10070) Automatic repair scheduling
Date Fri, 05 Feb 2016 14:15:40 GMT


Marcus Olsson commented on CASSANDRA-10070:

[~yukim] [~pauloricardomg] Thanks for the comments, great questions/suggestions!

Regarding your questions about the locking:
* What would "lock resource" be like for repair scheduling? I think the value controls the number
of repair jobs running at a given time in the whole cluster, since we don't want to run too many
repair jobs at once.
* I second Yuki Morishita's first question above, in that we need to better specify how
cluster-wide repair parallelism is handled: is it fixed or configurable? Can a node run repair
for multiple ranges in parallel? Perhaps we should have a node_repair_parallelism (default
1) and dc_repair_parallelism (default 1) global config and reject starting repairs above those limits.
The thought with the lock resource was that it could be something simple, like a table defined as:
CREATE TABLE lock (
    resource text PRIMARY KEY
);
And then the different nodes would try to get the lock using LWT with TTL:
INSERT INTO lock (resource) VALUES ('RepairResource') IF NOT EXISTS USING TTL 30;
After that the node would have to keep updating the locked resource while running the
repair, to prevent someone else from acquiring it. The value "RepairResource" could
just as easily be defined as "RepairResource-N", so that it would be possible to allow repairs
to run in parallel.
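To make the intended semantics concrete, here is a minimal in-memory sketch of the lock behavior (the RepairLock class, method names, and the 30-second TTL are illustrative stand-ins for the CQL table and LWT calls, not actual Cassandra or driver API):

```python
import time

# Hypothetical in-memory stand-in for the proposed lock table
# (resource text PRIMARY KEY), written with IF NOT EXISTS USING TTL 30.
class RepairLock:
    def __init__(self, ttl=30.0):
        self.ttl = ttl
        self._held = {}  # resource -> (holder, expiry timestamp)

    def acquire(self, resource, holder, now=None):
        """Emulates INSERT ... IF NOT EXISTS USING TTL: succeeds only if
        the row is absent or its TTL has expired."""
        now = time.time() if now is None else now
        current = self._held.get(resource)
        if current is None or current[1] <= now:
            self._held[resource] = (holder, now + self.ttl)
            return True
        return False

    def renew(self, resource, holder, now=None):
        """The holder re-writes the row before the TTL expires, keeping
        others from taking over mid-repair."""
        now = time.time() if now is None else now
        current = self._held.get(resource)
        if current and current[0] == holder and current[1] > now:
            self._held[resource] = (holder, now + self.ttl)
            return True
        return False

lock = RepairLock(ttl=30.0)
assert lock.acquire('RepairResource-0', 'node1', now=0.0)       # node1 wins the slot
assert not lock.acquire('RepairResource-0', 'node2', now=10.0)  # node2 is blocked
assert lock.renew('RepairResource-0', 'node1', now=25.0)        # keep-alive before TTL
assert lock.acquire('RepairResource-1', 'node2', now=10.0)      # "-N" naming allows parallelism
```

If node1 stops renewing (e.g. it crashes mid-repair), the TTL expires and another node can take over the slot, which is exactly the failure behavior we want from the TTL.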

A problem with this table is that if we have a setup with two data centers and three replicas
in each data center, then we have a total of six replicas and QUORUM would require four replicas
to succeed. This would require that both data centers are available to be able to run repair.
Since some of the keyspaces might not be replicated across both data centers, we would still
have to be able to run repair even if one of the data centers is unavailable. This also applies
if we want to "force" local dc repairs when a data center has been unavailable for too long. There
are two options, as I see it, on how to solve this:
* Get the lock with local_serial during these scenarios.
* Have a separate lock table for each data center *and* a global one.
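The arithmetic behind the problem, spelled out (the dict and variable names are just for illustration):

```python
# Quorum math for the scenario above: RF=3 in each of two data centers.
replicas_per_dc = {'dc1': 3, 'dc2': 3}

total = sum(replicas_per_dc.values())            # 6 replicas cluster-wide
quorum = total // 2 + 1                          # SERIAL quorum across the cluster
local_quorum = replicas_per_dc['dc1'] // 2 + 1   # LOCAL_SERIAL quorum within one DC

print(quorum)        # 4 -> more than one DC can provide, so both DCs must be up
print(local_quorum)  # 2 -> satisfiable with a single DC alone
```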

I guess the easiest solution would be to use local_serial, but I'm not sure whether it might cause
some unexpected behavior. Going for the other option with separate tables would
probably increase the overall complexity, but it would make it easier to restrict the number
of parallel repairs in a single data center.

Just a question regarding your suggestion of node_repair_parallelism. Should it be
used to specify the number of repairs a node can initiate or how many repairs the node can
be an active part of in parallel? I guess the second alternative would be harder to implement,
but it is probably what one would expect.
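The difference between the two interpretations can be sketched like this (the node names, range placement, and helper functions are all made up for the example):

```python
# Two possible meanings of node_repair_parallelism, illustrated.
active_repairs = [
    # (initiating node, replica set of the range being repaired)
    ('node1', {'node1', 'node2', 'node3'}),
    ('node4', {'node3', 'node4', 'node5'}),
]

def initiated_by(node):
    """First interpretation: repairs this node started."""
    return sum(1 for coordinator, _ in active_repairs if coordinator == node)

def participating_in(node):
    """Second interpretation: repairs this node is streaming/validating for."""
    return sum(1 for _, replicas in active_repairs if node in replicas)

print(initiated_by('node3'))      # 0 -> within a limit of 1 under the first reading
print(participating_in('node3'))  # 2 -> over a limit of 1 under the second reading
```

As the example shows, node3 initiates nothing yet is a replica in both repairs, which is why the second interpretation is harder to enforce: the coordinator would need cluster-wide knowledge of every replica's current involvement before starting a new repair.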


* It seems the scheduling only makes sense for repairing the primary range of the node ('nodetool
-pr'), since we end up repairing all nodes eventually. Are you considering other options like
subrange ('nodetool -st -et') repair?
* For subrange repair, we could maybe have something similar to reaper's segmentCount option,
but since this would add more complexity we could leave it for a separate ticket.

It should be possible to extend the repair scheduler with subrange repairs, either by having
it as an option per table or by having a separate scheduler for it. The separate scheduler
would just be another plugin that could replace the default repair scheduler. If we go for
a table configuration it could be that the user either specifies pr or the number of segments
to divide the token range in, something like:
repair_options = {..., token_division='pr'}; // Use primary range repair
repair_options = {..., token_division='2048'}; // Divide the token range in 2048 slices
If we had a separate scheduler, this could just be a configuration option for it. Personally
I would prefer to have it all in a single scheduler, and I agree that it should probably be
a separate ticket to keep the complexity of the base scheduler to a minimum. But I think this
is a feature that will be very much needed, both with non-vnode token assignment and with
the possibility of reducing the number of vnodes as of CASSANDRA-7032.
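A sketch of what token_division='2048' might do, assuming the full signed 64-bit Murmur3 token space (the helper name and half-open segment convention are assumptions, not the proposed API):

```python
# Divide the full Murmur3 token range into N contiguous half-open segments,
# suitable as start/end bounds for subrange repair.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def split_token_range(segments):
    span = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 tokens in total
    bounds = [MIN_TOKEN + span * i // segments for i in range(segments + 1)]
    # Each segment is a half-open slice [start, end); adjacent segments
    # share a boundary so the whole ring is covered with no overlap.
    return [(bounds[i], bounds[i + 1]) for i in range(segments)]

slices = split_token_range(2048)
print(len(slices))        # 2048
print(slices[0][0])       # -9223372036854775808
print(slices[-1][1] - 1)  # 9223372036854775807
```

The integer division keeps the bounds exact even when the token span is not evenly divisible by the segment count, so no tokens are dropped or counted twice.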


* While pausing repair is a nice feature for user-based interruptions, we could probably embed
system known interruptions (such as when a bootstrap or upgrade is going on) in the default
rejection logic.

Agreed, are there any other scenarios that we might have to take into account?

> Automatic repair scheduling
> ---------------------------
>                 Key: CASSANDRA-10070
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Olsson
>            Assignee: Marcus Olsson
>            Priority: Minor
>             Fix For: 3.x
>         Attachments: Distributed Repair Scheduling.doc
> Scheduling and running repairs in a Cassandra cluster is most often a required task,
but this can be hard for new users and it also requires a bit of manual configuration.
There are good tools out there that can be used to simplify things, but wouldn't this be a
good feature to have inside of Cassandra? To automatically schedule and run repairs, so that
when you start up your cluster it basically maintains itself in terms of normal anti-entropy,
with the possibility for manual configuration.

This message was sent by Atlassian JIRA
