cassandra-commits mailing list archives

From "Alexander Dejanovski (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-14346) Scheduled Repair in Cassandra
Date Fri, 30 Mar 2018 13:50:00 GMT


Alexander Dejanovski commented on CASSANDRA-14346:

I really like the idea of making repair something that is coordinated by the cluster instead of being node-centric as it is today. This is how it should be implemented, and external tools should only add features on top of it; nodetool really should be doing this by default.
I broadly agree with the state machine that is detailed (I haven't spent that much time on it though...).

I disagree with point 6 of the doc's Resiliency section, which says that adding nodes won't impact the repair: it will change the token ranges, and some of the splits will then spread across different replicas, which makes them unsuitable for repair (think of clusters with 256 vnodes per node). You either have to cancel the repair or recompute the remaining splits to move on with the repair, as sketched below.
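
To illustrate the check I have in mind, here is a minimal sketch with hypothetical types (nothing here is the proposal's API): a split generated before the topology change is only still repairable if every token range in it maps to the same replica set as when it was generated.

    import java.net.InetAddress;
    import java.util.List;
    import java.util.Set;

    class SplitValidator {
        // Hypothetical accessors; a real implementation would read these from
        // its repair state table and from the live token metadata.
        interface Split {
            List<String> tokenRanges();            // ranges captured at schedule time
            Set<InetAddress> replicasAtSchedule(); // replica set captured at schedule time
        }

        interface TokenMetadataView {
            Set<InetAddress> currentReplicas(String tokenRange);
        }

        static boolean stillRepairable(Split split, TokenMetadataView tm) {
            for (String range : split.tokenRanges()) {
                if (!tm.currentReplicas(range).equals(split.replicasAtSchedule()))
                    return false; // a bootstrap/decommission moved this range: recompute or cancel
            }
            return true;
        }
    }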

I would add a feature to your nodetool repairstatus command that allows listing only the currently running repairs.
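
Something along these lines, where the flag name is purely illustrative and not part of the current proposal:

    nodetool repairstatus --running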

Then, I think the approach of implementing a fully automated, seamless, continuous repair that "just works" without user intervention is unsafe in the wild; there are too many caveats. There are many different types of clusters out there, and some of them just cannot run repair without careful tuning and monitoring (if at all).
The current design shows no backpressure mechanism to ensure that further repair sequences won't harm the cluster because it's already running late on compactions (whether due to overstreaming, entropy, or just the activity of the cluster). A sketch of such a check follows.
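
As an example of the kind of backpressure I mean, here is a minimal sketch (my own, not from the proposal) that gates the next sequence on the pending-compactions backlog, read through the standard org.apache.cassandra.metrics:type=Compaction,name=PendingTasks gauge over JMX; the threshold is an arbitrary assumption.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class RepairBackpressure {
        private static final int MAX_PENDING_COMPACTIONS = 20; // assumed threshold

        // Returns true if the node's compaction backlog is small enough to
        // safely start the next repair sequence.
        public static boolean safeToStartSequence(String host, int jmxPort) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                String.format("service:jmx:rmi:///jndi/rmi://%s:%d/jmxrmi", host, jmxPort));
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName pendingTasks = new ObjectName(
                    "org.apache.cassandra.metrics:type=Compaction,name=PendingTasks");
                // Cassandra exposes this metric as a gauge; "Value" is the backlog size.
                Number backlog = (Number) mbs.getAttribute(pendingTasks, "Value");
                return backlog.intValue() < MAX_PENDING_COMPACTIONS;
            }
        }
    }
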
Repairing table by table will also add a lot of overhead compared to repairing a list of tables (or all of them) in a single session, unless multiple simultaneous repairs on a node are allowed, which in turn makes it impossible to safely terminate a single repair.
It is also unclear in the current design whether repair can be disabled for selected tables (like "type: none").
The proposal doesn't seem to involve any change to how "nodetool repair" behaves. Will it be changed to use the state machine and coordinate throughout the cluster?

Trying to replace external tools with built-in features has its limits, I think, and the current design gives only limited control to such external tools (be it Reaper, the DataStax repair service, Priam, or others).
To make an analogy that was seen recently on the mailing list, it's as if you implemented automatic spreading of configuration changes from within Cassandra instead of relying on tools like Chef or Puppet.
You'll still need global tools to manage repairs over several clusters anyway, which a Cassandra built-in feature cannot (and should not) provide.

My point is that making repair smarter and coordinated within Cassandra is a great idea, and I support it 100%, but the current design makes it too automated, and the defaults could easily lead to severe performance problems without the user triggering anything.
I also don't know how it could be made to work alongside user-defined repairs, as you'll need to force terminate some sessions.

To summarize, I would put aside the scheduling features and implement the coordinated repairs by splits within Cassandra. The StorageServiceMBean should evolve to allow manually setting the number of splits per node, or to rely on a number of splits generated by Cassandra itself. It should then also be possible to track progress externally by listing splits (sequences) through JMX, and to pause/resume selected repair runs; a sketch of such an interface follows.
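
Roughly the kind of JMX surface I have in mind, assuming it would live next to (or on) StorageServiceMBean; every method name here is illustrative, not part of the current proposal:

    import java.util.List;
    import java.util.Map;

    public interface RepairCoordinationMBean {
        // Override the number of splits generated per node (0 = let Cassandra decide).
        void setRepairSplitsPerNode(int splits);

        // One row per split/sequence so external tools can track progress,
        // e.g. {id, keyspace, ranges, state, startedAt}.
        List<Map<String, String>> listRepairSequences();

        // Let operators (or external tools) pause/resume a selected repair run.
        void pauseRepair(String sequenceId);
        void resumeRepair(String sequenceId);
    }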

Also, the current design should evolve to allow a single sequence to include multiple token ranges. We have a feature waiting to be merged in Reaper that groups token ranges that have the same replicas, in order to reduce the overhead of vnodes. Starting with 3.0, repair jobs can be triggered with multiple token ranges, and they will be executed as a single session if the replicas are the same for all of them. So, to avoid having to change the data model in the future, I'd suggest storing a list of token ranges instead of just one; the grouping itself is sketched below.
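
The grouping boils down to bucketing ranges by replica set; here is a minimal sketch with hypothetical types (the Reaper code is more involved, this is just the idea):

    import java.net.InetAddress;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.function.Function;

    class RangeGrouper {
        // Buckets token ranges by their replica set; each bucket can then be
        // submitted as a single repair session on 3.0+, since all its ranges
        // share the same replicas.
        static <R> Map<Set<InetAddress>, List<R>> groupByReplicas(
                List<R> ranges, Function<R, Set<InetAddress>> replicasOf) {
            Map<Set<InetAddress>, List<R>> groups = new HashMap<>();
            for (R range : ranges)
                groups.computeIfAbsent(replicasOf.apply(range), k -> new ArrayList<>())
                      .add(range);
            return groups;
        }
    }
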
Repair events should also be tracked in a separate table, to avoid overwriting the last event each time (one thing Reaper currently sucks at as well).

I'll go back to the document soon and add my comments there.



> Scheduled Repair in Cassandra
> -----------------------------
>                 Key: CASSANDRA-14346
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>         Attachments: ScheduledRepairV1_20180327.pdf
> There have been many attempts to automate repair in Cassandra, which makes sense given
that it is necessary to give our users eventual consistency. Most recently CASSANDRA-10070,
CASSANDRA-8911 and CASSANDRA-13924 have all looked for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), which
we spoke about last year at NGCC. Given the positive feedback at NGCC we focussed on getting
it production ready and have now been using it in production to repair hundreds of clusters,
tens of thousands of nodes, and petabytes of data for the past six months. Also based on feedback
at NGCC we have invested effort in figuring out how to integrate this natively into Cassandra
rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our implementation into
Cassandra, and have created a [design document|]
showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would be greatly
appreciated about the interface or v1 implementation features. I have tried to call out in
the document features which we explicitly consider future work (as well as a path forward
to implement them in the future) because I would very much like to get this done before the
4.0 merge window closes, and to do that I think aggressively pruning scope is going to be
a necessity.
