cassandra-commits mailing list archives

From "Paulo Motta (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10070) Automatic repair scheduling
Date Tue, 23 Feb 2016 14:36:18 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158968#comment-15158968 ]

Paulo Motta commented on CASSANDRA-10070:
-----------------------------------------

bq. But in that case the pause/stop feature should be implemented as early as possible to
avoid having an upgrade scenario that requires the user to upgrade to the version that introduces
the pause feature before upgrading to the latest. Another way would be to have the "system
interrupts" feature in place early, so that the repairs would be paused during an upgrade.

Sounds good! We could ask the user to pause, but I think doing that automatically via "system
interrupts" is better. It just occurred to me that both the "pause" and "system interrupts"
features will prevent new repairs from starting, but what about already running repairs? We will
probably want to interrupt those as well in some situations, so CASSANDRA-3486 is also relevant
here (adding it as a dependency of this ticket).
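
Just to make the pause/interrupt interaction concrete, here is a rough sketch of how a scheduler
could handle both cases. All class and method names are hypothetical placeholders, not existing
Cassandra APIs:

{code:java}
// Rough sketch only; RepairSession, ScheduledRepairRunner and their methods are
// hypothetical placeholders, not existing Cassandra classes.
public class ScheduledRepairRunner
{
    public interface RepairSession
    {
        void run();
        void abort(); // cooperative cancellation, in the spirit of CASSANDRA-3486
    }

    private volatile boolean paused = false;        // set by an explicit pause or a "system interrupt"
    private volatile RepairSession current = null;  // handle to the session currently running

    /** Stop scheduling new repairs; optionally abort the one in flight. */
    public void pause(boolean abortRunning)
    {
        paused = true;
        RepairSession session = current;
        if (abortRunning && session != null)
            session.abort();
    }

    public void resume()
    {
        paused = false;
    }

    /** Called by the scheduler when the next repair is due. */
    public void maybeRun(RepairSession next)
    {
        if (paused)
            return;        // a pause or system interrupt only blocks *new* sessions here
        current = next;
        try
        {
            next.run();
        }
        finally
        {
            current = null;
        }
    }
}
{code}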

bq. I think the timeout might be good to have to prevent a hang from stopping the entire repair
process. But I think it would only work if the repair would only hang occasionally, otherwise
the same repair would be retried until it is marked as a "fail". 

+1. Then I think we should either have a timeout or add the ability to cancel/interrupt a running
scheduled repair in the initial version, so that hanging repairs don't render the automatic
repair scheduling useless.
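
For reference, a minimal sketch of what a per-session timeout could look like, assuming the
session is wrapped in a Runnable and a hypothetical 12-hour default; on timeout the session is
interrupted and reported as failed so the scheduler can move on:

{code:java}
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RepairTimeoutGuard
{
    private static final long REPAIR_TIMEOUT_HOURS = 12; // hypothetical default, would be configurable

    /** Runs the repair session and returns true only if it completed within the timeout. */
    public static boolean runWithTimeout(Runnable repairSession) throws InterruptedException
    {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> result = executor.submit(repairSession);
        try
        {
            result.get(REPAIR_TIMEOUT_HOURS, TimeUnit.HOURS);
            return true;
        }
        catch (ExecutionException e)
        {
            return false;                // session threw; mark as "fail" and let retry handling decide
        }
        catch (TimeoutException e)
        {
            result.cancel(true);         // interrupt the hanging session instead of blocking forever
            return false;
        }
        finally
        {
            executor.shutdownNow();
        }
    }
}
{code}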

bq. Another option is to have a "slow repair"-detector that would log a warning if a repair
session is taking too long, to avoid aborting it if it's actually repairing and leaving
it up to the user to handle it. Either way I'd say it's out of the scope of the initial version.

bq. We might also want to be able to detect if it would be impossible to repair the whole
cluster within gc grace and report it to the user. This could happen for multiple reasons
like too many tables, too many nodes, too few parallel repairs or simply overload. I guess
it would be hard to make accurate predictions with all of these variables so it might be good
enough to check through the history of the repairs, do an estimation of the time and compare
it to gc grace? I think this is something out of scope for the first version, but I thought
I'd just mention it here to remember it.

Nice! These could probably live in a separate repair metrics and alerts module in the future,
allowing users to track statistics and issue alerts/warnings based on history, and allowing the
scheduler to perform more advanced adaptive scheduling. Some metrics to track (a rough gc_grace
feasibility sketch follows the list):
* Repair time per session
** Breakdown of time per phase (validation, sync, anticompaction, etc.)
* Repair time per node
* Validation mismatch %
* Fail count
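
A back-of-the-envelope version of the gc_grace feasibility check mentioned above could look like
the sketch below. The inputs (session history, sessions per full cycle, gc_grace_seconds) are
assumed to come from the maintenance history and metrics listed here; all names are illustrative,
not an actual implementation:

{code:java}
import java.util.List;

public class RepairCycleEstimator
{
    /**
     * Estimates whether a full repair cycle fits within gc_grace, based on the
     * observed durations of recently completed sessions.
     *
     * @param recentSessionDurationsMillis durations of recently completed repair sessions
     * @param sessionsPerFullCycle number of sessions needed to cover every node/table/range once
     * @param gcGraceSeconds smallest gc_grace_seconds among the repaired tables
     */
    public static boolean fitsWithinGcGrace(List<Long> recentSessionDurationsMillis,
                                            int sessionsPerFullCycle,
                                            int gcGraceSeconds)
    {
        if (recentSessionDurationsMillis.isEmpty())
            return true; // no history yet, nothing to estimate from

        double avgSessionMillis = recentSessionDurationsMillis.stream()
                                                              .mapToLong(Long::longValue)
                                                              .average()
                                                              .orElse(0);
        double estimatedCycleSeconds = avgSessionMillis * sessionsPerFullCycle / 1000.0;
        return estimatedCycleSeconds <= gcGraceSeconds; // if false, warn the user
    }
}
{code}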

bq. Should we maybe compile a list of "features that should be in the initial version" and
also an "improvements" list for future work to make the scope clear?

Sounds good! Below is a suggested list of subtasks:

* Basic functionality
** Resource locking API and implementation (rough API sketch at the end of this comment)
** Maintenance scheduling API and metadata
** Basic scheduling support
** Polling and monitoring module
** Pausing and aborting support
** Rejection policies (includes system interrupts and maintenance windows)
** Failure handling and retry
** Configuration support
** Frontend support (table options, management commands)

* Optional/deferred functionality
** Parallel repair session support
** Subrange repair support
** Maintenance history
** Timeout
** Metrics
** Alerts

WDYT? Feel free to update or break these up into smaller or larger subtasks, and then create the
actual subtasks to start work on them.
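
To make the "Resource locking API" subtask a bit more concrete, here is a rough interface sketch
of a lease-style, cluster-wide lock that a node would need to hold while repairing, so that only
a bounded number of repairs run at once. None of these types exist today; the shape is only a
suggestion:

{code:java}
import java.io.Closeable;
import java.util.Optional;

/**
 * Hypothetical API sketch for the "resource locking" subtask: a lease-based,
 * cluster-wide lock acquired before a node starts a repair session.
 */
public interface ResourceLockFactory
{
    /**
     * Try to acquire a cluster-wide lock on the named resource (e.g. "repair"),
     * with a priority so that nodes that have waited longest can win ties.
     * Returns empty if the lock is currently held elsewhere.
     */
    Optional<ResourceLock> tryLock(String resource, int priority);

    interface ResourceLock extends Closeable
    {
        /** Renew the lease; returns false if the lock was lost (e.g. the holder was considered dead). */
        boolean renew();

        /** Release the lock so the next node can repair. */
        @Override
        void close();
    }
}
{code}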

> Automatic repair scheduling
> ---------------------------
>
>                 Key: CASSANDRA-10070
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10070
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Olsson
>            Assignee: Marcus Olsson
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: Distributed Repair Scheduling.doc
>
>
> Scheduling and running repairs in a Cassandra cluster is most often a required task, but
> it can be hard for new users and it also requires a bit of manual configuration. There are
> good tools out there that can be used to simplify things, but wouldn't this be a good feature
> to have inside of Cassandra? It could automatically schedule and run repairs, so that when
> you start up your cluster it basically maintains itself in terms of normal anti-entropy,
> with the possibility for manual configuration.



