cassandra-commits mailing list archives

From "Paulo Motta (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11190) Fail fast repairs
Date Fri, 19 Feb 2016 13:13:18 GMT


Paulo Motta commented on CASSANDRA-11190:

A question that arises: should this be the default and only behavior, or should we keep allowing
the previous behavior of letting sync/validation progress if repair fails on an unrelated

On CASSANDRA-5426 [~yukim] suggested having a {{--keep-going}} flag to allow the previous
behavior. IMO we should always fail fast to prevent unexpected conditions.
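
To make the two options concrete, here is a minimal sketch in Python. It is not Cassandra code: Participant, Coordinator, on_failure and simulate are hypothetical names invented for illustration, and keep_going only mirrors the idea of the suggested {{--keep-going}} flag. It assumes a coordinator A with participants B and C, where B fails one phase of the repair.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical stand-in for a node taking part in a repair session."""
    name: str
    state: str = "running"  # one of: running, failed, aborted

    def abort(self) -> None:
        # Only work that is still in progress can be interrupted.
        if self.state == "running":
            self.state = "aborted"


@dataclass
class Coordinator:
    """Hypothetical coordinator; keep_going mirrors the suggested flag."""
    participants: dict
    keep_going: bool = False
    session_failed: bool = False

    def on_failure(self, failed_name: str) -> None:
        self.participants[failed_name].state = "failed"
        self.session_failed = True
        if self.keep_going:
            # Previous behavior: the other nodes are not notified.
            return
        # Fail fast: tell every participant to stop its remaining work.
        for node in self.participants.values():
            node.abort()


def simulate(keep_going: bool) -> dict:
    """Node B fails; return the resulting state of each participant."""
    nodes = {name: Participant(name) for name in ("B", "C")}
    coordinator = Coordinator(nodes, keep_going=keep_going)
    coordinator.on_failure("B")
    return {name: node.state for name, node in nodes.items()}


print(simulate(keep_going=False))  # {'B': 'failed', 'C': 'aborted'}
print(simulate(keep_going=True))   # {'B': 'failed', 'C': 'running'}
```

In the keep_going case node C is left running after the session has already failed, which is the situation the ticket wants to avoid before another repair is started on the same range.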

> Fail fast repairs
> -----------------
>                 Key: CASSANDRA-11190
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>            Priority: Minor
> Currently, if one node fails any phase of the repair (validation, streaming), the repair
> session is aborted, but the other nodes are not notified and keep doing either validation
> or syncing with other nodes.
> With CASSANDRA-10070 automatically scheduling repairs and potentially scheduling retries,
> it would be nice to make sure all nodes abort failed repairs in order to be able to start
> other repairs safely on the same nodes.
> From CASSANDRA-10070:
> bq. As far as I understood, if there are nodes A, B, C running repair, A is the coordinator.
> If validation or streaming fails on node B, the coordinator (A) is notified and fails the
> repair session, but node C will remain doing validation and/or streaming, which could cause
> problems (or increased load) if we start another repair session on the same range.
> bq. We will probably need to extend the repair protocol to perform this cleanup/abort
> step on failure. We already have a legacy cleanup message that doesn't seem to be used in
> the current protocol that we could maybe reuse to clean up repair state after a failure. This
> repair abortion will probably overlap with CASSANDRA-3486. In any case, this is
> a separate (but related) issue and we should address it in an independent ticket, and make
> this ticket dependent on that.
> On CASSANDRA-5426 [~slebresne] suggested doing this to avoid unexpected conditions/hangs:
> bq. I wonder if maybe we should have more of a fail-fast policy when there are errors.
> For instance, if one node fails its validation phase, maybe it might be worth failing right
> away and letting the user re-trigger a repair once he has fixed whatever was the source of the
> error, rather than still differencing/syncing the other nodes.
> bq. Going a bit further, I think we should add 2 messages to interrupt the validation
> and sync phases. If only because that could be useful to users if they need to stop a repair
> for some reason, but also, if we get an error during validation from one node, we could use
> that to interrupt the other nodes and thus fail fast while minimizing the amount of work done

This message was sent by Atlassian JIRA
