cassandra-commits mailing list archives

From "Yuki Morishita (JIRA)" <>
Subject [jira] [Updated] (CASSANDRA-9097) Repeated incremental nodetool repair results in failed repairs due to running anticompaction
Date Fri, 15 May 2015 00:13:01 GMT


Yuki Morishita updated CASSANDRA-9097:
    Attachment: 0001-Remove-parent-session-on-remotes-when-repair-fails.patch

When a repair session fails, we only remove the coordinator's parent repair session.
Currently, the parent repair session is removed only when an exception is thrown from ANTIENTROPY_STAGE,
but validation and streaming happen on separate threads, so we have to clean those up separately.

I introduced a new CleanupMessage and send it only to the nodes that pass the version check, so
adding the new message should be fine.
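
As a rough illustration of the version-gated send described above (not the code in the attached patch), here is a minimal Python sketch. Every name in it is hypothetical: `MIN_CLEANUP_VERSION`, `version_of`, `send`, and the message layout are placeholders, and the real patch works against Cassandra's own messaging classes.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical threshold: the lowest messaging version assumed to understand
# the new cleanup message. The real value lives in the patch, not here.
MIN_CLEANUP_VERSION = 8


def cleanup_targets(
    participants: Iterable[str],
    version_of: Callable[[str], int],
    min_version: int = MIN_CLEANUP_VERSION,
) -> List[str]:
    """Return only the participants new enough to receive the cleanup message."""
    return [node for node in participants if version_of(node) >= min_version]


def notify_repair_failed(
    parent_session: str,
    participants: Iterable[str],
    version_of: Callable[[str], int],
    send: Callable[[str, Dict[str, str]], None],
    min_version: int = MIN_CLEANUP_VERSION,
) -> List[str]:
    """Send a cleanup message to every eligible participant; return who was notified."""
    targets = cleanup_targets(participants, version_of, min_version)
    for node in targets:
        # Hypothetical payload: just enough to identify the parent session to drop.
        send(node, {"type": "cleanup", "parent_session": parent_session})
    return targets
```

Older nodes are simply skipped in this sketch, so they never see a message type they cannot deserialize.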

Note that this is not an issue for 2.2+, since we are sending succeeded repair ranges,
though we need to add the new message to trunk for compatibility.

I will (try to) write a dtest to cover this scenario, but I am submitting the patch first for review.

> Repeated incremental nodetool repair results in failed repairs due to running anticompaction
> --------------------------------------------------------------------------------------------
>                 Key: CASSANDRA-9097
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Gustav Munkby
>            Assignee: Yuki Morishita
>            Priority: Minor
>             Fix For: 2.2 beta 1, 2.1.6
>         Attachments: 0001-Remove-parent-session-on-remotes-when-repair-fails.patch, 0001-Wait-for-anticompaction-to-finish.patch
> I'm trying to synchronize incremental repairs over multiple nodes in a Cassandra cluster,
and it does not seem to be easily achievable.
> In principle, the process iterates through the nodes of the cluster and performs `nodetool
-h $NODE repair --incremental`, but that sometimes fails on subsequent nodes. The reason for
failing seems to be that the repair returns as soon as the repair and the _local_ anticompaction
have completed, but does not guarantee that remote anticompactions are complete. If I subsequently
try to issue another repair command, it fails to start (and terminates with failure after
about one minute). It usually isn't a problem, as the local anticompaction typically involves
as much (or more) data as the remote ones, but sometimes not.
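
The node-by-node loop described in the report can be sketched as follows. This is a minimal Python sketch, not the reporter's actual script: the `nodetool` invocation is the one quoted above, while the retry budget, the delay, and the assumption that `nodetool` exits non-zero when a repair fails are all hypothetical and may not hold for every version. Retrying after a pause is only one possible client-side workaround for remote anticompaction still running; it is not something the report or the patch proposes.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def run_incremental_repair(node: str) -> int:
    """Run the command quoted in the report against one node; return its exit code."""
    return subprocess.call(["nodetool", "-h", node, "repair", "--incremental"])


def repair_cluster(
    nodes: Iterable[str],
    run: Callable[[str], int] = run_incremental_repair,
    retries: int = 3,             # hypothetical retry budget per node
    delay_seconds: float = 60.0,  # hypothetical pause before retrying
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Repair nodes one at a time; return the nodes whose repair never succeeded."""
    failed = []
    for node in nodes:
        for attempt in range(1 + retries):
            # Assumption: exit code 0 means the repair succeeded.
            if run(node) == 0:
                break
            if attempt < retries:
                # Give remote anticompactions time to finish before retrying.
                sleep(delay_seconds)
        else:
            # The inner loop never hit `break`, so every attempt failed.
            failed.append(node)
    return failed
```

Calling `repair_cluster(["node1", "node2"])` with hypothetical host names would repair each node in turn and return the list of nodes that still failed after all retries.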

