cassandra-commits mailing list archives

From "Oleg Dulin (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-5396) Repair process is a joke leading to a downward spiralling and eventually unusable cluster
Date Fri, 01 Nov 2013 19:18:18 GMT


Oleg Dulin commented on CASSANDRA-5396:

You know, this ticket is still valid.

* Repair is fragile. Just now I ran a repair -pr on a node with 50 GB of data, and it froze within only 30 seconds!
* When I restarted the repair, it completed in 2 minutes (see the sketch after this list).
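
As a hedged illustration of that restart-and-retry workaround (nothing Cassandra itself ships): the Python sketch below shells out to nodetool with a timeout and retries if the client hangs. The keyspace name and timeout are hypothetical, it assumes nodetool is on the PATH, and killing the nodetool client does not necessarily abort the server-side repair session.

    import subprocess

    def run_repair(keyspace, timeout_s=3600, retries=1):
        """Run 'nodetool repair -pr' on the local node, retrying if it hangs."""
        cmd = ["nodetool", "repair", "-pr", keyspace]
        for attempt in range(retries + 1):
            try:
                # check=True raises on a nonzero exit; timeout kills the
                # client (the server-side session may keep running).
                subprocess.run(cmd, check=True, timeout=timeout_s)
                return True
            except subprocess.TimeoutExpired:
                print(f"repair hung past {timeout_s}s (attempt {attempt + 1})")
            except subprocess.CalledProcessError as e:
                print(f"repair exited with code {e.returncode}")
        return False

    if __name__ == "__main__":
        run_repair("my_keyspace")  # hypothetical keyspace name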

I asked on IRC, and someone said, "Oh, by the way, silly Oleg, repair -pr in DC2 will only handle the partitioner range for the difference between tokens in DC1 and DC2." Silly me. There is no documentation that explains this -- neither the open-source docs nor DataStax's.
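
For anyone else bitten by this, here is a minimal sketch of why repair -pr covers almost nothing on some multi-DC nodes. The tokens and node names are hypothetical (the offset-by-1 multi-DC token layout that was common advice at the time); Cassandra's actual primary-range logic lives in the server, this just mimics the idea:

    # Sorted ring of (token, node) pairs across BOTH datacenters.
    ring = sorted([
        (0,   "dc1-a"), (1,   "dc2-a"),
        (100, "dc1-b"), (101, "dc2-b"),
        (200, "dc1-c"), (201, "dc2-c"),
    ])

    def primary_range(node):
        """A node's primary range is (previous token on the ring, own token]."""
        for i, (token, owner) in enumerate(ring):
            if owner == node:
                return (ring[i - 1][0], token)  # i == 0 wraps to the last token
        raise ValueError(node)

    # A DC2 node "owns" only the sliver between the neighboring DC1 token
    # and its own token, so repair -pr there validates almost no data.
    print(primary_range("dc2-b"))  # -> (100, 101)
    print(primary_range("dc1-b"))  # -> (1, 100)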

We are running real-world applications here.

> Repair process is a joke leading to a downward spiralling and eventually unusable cluster
> -----------------------------------------------------------------------------------------
>                 Key: CASSANDRA-5396
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.2.3
>         Environment: all
>            Reporter: David Berkman
>            Priority: Critical
> Let's review the repair process...
> 1) It's mandatory to run repair.
> 2) Repair has a high impact and can take hours.
> 3) Repair provides no estimation of completion time and no progress indicator.
> 4) Repair is extremely fragile and can fail to complete, or become stuck, quite easily in real operating environments.
> 5) When repair fails, it provides no feedback whatsoever about the problem or possible resolution.
> 6) A failed repair operation saddles the affected nodes with a huge amount of extra data (judging from node size).
> 7) There is no way to rid a node of the extra data associated with a failed repair short of completely rebuilding the node.
> 8) The extra data from a failed repair makes any subsequent repair take longer and increases the likelihood that it will simply become stuck or fail, leading to yet more node corruption.
> 9) Eventually no repair operation will complete successfully, and node operations will become impacted, leading to a failing cluster.
> Who would design such a system for a service meant to operate as a fault-tolerant, clustered data store on a lot of commodity hardware?
> Solution...
> 1) Repair must be robust.
> 2) Repair must *never* become 'stuck'.
> 3) Failure to complete must result in reasonable feedback.
> 4) Failure to complete must not result in a node whose state is worse than it was before the operation began.
> 5) Repair must provide some means of determining completion percentage.
> 6) It would be nice if repair could estimate its run time, even if only based upon previous runs.

This message was sent by Atlassian JIRA
