cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-2433) Failed Streams Break Repair
Date Tue, 30 Aug 2011 16:37:38 GMT


Sylvain Lebresne commented on CASSANDRA-2433:

bq. Why do we need the new AE_SESSIONS stage?

If you mean "why AE_SESSIONS when we already have the AE stage?", it is because repair
pushes tasks onto the AE stage and then waits for them, so running the session on that same
stage would deadlock. If you mean "why a stage?", it felt cleaner than just a Thread now that
we want to check for exceptions at the end of the execution. If you mean "why a stage rather
than a simple ThreadExecutor?", that is a good question. I guess it was just a reflex of mine
to get a JMXEnabledThreadPool, but it's probably not worth a stage, and maybe not even the
JMX enablement.
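
To illustrate the deadlock concern, here is a minimal sketch using plain java.util.concurrent executors as stand-ins for the stages. It assumes the work stage is single-threaded; the class and method names (StageDeadlockSketch, runSession, and so on) are hypothetical and are not taken from the patch or from Cassandra's stage API.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-ins for the AE and AE_SESSIONS stages (assumption: one thread each).
final class StageDeadlockSketch {

    // Runs a "session" on sessionStage that submits work to workStage and blocks on it.
    // Returns true if the session finished within the timeout, false if it got stuck.
    static boolean runSession(ExecutorService sessionStage, ExecutorService workStage, long timeoutMillis) {
        Future<String> session = sessionStage.submit(() -> {
            Future<String> validation = workStage.submit(() -> "merkle tree");
            return validation.get(); // the session thread waits here for the work to run
        });
        try {
            session.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            session.cancel(true); // interrupt the stuck session so the pool can shut down
            return false;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Session and work share one single-threaded stage: the queued work can never start.
    static boolean sharedStageCompletes() {
        ExecutorService ae = Executors.newSingleThreadExecutor();
        try {
            return runSession(ae, ae, 200);
        } finally {
            ae.shutdownNow();
        }
    }

    // Session runs on its own stage, so the work stage is free to run the submitted task.
    static boolean separateStagesComplete() {
        ExecutorService ae = Executors.newSingleThreadExecutor();
        ExecutorService sessions = Executors.newSingleThreadExecutor();
        try {
            return runSession(sessions, ae, 2000);
        } finally {
            ae.shutdownNow();
            sessions.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("shared stage completes: " + sharedStageCompletes());
        System.out.println("separate stages complete: " + separateStagesComplete());
    }
}
```

In the shared case the only thread is occupied by the session while the work it waits for sits behind it in the queue, which is the situation a separate executor for sessions avoids.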

bq. I prefer using WrappedRunnable to a Callable when you want to allow exceptions but don't
care about a return value

Agreed. I'll update the patch.
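
For readers unfamiliar with the pattern, the sketch below shows the general idea of a Runnable whose body may throw checked exceptions, with no return value to invent. ThrowingRunnable and ThrowingRunnableDemo are illustrative names written for this example; this is not the source of Cassandra's WrappedRunnable, only an approximation of the same idea.

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: a Runnable whose body is allowed to throw checked exceptions.
abstract class ThrowingRunnable implements Runnable {
    @Override
    public final void run() {
        try {
            runMayThrow();
        } catch (RuntimeException e) {
            throw e; // already unchecked, pass through untouched
        } catch (Exception e) {
            throw new RuntimeException(e); // wrap checked exceptions
        }
    }

    protected abstract void runMayThrow() throws Exception;
}

final class ThrowingRunnableDemo {

    // Submits a failing task and returns the message of the original failure
    // as observed by the caller through the Future, or null if nothing failed.
    static String failureSeenByCaller() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> result = executor.submit(new ThrowingRunnable() {
                @Override
                protected void runMayThrow() throws Exception {
                    throw new IOException("stream failed");
                }
            });
            result.get();
            return null;
        } catch (ExecutionException e) {
            // e.getCause() is the RuntimeException wrapper; its cause is the IOException.
            return e.getCause().getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(failureSeenByCaller());
    }
}
```

Compared with a Callable, the task body has no dummy "return null" and the caller still sees the failure when it checks the Future.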

bq. I think we can avoid a bunch of no-op onConvicts if RepairSession were to subscribe to
FD directly instead of going through Gossip

Yeah, I started with that, but the problem is that we must deal with the case of a node
restarting before it has been convicted (especially if the conviction threshold is higher),
which the FD won't see. We could deal with that last situation separately and have Gossip call
some trigger in AntiEntropy on a gossip generation change, telling it to stop every started
session involving the given endpoint, but creating a dependency from gossip to anti-entropy
didn't feel like a good idea a priori.
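
As a rough sketch of the trade-off, the code below models a session that receives both convictions and restart notifications through a single callback interface. Everything here is hypothetical (EndpointEvents, RepairSessionSketch, the string endpoints); the real Cassandra listener interfaces and the patch differ. It only shows why convictions for unrelated endpoints become no-ops, and why a restart needs its own signal when no conviction ever arrives.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical callback interface; not the actual gossip or failure detector API.
interface EndpointEvents {
    void onConvict(String endpoint); // the failure detector decided the endpoint is down
    void onRestart(String endpoint); // the gossip generation changed: the endpoint rebooted
}

final class RepairSessionSketch implements EndpointEvents {
    private final Set<String> participants;
    private volatile String failureReason; // null while the session is healthy

    private RepairSessionSketch(Set<String> participants) {
        this.participants = participants;
    }

    static RepairSessionSketch of(String... endpoints) {
        return new RepairSessionSketch(new HashSet<>(Arrays.asList(endpoints)));
    }

    @Override
    public void onConvict(String endpoint) {
        failIfInvolved(endpoint, "convicted");
    }

    @Override
    public void onRestart(String endpoint) {
        failIfInvolved(endpoint, "restarted");
    }

    private void failIfInvolved(String endpoint, String why) {
        if (!participants.contains(endpoint)) {
            return; // no-op: this endpoint is not part of the session
        }
        if (failureReason == null) {
            failureReason = endpoint + " " + why; // keep the first failure only
        }
    }

    String failureReason() {
        return failureReason;
    }

    public static void main(String[] args) {
        RepairSessionSketch session = RepairSessionSketch.of("10.0.0.1", "10.0.0.2");
        session.onConvict("10.0.0.9"); // unrelated endpoint, ignored
        session.onRestart("10.0.0.2"); // restarted without ever being convicted
        System.out.println(session.failureReason());
    }
}
```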

> Failed Streams Break Repair
> ---------------------------
>                 Key: CASSANDRA-2433
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.5
>         Attachments: 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v4.patch,
> 0002-Register-in-gossip-to-handle-node-failures-v4.patch, 0003-Report-streaming-errors-back-to-repair-v4.patch,
> 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch, 2433.patch, 2433_v2.patch
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to clean itself
> up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up the data
> These issues together are making repair very difficult for nearly everyone running repair
> on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was moved to 0.8
> for a few reasons. This ticket is to fix the immediate issues that we are seeing in 0.7.

This message is automatically generated by JIRA.