cassandra-commits mailing list archives

From "Vincent White (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-13797) RepairJob blocks on syncTasks
Date Thu, 01 Mar 2018 23:18:00 GMT


Vincent White commented on CASSANDRA-13797:

Now that we don't wait for the validations of each repair job to finish before moving on to
the next one, I don't see anything to stop the repair coordinator from spinning through all
the token ranges and effectively triggering all the validation tasks at once, which could
mean a significant number of validation compactions on each node, depending on your topology
and common ranges for that keyspace. I'm also not sure of the overhead of creating all the
futures/listeners on the coordinator at once in this case.

In 3.x the validation executor thread pool has no size limit, so a new validation is started
as soon as a validation request is received. I admit I haven't caught up on the changes to
repair in trunk, and while the validation executor pool size is configurable in trunk, its
default is still Integer.MAX_VALUE.
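As a rough illustration of the difference (plain java.util.concurrent, not Cassandra's actual executor wiring; names are mine):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ValidationPoolSketch {
    public static void main(String[] args) {
        // Effectively unbounded: a max pool size of Integer.MAX_VALUE means every
        // incoming task gets a thread immediately, so hundreds of validation
        // compactions could run concurrently.
        ThreadPoolExecutor unbounded = new ThreadPoolExecutor(
                1, Integer.MAX_VALUE, 60, TimeUnit.SECONDS, new SynchronousQueue<>());

        // Capped alternative: at most 4 validations run at once, the rest queue.
        ThreadPoolExecutor capped = new ThreadPoolExecutor(
                4, 4, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

        System.out.println(unbounded.getMaximumPoolSize()); // 2147483647
        System.out.println(capped.getMaximumPoolSize());    // 4

        unbounded.shutdown();
        capped.shutdown();
    }
}
```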

I understand this same effect (hundreds of concurrent validations) can still happen if you
trigger a repair across a keyspace with a large number of column families, but with this change
there is no way of avoiding it without using subrange repairs on a single column family (if
you have a topology/replication that can't be merged into a small number of common ranges).

> RepairJob blocks on syncTasks
> -----------------------------
>                 Key: CASSANDRA-13797
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Repair
>            Reporter: Blake Eggleston
>            Assignee: Blake Eggleston
>            Priority: Major
>             Fix For: 3.0.15, 3.11.1, 4.0
> The thread running {{RepairJob}} blocks while it waits for the validations it starts
to complete ([see here|]). However, the downstream callbacks (i.e. the post-repair cleanup
stuff) aren't waiting for {{RepairJob#run}} to return; they're waiting for a result to be set
on the {{RepairJob}} future, which happens after the sync tasks have completed. This post-repair
cleanup stuff also immediately shuts down the executor {{RepairJob#run}} is running in. So in
noop repair sessions, where there's nothing to stream, I'm seeing the callbacks sometimes fire
before {{RepairJob#run}} wakes up, causing an {{InterruptedException}} to be thrown.
> I'm pretty sure this can just be removed, but I'd like a second opinion. This appears
to just be a holdover from before repair coordination became async. I thought it might be
doing some throttling by blocking, but each repair session gets its own executor, and validation
is throttled by the fixed-size executors doing the actual work of validation, so I don't
think we need to keep this around.
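The race in the quoted description can be sketched as a minimal standalone analogy (plain java.util.concurrent, illustrative names, not the actual {{RepairJob}} code): a task blocks in its executor while the completion callback, which fires on the future rather than on the task returning, shuts that executor down and interrupts it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownRaceSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService jobExecutor = Executors.newSingleThreadExecutor();
        CompletableFuture<Void> repairResult = new CompletableFuture<>();
        CountDownLatch running = new CountDownLatch(1);

        // Analogue of RepairJob#run: blocks waiting for work that, in a noop
        // session, is already finished.
        jobExecutor.submit(() -> {
            running.countDown();
            try {
                new CountDownLatch(1).await(); // blocks "forever"
            } catch (InterruptedException e) {
                System.out.println("run() interrupted while still blocked");
            }
        });
        running.await();

        // Analogue of the post-repair cleanup: it listens on the future, not on
        // run() returning, and immediately shuts down run()'s executor.
        repairResult.thenRun(jobExecutor::shutdownNow);

        // The result is set before run() has returned, so run() is interrupted.
        repairResult.complete(null);
        jobExecutor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```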

This message was sent by Atlassian JIRA
