cassandra-commits mailing list archives

From "ZhaoYang (Jira)" <>
Subject [jira] [Commented] (CASSANDRA-15566) Repair coordinator can hang under some cases
Date Fri, 06 Mar 2020 12:08:00 GMT


ZhaoYang commented on CASSANDRA-15566:

Hi [~dcapwell], have you started working on this ticket? I am happy to help.

> Repair coordinator can hang under some cases
> --------------------------------------------
>                 Key: CASSANDRA-15566
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Repair
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 4.0-beta
> Repair coordination makes a few assumptions about message delivery that cause it to
hang forever when they do not hold: that a fire-and-forget message will never be rejected
(in practice, a participant can hit an issue and reject the message), and that a very delayed
message will eventually be seen (in practice, messages can be dropped under load, or when the
failure detector thinks a node is down while it is only in a GC pause).
> Given this and the desire to have better observability with repair (see CASSANDRA-15399),
coordination should be changed to a request/response pattern (with retries) plus polling
(for validation status and MerkleTree sending). This would allow the coordinator to detect
changes in state (for example, a participant was known to be working on validation but no
longer knows about the validation task) and to recover from ephemeral issues.
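
The request/response-with-retries and polling idea described above could look roughly like the sketch below. It is only an illustration, not the Cassandra implementation: the class RepairCoordinationSketch, the ValidationStatus enum, MessageLostException, and both method names are hypothetical, and a real coordinator would also wait (with backoff) between attempts and polls.

```java
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the Cassandra code base.
final class RepairCoordinationSketch {

    /** Assumed participant-side view of a validation task. */
    public enum ValidationStatus { RUNNING, COMPLETED, UNKNOWN }

    /** Assumed signal that a request or its reply was dropped or timed out. */
    public static class MessageLostException extends RuntimeException {}

    /**
     * Request/response with retries: every request expects a reply, and a
     * lost message is retried instead of being waited on forever.
     */
    public static <T> T requestWithRetries(Supplier<T> request, int maxAttempts) {
        MessageLostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.get();
            } catch (MessageLostException e) {
                last = e; // ephemeral failure: try again
            }
        }
        throw new IllegalStateException("no reply after " + maxAttempts + " attempts", last);
    }

    /**
     * Polling: repeatedly ask the participant for the validation status.
     * Returns COMPLETED on success, UNKNOWN as soon as the participant no
     * longer knows the task, or RUNNING if maxPolls is exhausted first.
     */
    public static ValidationStatus pollValidation(Supplier<ValidationStatus> statusRequest,
                                                  int maxPolls, int maxAttempts) {
        for (int poll = 0; poll < maxPolls; poll++) {
            ValidationStatus status = requestWithRetries(statusRequest, maxAttempts);
            if (status != ValidationStatus.RUNNING) {
                return status; // COMPLETED, or UNKNOWN (state was lost)
            }
        }
        return ValidationStatus.RUNNING;
    }
}
```

With this shape the coordinator always reaches a decision: a dropped message is retried, and a participant that has forgotten the task is detected on the next poll, so the coordinator can fail or restart the repair instead of hanging.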
