cassandra-commits mailing list archives

From "Jeff Jirsa (JIRA)" <>
Subject [jira] [Updated] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing
Date Thu, 29 Jun 2017 19:18:00 GMT


Jeff Jirsa updated CASSANDRA-13480:
       Resolution: Fixed
    Fix Version/s:     (was: 4.x)
    Reproduced In: 3.0.13, 2.1.16, 4.x  (was: 2.1.16, 3.0.13, 4.x)
           Status: Resolved  (was: Ready to Commit)

Thanks all, committed into 4.0 as {{20d5ce8b9b587be2f0b7bc5765254e8dc6e0bd3b}}

> nodetool repair can hang forever if we lose the notification for the repair completing/failing
> ----------------------------------------------------------------------------------------------
>                 Key: CASSANDRA-13480
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Matt Byrd
>            Assignee: Matt Byrd
>            Priority: Minor
>              Labels: repair
>             Fix For: 4.0
> When a JMX lost-notification event occurs, the lost notification is sometimes the very
notification that lets RepairRunner know that the repair is finished (ProgressEventType.COMPLETE,
or even ERROR for that matter).
> This results in the nodetool process running the repair hanging forever.
> I have a test which reproduces the issue here:
> To fix this, on receiving a notification that notifications have been lost (JMXConnectionNotification.NOTIFS_LOST),
we can instead query a new endpoint via JMX to retrieve all the relevant notifications we're
interested in, replay those we missed, and so avoid this scenario.
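
The replay idea above can be sketched from the client side roughly as follows. This is a minimal illustration and not the committed patch: the `ServerLog` interface, its `eventsAfter` method, the use of sequence numbers, and the string event types are all assumptions made for the example. Only `JMXConnectionNotification.NOTIFS_LOST` and `Notification.getType()` are real JDK API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import javax.management.Notification;
import javax.management.remote.JMXConnectionNotification;

/** Sketch only: ServerLog and eventsAfter are hypothetical, not Cassandra's API. */
class LostNotificationReplaySketch {
    /** Hypothetical stand-in for the new JMX endpoint that returns retained events. */
    interface ServerLog {
        /** Events for the given repair tag with a sequence number above afterSeq. */
        SortedMap<Long, String> eventsAfter(String tag, long afterSeq);
    }

    private final String tag;
    private final ServerLog log;
    private final List<String> handled = new ArrayList<>();
    private long lastSeq = 0;
    private boolean finished = false;

    LostNotificationReplaySketch(String tag, ServerLog log) {
        this.tag = tag;
        this.log = log;
    }

    /** Normal path: one progress event delivered as a JMX notification. */
    synchronized void onProgress(long seq, String type) {
        if (seq <= lastSeq) {
            return; // already handled, for example through an earlier replay
        }
        lastSeq = seq;
        handled.add(type);
        if (type.equals("COMPLETE") || type.equals("ERROR")) {
            finished = true;
        }
    }

    /** Connection-level path: react to the "notifications lost" signal. */
    synchronized void onConnectionNotification(Notification n) {
        if (JMXConnectionNotification.NOTIFS_LOST.equals(n.getType())) {
            // Ask the server for everything newer than what we saw, then replay it.
            log.eventsAfter(tag, lastSeq).forEach((seq, type) -> onProgress(seq, type));
        }
    }

    synchronized boolean isFinished() {
        return finished;
    }

    synchronized List<String> handled() {
        return new ArrayList<>(handled);
    }

    public static void main(String[] args) {
        SortedMap<Long, String> retained = new TreeMap<>();
        retained.put(1L, "START");
        retained.put(2L, "PROGRESS");
        retained.put(3L, "COMPLETE");
        LostNotificationReplaySketch client = new LostNotificationReplaySketch(
            "repair:1", (tag, afterSeq) -> new TreeMap<>(retained.tailMap(afterSeq + 1)));

        client.onProgress(1L, "START"); // delivered normally; events 2 and 3 are lost
        client.onConnectionNotification(new JMXConnectionNotification(
            JMXConnectionNotification.NOTIFS_LOST, "connector", "conn-1", 1L, "2 lost", 2L));
        System.out.println(client.handled() + " finished=" + client.isFinished());
    }
}
```

Tracking the last handled sequence number makes the replay idempotent, so an event that arrives both normally and through a replay is processed once.
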
> It's also possible that the JMXConnectionNotification.NOTIFS_LOST notification itself might
be lost, so for good measure I have made RepairRunner poll periodically to see whether any
notifications were sent that we didn't receive (scoped just to the particular tag
for the given repair).
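
A periodic poll of this kind might look like the sketch below. It is an illustration under assumptions rather than the RepairRunner code: the `statusForTag` supplier stands in for whatever JMX call returns the latest status for the repair's tag, and the period, timeout, and string statuses are invented for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Sketch only: statusForTag is a hypothetical stand-in for a JMX status query. */
class RepairPollSketch {
    /**
     * Polls statusForTag every periodMillis. Returns true once it reports a terminal
     * state, or false if timeoutMillis elapses (or the caller is interrupted) first.
     * Note: if the supplier throws, the scheduler stops polling and the call times out.
     */
    static boolean awaitCompletion(Supplier<String> statusForTag, long periodMillis, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();
        try {
            poller.scheduleWithFixedDelay(() -> {
                String status = statusForTag.get();
                if ("COMPLETE".equals(status) || "ERROR".equals(status)) {
                    done.countDown();
                }
            }, 0, periodMillis, TimeUnit.MILLISECONDS);
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        } finally {
            poller.shutdownNow();
        }
    }

    public static void main(String[] args) {
        AtomicInteger polls = new AtomicInteger();
        // Pretend the repair reaches COMPLETE on the third poll.
        boolean finished = awaitCompletion(
            () -> polls.incrementAndGet() >= 3 ? "COMPLETE" : "PROGRESS", 10, 5_000);
        System.out.println("finished=" + finished + " after " + polls.get() + " polls");
    }
}
```

Because the poll does not depend on any notification arriving, it bounds how long a client can wait even when both the progress event and the NOTIFS_LOST signal are dropped.
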
> Users who don't use nodetool but go via JMX directly can still use this new endpoint
and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
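
One way to expire retained notifications on the server side is a time-to-live check on every read and write, as in the sketch below. The class, its fields, and the TTL policy are assumptions for illustration. The message does not say how the patch expires entries.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.LongSupplier;

/** Sketch only: a hypothetical TTL buffer, not the structure used in Cassandra. */
class ExpiringEventBufferSketch {
    private static final class Entry {
        final long atMillis;
        final String tag;
        final String type;

        Entry(long atMillis, String tag, String type) {
            this.atMillis = atMillis;
            this.tag = tag;
            this.type = type;
        }
    }

    private final Deque<Entry> entries = new ArrayDeque<>(); // oldest entry first
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so the behaviour can be tested

    ExpiringEventBufferSketch(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Retains an event for later replay, dropping anything past its TTL first. */
    synchronized void record(String tag, String type) {
        expire();
        entries.addLast(new Entry(clock.getAsLong(), tag, type));
    }

    /** Returns the still-retained event types for one repair tag, oldest first. */
    synchronized List<String> eventsFor(String tag) {
        expire();
        List<String> out = new ArrayList<>();
        for (Entry e : entries) {
            if (e.tag.equals(tag)) {
                out.add(e.type);
            }
        }
        return out;
    }

    /** Entries are appended in time order, so expired ones are always at the head. */
    private void expire() {
        long cutoff = clock.getAsLong() - ttlMillis;
        while (!entries.isEmpty() && entries.peekFirst().atMillis < cutoff) {
            entries.removeFirst();
        }
    }

    public static void main(String[] args) {
        ExpiringEventBufferSketch buffer =
            new ExpiringEventBufferSketch(60_000, System::currentTimeMillis);
        buffer.record("repair:1", "START");
        buffer.record("repair:1", "COMPLETE");
        System.out.println(buffer.eventsFor("repair:1"));
    }
}
```

The TTL needs to be comfortably longer than the client's poll interval, otherwise an event could expire before a client that missed it asks for a replay.
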
> Please let me know if you have any questions or can think of a different approach. I also
tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help in certain scenarios, but this test
doesn't even send that many notifications, so I'm not surprised it doesn't fix it.
> As far as I can tell, lost notifications are always a potential problem with JMX.

