cassandra-commits mailing list archives

From "Matt Byrd (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing
Date Fri, 28 Apr 2017 23:02:04 GMT


Matt Byrd commented on CASSANDRA-13480:

So the patch I have currently also caches the notifications for repairs for a limited time
on the co-ordinator; it was initially targeting a release where we didn't yet have the repair
history tables.
I suppose there is a concern that caching these notifications could under some circumstances
cause unwanted extra heap usage. 
(Similar to the notifications buffer, although at least here we're only caching a subset
that we care more about.)
So using the repair history tables instead and exposing this information via jmx seems like
a reasonable alternative.
There are perhaps a couple of kinks to work out, but I'll have a go at adapting the patch
that I have to work in this way.
For one, we only have the cmd id int sent back to the nodetool process (rather than the parent
session id, which is the partition key of the internal table).
We could either keep track of the cmd id int -> parent session uuid mapping in the co-ordinator
(in memory, cached to expire, or in another internal table),
or we could parse the uuid out of the notification sent for the start of the parent repair.
Parsing the message is a bit brittle though, and not foolproof in theory (we could miss that
notification also).
Ideally I suppose running a repair could return and communicate on the basis of the parent
session uuid rather than the int cmd id, but this is a pretty major overhaul and has all sorts
of compatibility questions.
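
As a rough illustration of the in-memory option above, here is a minimal sketch of a
cmd id -> parent session uuid map whose entries expire after a fixed time. The class
RepairCommandSessions, its methods and the injected clock are all hypothetical names made
up for this sketch; none of it is taken from the patch or from Cassandra itself.

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical helper (not Cassandra code): cmd id -> parent session uuid, with expiry. */
public class RepairCommandSessions {

    /** A mapping plus the time at which it should stop being served. */
    private static final class Entry {
        final UUID parentSessionId;
        final long expiresAtMillis;

        Entry(UUID parentSessionId, long expiresAtMillis) {
            this.parentSessionId = parentSessionId;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<Integer, Entry> byCmdId = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so expiry can be tested without sleeping

    public RepairCommandSessions(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Record the mapping when the co-ordinator starts a repair command. */
    public void register(int cmdId, UUID parentSessionId) {
        byCmdId.put(cmdId, new Entry(parentSessionId, clock.getAsLong() + ttlMillis));
    }

    /** Look up the parent session for a cmd id; expired entries are dropped lazily. */
    public Optional<UUID> parentSessionFor(int cmdId) {
        Entry e = byCmdId.get(cmdId);
        if (e == null) {
            return Optional.empty();
        }
        if (clock.getAsLong() >= e.expiresAtMillis) {
            byCmdId.remove(cmdId, e); // only remove the entry we actually inspected
            return Optional.empty();
        }
        return Optional.of(e.parentSessionId);
    }

    /** Sweep all expired entries, e.g. from a periodic task, to bound heap usage. */
    public void evictExpired() {
        long now = clock.getAsLong();
        byCmdId.values().removeIf(e -> now >= e.expiresAtMillis);
    }

    public int size() {
        return byCmdId.size();
    }
}
```

With something like this, a jmx endpoint that takes the cmd id could resolve the parent
session uuid first and then read the repair history tables by partition key. The obvious
trade-off versus another internal table is that the mapping does not survive a restart.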

> nodetool repair can hang forever if we lose the notification for the repair completing/failing
> ----------------------------------------------------------------------------------------------
>                 Key: CASSANDRA-13480
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Matt Byrd
>            Assignee: Matt Byrd
>            Priority: Minor
>             Fix For: 4.x
> When a Jmx lost notification occurs, sometimes the lost notification in question is the
notification which lets RepairRunner know that the repair is finished (ProgressEventType.COMPLETE
or even ERROR for that matter).
> This results in the nodetool process running the repair hanging forever.
> I have a test which reproduces the issue here:
> To fix this, if on receiving a notification that notifications have been lost (JMXConnectionNotification.NOTIFS_LOST)
we instead query a new endpoint via Jmx to receive all the relevant notifications we're interested
in, we can replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself might be lost
and so for good measure I have made RepairRunner poll periodically to see if there were any
notifications that had been sent but we didn't receive (scoped just to the particular tag
for the given repair).
> Users who don't use nodetool but go via jmx directly can still use this new endpoint
and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you've any questions or can think of a different approach. I also
tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help under certain scenarios but in
this test we don't even send that many notifications so I'm not surprised it doesn't fix it.
> It seems like getting lost notifications is always a potential problem with jmx as far
as I can tell.
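
For readers who go via jmx directly, the replay idea in the description above might look
roughly like the following client-side listener. This is only a sketch under assumptions:
the replaySource supplier stands in for the new endpoint (whose real name and signature are
not given here), the "repair.complete" / "repair.error" type strings are placeholders rather
than the types Cassandra actually sends, and de-duplication by sequence number is my own
simplification. Only JMXConnectionNotification.NOTIFS_LOST and the javax.management types
are real API.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.remote.JMXConnectionNotification;

/** Hypothetical listener (not the patch itself): replays missed repair notifications. */
public class ReplayingRepairListener implements NotificationListener {

    // Placeholder notification types, assumed for this sketch only.
    static final String COMPLETE = "repair.complete";
    static final String ERROR = "repair.error";

    private final String tag; // source tag identifying the repair we care about
    private final Supplier<List<Notification>> replaySource; // stand-in for the new endpoint
    private final Set<Long> seen = ConcurrentHashMap.newKeySet();
    private volatile boolean finished;

    public ReplayingRepairListener(String tag, Supplier<List<Notification>> replaySource) {
        this.tag = tag;
        this.replaySource = replaySource;
    }

    @Override
    public void handleNotification(Notification n, Object handback) {
        if (JMXConnectionNotification.NOTIFS_LOST.equals(n.getType())) {
            // The connector told us something was dropped: ask the server what we missed.
            replayMissed();
        } else if (tag.equals(n.getSource())) {
            process(n);
        }
    }

    /** Also meant to be called on a timer, in case NOTIFS_LOST itself never arrives. */
    public void replayMissed() {
        for (Notification n : replaySource.get()) {
            if (tag.equals(n.getSource())) {
                process(n);
            }
        }
    }

    private void process(Notification n) {
        if (!seen.add(n.getSequenceNumber())) {
            return; // already handled, either live or via an earlier replay
        }
        if (COMPLETE.equals(n.getType()) || ERROR.equals(n.getType())) {
            finished = true; // the waiting caller can stop blocking
        }
    }

    public boolean isFinished() {
        return finished;
    }
}
```

A client would register this with both the connector (to see NOTIFS_LOST) and the MBean
emitting repair progress, and schedule replayMissed() periodically as a fallback.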
