cassandra-commits mailing list archives

From "Matt Byrd (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing
Date Fri, 28 Apr 2017 01:20:04 GMT
Matt Byrd created CASSANDRA-13480:
-------------------------------------

             Summary: nodetool repair can hang forever if we lose the notification for the repair completing/failing
                 Key: CASSANDRA-13480
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
             Project: Cassandra
          Issue Type: Bug
          Components: Tools
            Reporter: Matt Byrd
            Assignee: Matt Byrd
            Priority: Minor
             Fix For: 4.x


When JMX notifications are lost, the lost notification is sometimes the one that lets
RepairRunner know the repair has finished (ProgressEventType.COMPLETE, or ERROR for that
matter).
The nodetool process running the repair then hangs forever.

I have a test which reproduces the issue here:
https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test

To fix this, when we receive a notification that notifications have been lost (JMXConnectionNotification.NOTIFS_LOST),
we query a new endpoint via JMX to retrieve all the notifications we're interested in. We can
then replay those we missed and avoid this scenario.
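
The replay idea can be sketched roughly as follows. This is only an illustration, not the
patch itself: the class name ReplayingListener, the Supplier standing in for the new JMX
endpoint, the plain "START"/"COMPLETE"/"ERROR" type strings, and de-duplication by sequence
number are all assumptions made for the example. Only NotificationListener, Notification and
JMXConnectionNotification.NOTIFS_LOST are real JDK names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.remote.JMXConnectionNotification;

// Hypothetical listener: replays server-side notifications when the
// JMX connector reports that some notifications were lost.
final class ReplayingListener implements NotificationListener {
    // Assumption: stands in for the new endpoint that returns the
    // notifications the server has kept for this repair.
    private final Supplier<List<Notification>> storedNotifications;
    private final Set<Long> seenSequenceNumbers = new HashSet<>();
    private final List<String> handledTypes = new ArrayList<>();
    private boolean finished;

    ReplayingListener(Supplier<List<Notification>> storedNotifications) {
        this.storedNotifications = storedNotifications;
    }

    @Override
    public synchronized void handleNotification(Notification n, Object handback) {
        if (JMXConnectionNotification.NOTIFS_LOST.equals(n.getType())) {
            // Something was dropped: fetch what the server kept and replay it.
            for (Notification stored : storedNotifications.get()) {
                process(stored);
            }
            return;
        }
        process(n);
    }

    private void process(Notification n) {
        // Skip anything already handled (assumes unique sequence numbers).
        if (!seenSequenceNumbers.add(n.getSequenceNumber())) {
            return;
        }
        handledTypes.add(n.getType());
        // Placeholder type strings; the real events use ProgressEventType.
        if ("COMPLETE".equals(n.getType()) || "ERROR".equals(n.getType())) {
            finished = true;
        }
    }

    public synchronized boolean isFinished() {
        return finished;
    }

    public synchronized List<String> handledTypes() {
        return new ArrayList<>(handledTypes);
    }
}
```

With this shape, a dropped COMPLETE is recovered as soon as NOTIFS_LOST arrives, and
notifications that were already delivered are not processed twice.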

It's also possible that the JMXConnectionNotification.NOTIFS_LOST notification itself is lost,
so for good measure I have made RepairRunner poll periodically to check for any notifications
that were sent but that we didn't receive (scoped to the tag for the given repair).
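
A minimal sketch of such a periodic poll, under stated assumptions: the class RepairPoller,
the Function standing in for the per-tag endpoint, and the event strings are hypothetical and
not taken from the patch. Each poll rescans every event returned for the tag, which is safe
because reaching a terminal state is idempotent.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical poller: periodically asks the server which events it has
// recorded for one repair tag and unblocks the waiter on a terminal event.
final class RepairPoller {
    private final String tag;
    // Assumption: stands in for the endpoint, returning event names for a tag.
    private final Function<String, List<String>> fetchEventsForTag;
    private final CountDownLatch done = new CountDownLatch(1);

    RepairPoller(String tag, Function<String, List<String>> fetchEventsForTag) {
        this.tag = tag;
        this.fetchEventsForTag = fetchEventsForTag;
    }

    // One polling pass; countDown() is a no-op once the latch has reached zero.
    public void pollOnce() {
        for (String event : fetchEventsForTag.apply(tag)) {
            if ("COMPLETE".equals(event) || "ERROR".equals(event)) {
                done.countDown();
            }
        }
    }

    // Schedules pollOnce() with a fixed delay; the caller shuts the executor down.
    public ScheduledExecutorService start(long periodMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::pollOnce, periodMillis, periodMillis,
                                         TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public boolean isDone() {
        return done.getCount() == 0;
    }

    // Blocks until a terminal event has been observed or the timeout elapses.
    public boolean awaitDone(long timeout, TimeUnit unit) throws InterruptedException {
        return done.await(timeout, unit);
    }
}
```

Because the poll does not depend on any notification being delivered, it still terminates the
wait when both the terminal event and NOTIFS_LOST are dropped.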

Users who don't use nodetool but go via JMX directly can still use this new endpoint and
implement similar behaviour in their clients as desired.
I'm also expiring the notifications that are kept on the server side.
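
One way the server-side retention and expiry could look, purely as a sketch: the class
ExpiringNotificationBuffer, the time-to-live policy, and the injected clock are assumptions
for illustration, since the message does not say how expiry is implemented.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical server-side store: keeps recent events per repair tag and
// drops entries older than a time-to-live so it cannot grow without bound.
final class ExpiringNotificationBuffer {
    private static final class Entry {
        final long recordedAtMillis;
        final String tag;
        final String event;

        Entry(long recordedAtMillis, String tag, String event) {
            this.recordedAtMillis = recordedAtMillis;
            this.tag = tag;
            this.event = event;
        }
    }

    private final Deque<Entry> entries = new ArrayDeque<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so expiry is testable

    ExpiringNotificationBuffer(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    public synchronized void record(String tag, String event) {
        expire();
        entries.addLast(new Entry(clock.getAsLong(), tag, event));
    }

    // Returns the unexpired events for one tag, oldest first.
    public synchronized List<String> eventsFor(String tag) {
        expire();
        List<String> result = new ArrayList<>();
        for (Entry e : entries) {
            if (e.tag.equals(tag)) {
                result.add(e.event);
            }
        }
        return result;
    }

    // Entries are appended in time order, so expired ones are at the head.
    private void expire() {
        long cutoff = clock.getAsLong() - ttlMillis;
        while (!entries.isEmpty() && entries.peekFirst().recordedAtMillis < cutoff) {
            entries.removeFirst();
        }
    }
}
```

The time-to-live needs to be comfortably longer than the client's polling interval, otherwise
a terminal event could expire before any client has had a chance to fetch it.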
Please let me know if you have any questions or can think of a different approach. I also
tried setting:
 JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
but this didn't fix the test. It might help in some scenarios, but this test doesn't even
send that many notifications, so I'm not surprised it doesn't fix it.
As far as I can tell, lost notifications are always a potential problem with JMX.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org

