cassandra-commits mailing list archives

From "Chris Lohfink (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-13480) nodetool repair can hang forever if we lose the notification for the repair completing/failing
Date Fri, 28 Apr 2017 02:52:04 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988094#comment-15988094
] 

Chris Lohfink edited comment on CASSANDRA-13480 at 4/28/17 2:51 AM:
--------------------------------------------------------------------

How many notifications you see doesn't affect the notification buffer. JMX creates a buffer
of notifications and cycles through it, indexing new events as they are created. The JMX client
requests events starting from the last index it has seen. Since the server does not store the
state of the clients or know what they are listening for, ALL events, regardless of listening
state, are appended to the buffer. Even if nothing is listening to them, all the storage
notifications, the streaming notifications, and the JVM hotspot notifications are pushed onto
that buffer. If your client takes too long between polls it will lose notifications (and the
server will tell it how many it lost). 5000 still may not be nearly enough, but making that
value too large costs the heap dearly.
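The buffering behavior described above can be illustrated with a minimal, self-contained sketch. This is not Cassandra or JMX code; `NotificationBuffer` and its methods are hypothetical names standing in for the server-side ring buffer, where every event gets a monotonically increasing sequence number, only the most recent `capacity` events are retained, and a slow client can compute how many events it lost forever:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of the shared server-side notification buffer:
// events are numbered sequentially, the oldest are cycled out once
// capacity is reached, and clients poll by last-seen sequence number.
class NotificationBuffer {
    private final int capacity;
    private final Deque<Long> buffer = new ArrayDeque<>(); // stored sequence numbers
    private long nextSeq = 0;

    NotificationBuffer(int capacity) { this.capacity = capacity; }

    // A new event is appended for every notification, regardless of
    // whether any client is listening for it.
    void publish() {
        if (buffer.size() == capacity) buffer.removeFirst(); // oldest cycled out
        buffer.addLast(nextSeq++);
    }

    // A client that last saw sequence `lastSeen` asks for newer events;
    // if the oldest retained event is beyond lastSeen + 1, the gap is
    // the number of notifications lost forever.
    long lostSince(long lastSeen) {
        Long oldest = buffer.peekFirst();
        if (oldest == null || oldest <= lastSeen + 1) return 0;
        return oldest - (lastSeen + 1);
    }
}
```

With a capacity of 5 and 10 published events, a client that last saw event 2 has lost events 3 and 4; a client that kept up loses nothing. This is the trade-off in the comment: a larger capacity shrinks the loss window but holds more objects on the heap.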

Nodetool actually used to just shut down on lost notifications, but in some clusters/workloads
it is almost impossible for a client to keep up. As of CASSANDRA-7909 they are merely logged.
Querying a different endpoint wouldn't really help: only the repair coordinator has the events,
and it doesn't keep them around (they are cycled out of the buffer). We could in theory expose
a JMX operation that checks the repair_history table or current repair states to determine
whether the repair has completed or errored out, and call it on lost notifications to make
sure we did not miss a COMPLETE event.
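The fallback proposed above could look roughly like the following sketch. `RepairStatusChecker` is a hypothetical interface standing in for the not-yet-existing JMX operation; only `JMXConnectionNotification.NOTIFS_LOST` and the listener API are real JMX:

```java
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.remote.JMXConnectionNotification;

// Hypothetical stand-in for the proposed JMX operation that consults
// repair_history / current repair state on the coordinator.
interface RepairStatusChecker {
    boolean isRepairFinished(int repairCommand);
}

// Sketch of a repair listener that, on NOTIFS_LOST, queries the repair
// state directly instead of waiting for a COMPLETE/ERROR event that may
// have been cycled out of the buffer.
class LossAwareRepairListener implements NotificationListener {
    private final RepairStatusChecker checker;
    private final int repairCommand;
    volatile boolean finished = false;

    LossAwareRepairListener(RepairStatusChecker checker, int repairCommand) {
        this.checker = checker;
        this.repairCommand = repairCommand;
    }

    @Override
    public void handleNotification(Notification n, Object handback) {
        if (JMXConnectionNotification.NOTIFS_LOST.equals(n.getType())) {
            // We may have missed the terminal event; ask the coordinator.
            if (checker.isRepairFinished(repairCommand)) finished = true;
        }
    }
}
```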



> nodetool repair can hang forever if we lose the notification for the repair completing/failing
> ----------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13480
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13480
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Matt Byrd
>            Assignee: Matt Byrd
>            Priority: Minor
>             Fix For: 4.x
>
>
> When a JMX lost notification occurs, sometimes the lost notification in question is the
notification which lets RepairRunner know that the repair is finished (ProgressEventType.COMPLETE,
or even ERROR for that matter).
> This results in the nodetool process running the repair hanging forever.
> I have a test which reproduces the issue here:
> https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test
> To fix this: on receiving a notification that notifications have been lost (JMXConnectionNotification.NOTIFS_LOST),
we instead query a new endpoint via JMX to receive all the relevant notifications we're interested
in, so we can replay those we missed and avoid this scenario.
> It's possible also that the JMXConnectionNotification.NOTIFS_LOST itself might be lost,
so for good measure I have made RepairRunner poll periodically to see if there were any
notifications that had been sent but we didn't receive (scoped just to the particular tag
for the given repair).
> Users who don't use nodetool but go via JMX directly can still use this new endpoint
and implement similar behaviour in their clients as desired.
> I'm also expiring the notifications which have been kept on the server side.
> Please let me know if you have any questions or can think of a different approach. I also
tried setting:
>  JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
> but this didn't fix the test. I suppose it might help in certain scenarios, but in this
test we don't even send that many notifications, so I'm not surprised it doesn't fix it.
> It seems like lost notifications are always a potential problem with JMX, as far as I
can tell.
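The safety net the issue describes (polling periodically in case NOTIFS_LOST itself is lost) could be sketched as below. `fetchFinished` stands in for the hypothetical JMX query of repair state; no such operation name exists in Cassandra as of this discussion:

```java
import java.util.concurrent.Callable;

// Sketch of a poll-based fallback: instead of relying solely on pushed
// notifications, periodically ask the server whether the repair reached
// a terminal state (completed or errored out).
class RepairPoller {
    static boolean awaitRepair(Callable<Boolean> fetchFinished,
                               long pollMillis, long timeoutMillis) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (fetchFinished.call()) return true; // terminal state observed
            Thread.sleep(pollMillis);
        }
        return false; // outcome still unknown; caller decides how to report
    }
}
```

Combined with a NOTIFS_LOST handler, this bounds how long nodetool can hang: even if every notification for the repair is dropped, the poll eventually observes the terminal state or times out.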



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


