cassandra-commits mailing list archives

From "Jeff Jirsa (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-13308) Gossip breaks, Hint files not being deleted on nodetool decommission
Date Fri, 17 Mar 2017 20:43:41 GMT


Jeff Jirsa commented on CASSANDRA-13308:

So we'll just grab the future, cancel it if it's running, remove it from the map of tasks, and
let the cleanup continue:

|| branch || utests || dtests ||
| [3.0|] | [testall|] | [dtest|] |
| [3.11|] | [testall|] | [dtest|] |
| [trunk|] | [testall|] | [dtest|] |

New dtest demonstrating this failure mode is @
(and dtests linked in this branch have been started against this dtest repo/branch). 
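
For readers without the branches at hand, the approach described above (grab the future, cancel it if it is running, remove it from the map of tasks) can be illustrated with a minimal, self-contained sketch. This is not the patch itself: the class name HintDispatchSketch, the scheduledDispatches map, and the method names are hypothetical, and the real change may be structured differently.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a hint dispatcher; names are illustrative only
// and are not taken from the Cassandra code base.
class HintDispatchSketch {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // One in-flight (or queued) dispatch task per target host.
    private final Map<UUID, Future<?>> scheduledDispatches = new ConcurrentHashMap<>();

    // Schedule a dispatch for hostId unless one is already tracked.
    public Future<?> dispatch(UUID hostId, Runnable task) {
        return scheduledDispatches.computeIfAbsent(hostId, id -> executor.submit(task));
    }

    // Grab the future, cancel it (interrupting the worker if it has started),
    // and drop it from the map so that cleanup such as deleting hint files
    // does not have to wait for the dispatch to finish.
    public boolean interruptDispatch(UUID hostId) {
        Future<?> future = scheduledDispatches.remove(hostId);
        if (future == null)
            return false; // nothing was scheduled for this host
        future.cancel(true);
        return true;
    }

    public boolean isScheduled(UUID hostId) {
        return scheduledDispatches.containsKey(hostId);
    }

    public void shutdown() {
        executor.shutdownNow();
    }
}
```

In this sketch the caller responsible for removing a decommissioned node would call interruptDispatch(hostId) before deleting that host's hint files, so the calling thread is not blocked behind a long-running dispatch.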

> Gossip breaks, Hint files not being deleted on nodetool decommission
> --------------------------------------------------------------------
>                 Key: CASSANDRA-13308
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>         Environment: Using Cassandra version 3.0.9
>            Reporter: Arijit
>            Assignee: Jeff Jirsa
>         Attachments: 28207.stack, logs, logs_decommissioned_node
> How to reproduce the issue I'm seeing:
> Shut down Cassandra on one node of the cluster and wait until we accumulate a ton of
hints. Start Cassandra on the node and immediately run "nodetool decommission" on it.
> The node streams its replicas and marks itself as DECOMMISSIONED, but other nodes do
not seem to see this message. "nodetool status" shows the decommissioned node in state "UL"
on all other nodes (it is also present in system.peers), and Cassandra logs show that gossip
tasks on nodes are not proceeding (number of pending tasks keeps increasing). Jstack suggests
that a gossip task is blocked on hints dispatch (I can provide traces if this is not obvious).
Because the cluster is large and there are a lot of hints, this is taking a while. 
> On inspecting "/var/lib/cassandra/hints" on the nodes, I see a bunch of hint files for
the decommissioned node. Documentation seems to suggest that these hints should be deleted
during "nodetool decommission", but it does not seem to be the case here. This is the bug
being reported.
> To recover from this scenario, if I manually delete hint files on the nodes, the hints
dispatcher threads throw a bunch of exceptions and the decommissioned node is now in state
"DL" (perhaps it missed some gossip messages?). The node is still in my "system.peers" table.
> Restarting Cassandra on all nodes after this step does not fix the issue (the node remains
in the peers table). In fact, after this point the decommissioned node is in state "DN".

