cassandra-user mailing list archives

From Andrew Losey <and...@addthis.com>
Subject nodetool removenode hangs - for everyone?
Date Fri, 24 Jan 2014 14:50:56 GMT
I have observed the problem described in ticket 6542, https://issues.apache.org/jira/browse/CASSANDRA-6542,
in my environment. It isn't a new problem: we've seen it across several vnode-enabled clusters
of different sizes for much longer than the ticket has existed. It has definitely been around
since 1.2.11 (we're on 1.2.12), and likely longer than that.

About 10% of the time, depending on the size of the cluster, 'removenode' works. When it does,
'nodetool removenode status' slowly shows the list of IPs it is waiting on getting shorter.

Typical output looks like this:

"RemovalStatus: Removing token (1133935256116267454566500603062154024). Waiting for replication
confirmation from [/xxx.xxx.xxx.xxx,/xxx.xxx.xxx.xxx,/etc,/etc]" 

Likewise, 'nodetool status' on each node shows the node-to-be-removed in DownLeaving status.
As each replication confirmation comes through, the confirming node's IP disappears from the
waiting list, and the removed node is no longer listed in 'nodetool status' on that node.
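
For anyone who wants to compare notes, here is a rough sketch of how the waiting list could be pulled out of that status line and diffed between two checks. The regex is modeled only on the sample output quoted above, so the exact format on other versions is an assumption, and the function names are made up for illustration.

```python
import re

# Pattern based on the sample line in this message; other Cassandra
# versions may print something different (assumption).
STATUS_RE = re.compile(
    r"RemovalStatus: Removing token \((?P<token>-?\d+)\)\.\s+"
    r"Waiting for replication\s+confirmation from \[(?P<nodes>[^\]]*)\]"
)


def parse_removal_status(text):
    """Return (token, set of pending endpoints), or None if nothing matches."""
    match = STATUS_RE.search(text)
    if match is None:
        return None
    nodes = {
        entry.strip().lstrip("/")
        for entry in match.group("nodes").split(",")
        if entry.strip()
    }
    return match.group("token"), nodes


def newly_confirmed(before, after):
    """Endpoints that were pending in the earlier check but not the later one."""
    return before - after
```

Feeding it the text printed by 'nodetool removenode status' at two different times and calling newly_confirmed() on the two sets shows which nodes have sent their confirmation in between.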

But this rarely works the way it's supposed to. Typically, one or two nodes offer their replication
confirmation and then, as described in the ticket, nothing else happens. After hours or even
days of waiting, you have to use 'nodetool removenode force' to complete the process.
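
To put a number on "nothing else happens", a small check like the one below could flag a stall from a series of status snapshots. This is only a sketch: the (timestamp, pending set) input shape and the idle threshold are my own assumptions, and it deliberately does not run 'nodetool removenode force' itself, since that decision should stay with an operator.

```python
def is_stalled(observations, max_idle_seconds):
    """Decide whether a removal looks stuck.

    observations: list of (unix_seconds, pending_set) tuples ordered by time,
    where pending_set holds the endpoints still owing a confirmation.
    Returns True if the latest pending set is non-empty and has been
    unchanged for at least max_idle_seconds.
    """
    if not observations:
        return False
    last_time, last_pending = observations[-1]
    if not last_pending:
        return False  # nothing left to wait for, so the removal finished

    # Walk backwards to the earliest snapshot that matches the latest one.
    unchanged_since = last_time
    for timestamp, pending in reversed(observations):
        if pending != last_pending:
            break
        unchanged_since = timestamp
    return last_time - unchanged_since >= max_idle_seconds
```

For example, with snapshots taken every ten minutes, is_stalled(snapshots, 3600) would report a stall once the waiting list has not changed for an hour.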

Does this happen for everyone? If it does, what versions are you running? What's the size
of your cluster? Any log entries observed that indicate there's a problem with the process?
Are there any rain dances people do to make removenode work the first time? Maybe we can get
a bump in visibility on this issue.