cassandra-user mailing list archives

From Shao-Chuan Wang <shaochuan.w...@bloomreach.com>
Subject Re: nodetool removenode hangs - for everyone?
Date Fri, 24 Jan 2014 17:02:41 GMT
We were seeing this issue on 2.0.1, with 9 nodes in one DC and 12 nodes in
another. Each DC has a replication factor of 3 for all keyspaces. Does
anyone know how to work around this and make nodetool removenode work?


On Fri, Jan 24, 2014 at 6:50 AM, Andrew Losey <andrew@addthis.com> wrote:

> The problem described in ticket 6542,
> https://issues.apache.org/jira/browse/CASSANDRA-6542, has been observed
> in my environment. This isn't a new problem: it has been seen across
> several differently sized, vnode-enabled clusters for much longer than the
> age of the ticket. The problem has definitely been around since
> 1.2.11 (we're on 1.2.12), and likely longer than that.
>
> About 10% of the time, depending on the size of the cluster, 'removenode'
> works. When it does, the list of IPs reported by 'removenode status'
> slowly shrinks.
>
> Typical output looks like this:
>
> "RemovalStatus: Removing token (1133935256116267454566500603062154024).
> Waiting for replication confirmation from
> [/xxx.xxx.xxx.xxx,/xxx.xxx.xxx.xxx,/etc,/etc]"
>
> Likewise, 'nodetool status' on each node shows the node being removed
> in DownLeaving status. As each replication confirmation comes through, that
> IP disappears from the waiting list, and the node being removed is no
> longer listed in 'nodetool status' on that respective node.
>
> But this rarely works the way it's supposed to. Typically, one or two
> nodes offer their replication confirmation and then, as described in the
> ticket, nothing else happens. After hours or even days of waiting, you have
> to use 'nodetool removenode force' to complete the process.
>
> Does this happen for everyone? If it does, what versions are you running?
> What's the size of your cluster? Any log entries observed that indicate
> there's a problem with the process? Are there any rain dances people do to
> make removenode work the first time? Maybe we can get a bump in visibility
> on this issue.
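
For anyone who wants to watch for the stall described above rather than
checking by hand, here is a minimal Python sketch. It only assumes the
'removenode status' output format shown in the quoted sample; the function
names (pending_nodes, is_stalled) and the idea of a stall threshold are
illustrative, not part of nodetool. It does not run nodetool itself; you
would feed it status text captured however you normally collect it.

```python
import re

# Matches the waiting list in a 'removenode status' line shaped like the
# sample quoted above. The format is assumed from that one sample only.
_WAITING = re.compile(
    r"Waiting for replication confirmation from\s*\[([^\]]*)\]", re.S
)


def pending_nodes(status_output):
    """Return the set of addresses still present in the waiting list."""
    match = _WAITING.search(status_output)
    if match is None:
        return set()
    entries = (item.strip().lstrip("/") for item in match.group(1).split(","))
    return {entry for entry in entries if entry}


def is_stalled(snapshots, stall_seconds):
    """Decide whether a removal looks stuck.

    snapshots is a chronological list of (unix_time, status_output) pairs.
    The removal counts as stalled when the waiting list is non-empty and
    has not changed for at least stall_seconds (a hypothetical threshold
    you choose yourself).
    """
    if not snapshots:
        return False
    last_time, last_output = snapshots[-1]
    current = pending_nodes(last_output)
    if not current:
        return False
    unchanged_since = last_time
    # Walk backwards to find how long the waiting list has looked like this.
    for taken_at, output in reversed(snapshots):
        if pending_nodes(output) != current:
            break
        unchanged_since = taken_at
    return last_time - unchanged_since >= stall_seconds
```

If is_stalled returns True after a threshold you are comfortable with, that
is the point at which the quoted message falls back to 'nodetool removenode
force'. Whether forcing is safe for your data is a separate question that
this sketch does not answer.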
