cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jeff Jirsa (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-13968) Cannot replace a live node on large clusters
Date Fri, 20 Oct 2017 23:51:00 GMT


Jeff Jirsa commented on CASSANDRA-13968:

Tagging [~jasobrown] and [~jkni] who have touched gossip quite a bit more than the rest of
us - either of you feel like you're interested in reviewing?

> Cannot replace a live node on large clusters
> --------------------------------------------
>                 Key: CASSANDRA-13968
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 2.1.17, Ubuntu Trusty/Xenial (Linux 3.13, 4.4)
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>              Labels: gossip
>         Attachments: 0001-During-node-replacement-check-for-updates-in-the-tim.patch,
> During forced node replacements we very frequently (~every time for large clusters) see:
> {noformat}
> ERROR [main] 2017-10-17 06:54:35,680 - Exception encountered
during startup
> java.lang.UnsupportedOperationException: Cannot replace a live node...
> {noformat}
> The old node is dead, the new node that is replacing it thinks it is dead (DN state),
and all other nodes think it is dead (all have the DN state). However, I believe there are
two bugs in the "is live" check that can cause this error, namely that:
> 1. We sleep for [BROADCAST_INTERVAL|]
(hardcoded 60s on 2.1, on later version configurable but still 60s by default), but [check|]
for an update in the last RING_DELAY seconds (typically set to 30s). When a fresh node is
joining, in my experience, [the schema|]
check almost immediately returns true after gossiping with seeds, so in reality we do not
even sleep for RING_DELAY. If operators increase ring delay past broadcast_interval (as you
might do if you think you are victim to the second bug below), then you guarantee that you
will always get the exception because the gossip update is basically guaranteed to happen
in the last RING_DELAY seconds since you didn't sleep for that duration (you slept for broadcast).
For example if an operator sets ring delay to 300s, then the check says "oh yea, the last
update was 59 seconds ago, which is sooner than 300s, so fail".
> 2. We don't actually check that the node is alive, we just check that a gossip update
has happened in the last X seconds. Sometimes with large clusters nodes are still converging
on the proper generation/version of a dead node, and the "is live" check prevents an operator
from replacing the node until gossip has settled on the cluster regarding the dead node, which
for large clusters can take a really long time. This can be really hurtful to availability
in cloud environments and every time I've seen this error it's the case that the new node
believes that the old node is down (since [markAlive|]
[marks dead|]
first and then triggers a callback to [realMarkAlive|]
which never triggers because the old node is actually down).
> I think that #1 is definitely a bug, #2 might be considered an extra safety" feature
(that you don't allow replacement during gossip convergence), but given that the operator
took the effort to supply the replace_address flag, I think it's prudent to only fail if we
really know something is wrong.
> I've attached two patches against 2.1, one that fixes bug #1 and one that fixes (imo)
bug #2. I was thinking for #1 that we may want to prevent the schema check from exiting the
RING_DELAY sleep early but maybe it's just better to backport configurable broadcast_interval
and pick the maximum or something. If we don't like the way I've worked around #2, maybe I
could make it an option that operators could turn on if they wanted? If folks are happy with
the approach I can attach patches for 2.2, 3.0, and 3.11.
> A relevant example of a log showing the first bug (in this case the node that was being
replaced was drained moving it to shutdown before replacement, and ring delay was forced to
300s because the cluster is very large):
> {noformat}
> INFO  [ScheduledTasks:1] 2017-10-17 06:41:21,325 - Overriding
RING_DELAY to 300000ms
> INFO  [main] 2017-10-17 06:41:24,240 - Gathering node replacement
information for /OLD_ADDRESS
> INFO  [GossipStage:1] 2017-10-17 06:41:25,198 - Node /OLD_ADDRESS
is now part of the cluster
> INFO  [GossipStage:1] 2017-10-17 06:41:25,200 - InetAddress /OLD_ADDRESS
is now DOWN
> INFO  [main] 2017-10-17 06:41:25,617 - JOINING: waiting for
ring information
> INFO  [main] 2017-10-17 06:41:29,618 - JOINING: schema complete,
ready to bootstrap
> INFO  [main] 2017-10-17 06:41:29,618 - JOINING: waiting for
pending range calculation
> INFO  [main] 2017-10-17 06:41:29,718 - JOINING: calculation
complete, ready to bootstrap
> INFO  [GossipStage:1] 2017-10-17 06:41:30,606 - Node /OLD_ADDRESS
is now part of the cluster
> INFO  [GossipStage:1] 2017-10-17 06:41:30,606 - Updating topology
> INFO  [GossipStage:1] 2017-10-17 06:41:30,607 - Updating topology
> INFO  [GossipStage:1] 2017-10-17 06:41:30,614 - InetAddress /OLD_ADDRESS
is now DOWN
> ERROR [main] 2017-10-17 06:42:29,722 - Exception encountered
during startup
> java.lang.UnsupportedOperationException: Cannot replace a live node...
> {noformat}

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message