cassandra-user mailing list archives

From Oleksandr Shulgin <>
Subject Re: Recover lost node from backup or evict/re-add?
Date Thu, 13 Jun 2019 13:29:22 GMT
On Thu, Jun 13, 2019 at 3:16 PM Jeff Jirsa <> wrote:

> On Jun 13, 2019, at 2:52 AM, Oleksandr Shulgin <> wrote:
>
>> On Wed, Jun 12, 2019 at 4:02 PM Jeff Jirsa <> wrote:
>>
>>> To avoid violating consistency guarantees, you have to repair the
>>> replicas while the lost node is down
>>
>> How do you suggest to trigger it?  Potentially replicas of the primary
>> range for the down node are all over the local DC, so I would go with
>> triggering a full cluster repair with Cassandra Reaper.  But isn't it
>> going to fail because of the down node?
>
> I'm not sure there's an easy and obvious path here - this is something
> TLP may want to enhance Reaper to help with.
>
> You have to specify the ranges with -st/-et, and you have to tell it to
> ignore the down host with -hosts. With vnodes you're right that this may
> be lots and lots of ranges all over the ring.
>
> There's a patch proposed (maybe committed in 4.0) that makes this a
> nonissue by allowing bootstrap to stream one repaired set and all of the
> unrepaired replica data (which is probably very small if you're running
> IR regularly), which accomplishes the same thing.

Ouch, it really hurts to learn this. :(
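For illustration, the subrange repair Jeff describes could be scripted along these lines. This is only a rough sketch: the token ranges and host addresses are made-up placeholders (in practice you would derive them from `nodetool ring` or the system tables), and the exact `-hosts` flag syntax may vary between Cassandra versions.

```python
# Sketch: build per-range "nodetool repair" invocations that skip a down
# host, per the -st/-et/-hosts suggestion above. All ranges and addresses
# below are hypothetical placeholders.

# (start_token, end_token) pairs for ranges replicated on the dead node;
# with vnodes there may be hundreds of these.
ranges = [
    (-9223372036854775808, -6148914691236517206),
    (-6148914691236517206, -3074457345618258603),
]

# Live replicas to repair against; the dead node is excluded.
live_hosts = ["10.0.0.2", "10.0.0.3"]

def repair_commands(keyspace, ranges, live_hosts):
    """Return one nodetool repair command line per token range."""
    cmds = []
    for st, et in ranges:
        hosts = " ".join("-hosts {}".format(h) for h in live_hosts)
        cmds.append(
            "nodetool repair {} -st {} -et {} {}".format(keyspace, st, et, hosts))
    return cmds

for cmd in repair_commands("my_keyspace", ranges, live_hosts):
    print(cmd)
```

With vnodes this easily generates hundreds of invocations, which is exactly the pain point discussed above.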

>> It is also documented (I believe) that one should repair the node after
>> it finishes the "replace address" procedure.  So should one repair
>> before and after?
>
> You do not need to repair after the bootstrap if you repair before. If
> the docs say that, they're wrong. The joining host gets writes during
> bootstrap and consistency levels are altered during bootstrap to account
> for the joining host.

This is what I had in mind (what makes replacement different from actual
bootstrap of a new node):


If any of the following cases apply, you MUST run repair to make the
replaced node consistent again, since it missed ongoing writes during/prior
to bootstrapping. The *replacement* timeframe refers to the period from when
the node initially dies to when a new node completes the replacement:

   1. The node is down for longer than max_hint_window_in_ms before being
      replaced.
   2. You are replacing using the same IP address as the dead node and
      replacement takes longer than max_hint_window_in_ms.

I would imagine that any production-size instance would take way longer to
replace than the default max hint window (which is 3 hours, AFAIK).  I
didn't remember the same-IP restriction, but that is also what I would
expect to be the most common setup.
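As a back-of-the-envelope check, whether a replacement falls outside the hint window (and thus requires repair per the documented rule above) is a simple comparison. The default for max_hint_window_in_ms is 10800000 (3 hours); the downtime figure below is hypothetical.

```python
# Sketch: decide whether a replaced node needs repair, based on the
# documented max_hint_window_in_ms rule. Downtime value is hypothetical.

MAX_HINT_WINDOW_IN_MS = 10_800_000  # cassandra.yaml default: 3 hours

def needs_repair(downtime_ms, hint_window_ms=MAX_HINT_WINDOW_IN_MS):
    """Repair is required if the node was down (from failure until the
    replacement finished joining) for longer than the hint window."""
    return downtime_ms > hint_window_ms

# A production-size replacement can easily exceed 3 hours:
replacement_hours = 8  # hypothetical
print(needs_repair(replacement_hours * 3_600_000))  # True
```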

