cassandra-user mailing list archives

From Jeff Jirsa <jji...@gmail.com>
Subject Re: Recover lost node from backup or evict/re-add?
Date Thu, 13 Jun 2019 13:40:44 GMT


> On Jun 13, 2019, at 6:29 AM, Oleksandr Shulgin <oleksandr.shulgin@zalando.de> wrote:
> 
>> On Thu, Jun 13, 2019 at 3:16 PM Jeff Jirsa <jjirsa@gmail.com> wrote:
> 
>> On Jun 13, 2019, at 2:52 AM, Oleksandr Shulgin <oleksandr.shulgin@zalando.de> wrote:
>> On Wed, Jun 12, 2019 at 4:02 PM Jeff Jirsa <jjirsa@gmail.com> wrote:
>>>> To avoid violating consistency guarantees, you have to repair the replicas
while the lost node is down
>>> 
>>> How do you suggest to trigger it?  Potentially replicas of the primary range
for the down node are all over the local DC, so I would go with triggering a full cluster
repair with Cassandra Reaper.  But isn't it going to fail because of the down node?  
>> I’m not sure there’s an easy and obvious path here - this is something TLP may want
to enhance Reaper to help with. 
>> 
>> You have to specify the ranges with -st/-et, and you have to tell it to ignore the
down host with -hosts. With vnodes you’re right that this may be lots and lots of ranges
all over the ring.
>> 
>> There’s a patch proposed (maybe committed in 4.0) that makes this a non-issue by
allowing bootstrap to stream one repaired set and all of the unrepaired replica data (which
is probably very small if you’re running IR regularly), which accomplishes the same thing.
> 
> Ouch, it really hurts to learn this. :(
>>> It is also documented (I believe) that one should repair the node after it finishes
the "replace address" procedure.  So should one repair before and after?
>> You do not need to repair after the bootstrap if you repair before. If the docs say
that, they’re wrong. The joining host gets writes during bootstrap and consistency levels
are altered during bootstrap to account for the joining host.
> 
> This is what I had in mind (what makes replacement different from actual bootstrap of
a new node):

Bootstrapping a new node does not require repairs at all.

Replacing a node requires repair only if you need to guarantee consistency (that is, to avoid
violating quorum), because bootstrap streaming only streams from one replica.

Think this way:

Host 1, 2, 3 in a replica set
You write value A to some key
It lands on hosts 1 and 3. Host 2 was being restarted or something
Host 2 comes back up
Host 3 fails

If you replace 3 with 3’:
3’ may stream from host 1, and now you’ve got a quorum of replicas with A.
3’ may stream from host 2, and now you’ve got a quorum of replicas without A. This is illegal.
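
To make the two outcomes above concrete, here is a toy model in Python. It is only a sketch of the scenario, not of Cassandra's actual streaming code: the host names, the `has_a` map and both helper functions are made up for illustration, and it assumes RF=3 with quorum reads.

```python
from itertools import combinations

# Replica state just before host 3 fails: which hosts hold value A.
# Host 2 missed the write because it was restarting.
has_a = {"host1": True, "host2": False, "host3": True}


def every_quorum_sees_a(replicas):
    """True if every possible quorum of replicas includes at least one copy of A."""
    quorum = len(replicas) // 2 + 1
    return all(
        any(replicas[host] for host in group)
        for group in combinations(replicas, quorum)
    )


def replace_host3(source):
    """Replace host3 with host3', which copies its data from one source replica."""
    after = {"host1": has_a["host1"], "host2": has_a["host2"]}
    after["host3'"] = has_a[source]
    return after


before_ok = every_quorum_sees_a(has_a)
from_host1 = replace_host3("host1")
from_host2 = replace_host3("host2")

print("before failure:     ", before_ok)
print("streamed from host1:", every_quorum_sees_a(from_host1))
print("streamed from host2:", every_quorum_sees_a(from_host2))
```

In the second replacement, host 2 and host 3’ together form a quorum that has never seen A, which is the violation described above.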

This is just a statistics game - do you have hosts missing writes? If so, are hints delivering
them when those hosts come back? What’s the cost of violating consistency in that second
scenario to you? 

If you’re running something where correctness really really really matters, you must repair
first. If you’re actually running a truly eventual consistency use case and reading stale
writes is fine, you probably won’t ever notice.  
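
For the repair-first path, a rough sketch of how the per-range repairs mentioned earlier in this thread (-st/-et for the range, -hosts to leave out the dead node) could be generated. Everything here is hypothetical: the keyspace name, the token ranges and the IP addresses are placeholders, and you would need to work out the real ranges owned by the dead node yourself. As far as I know, the node you run the command on has to be in the -hosts list. The script only builds the command lines, it does not run them.

```python
def repair_commands(keyspace, ranges, live_hosts):
    """Build one `nodetool repair` argument list per (start_token, end_token) range.

    Only the live replicas are passed via -hosts, so the down node is skipped.
    """
    commands = []
    for start_token, end_token in ranges:
        cmd = ["nodetool", "repair", "-st", str(start_token), "-et", str(end_token)]
        for host in live_hosts:
            cmd += ["-hosts", host]
        cmd.append(keyspace)
        commands.append(cmd)
    return commands


# Placeholder inputs, for illustration only.
example_ranges = [(-9000, -8000), (1000, 2000)]
example_hosts = ["10.0.0.1", "10.0.0.2"]

commands = repair_commands("my_keyspace", example_ranges, example_hosts)
for cmd in commands:
    print(" ".join(cmd))
```

With vnodes the list of ranges gets long, which is why generating the commands beats typing them by hand.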

In any case these docs are weird and wrong - joining nodes get writes in all versions of Cassandra
for the past few years (at least 2.0+), so the docs really need to be fixed.

> http://cassandra.apache.org/doc/latest/operating/topo_changes.html?highlight=replace%20address#replacing-a-dead-node

> Note
> If any of the following cases apply, you MUST run repair to make the replaced node consistent
again, since it missed ongoing writes during/prior to bootstrapping. The replacement timeframe
refers to the period from when the node initially dies to when a new node completes the replacement
process.
> 
> The node is down for longer than max_hint_window_in_ms before being replaced.
> You are replacing using the same IP address as the dead node and replacement takes longer
than max_hint_window_in_ms.
> 
> I would imagine that any production size instance would take way longer to replace than
the default max hint window (which is 3 hours, AFAIK).  Didn't remember the same IP restriction,
but at least this I would also expect to be the most common setup.
> 
> --
> Alex
> 
