lucene-dev mailing list archives

From Per Steffensen <>
Subject Avoid losing data on ZK connection-loss/session-timeout
Date Tue, 21 Aug 2012 07:23:00 GMT

Accidentally started a discussion around SUBJECT on issue SOLR-3721. So as 
not to mix things up too much, I encourage that we continue the 
discussion here. It is an important issue (at least for my 
organization), and I believe the current solution in Solr 4.x is not 
solid enough (I have seen it fail in practice on high-load/high-concurrency 
setups). I will start by quoting the essential stuff around SUBJECT 
from SOLR-3721. Hope you Solr devs (and other interested folks) will 
join the discussion.

Regards Per Steffensen

------------------------ quotings from SOLR-3721 
*Per Steffensen:*
What if two Solrs, respectively running leader and replica for the same 
slice (only one replica), lose their ZK connection at about the same 
time? Then there will be no active shard that either of them can recover 
from. This scenario shouldn't end in a situation where the slice is just 
dead. The two shards in the same slice ought to find out who has the 
newest version of the shard-data (probably the one that was 
leader last), make that shard the leader (without recovering) and let 
the other shard recover from it. Is this scenario handled (in the way I 
suggest or in another way) already in Solr 4.0 (beta - tip of branch), or 
is that a future thing (e.g. in 4.1 or 5.0)?

*Mark Miller:*
*1)* No recovery will be started if a node cannot talk to zookeeper.

So nothing would happen until one or both of the nodes reconnected to 
ZooKeeper. That would trigger a leader election; that leader node would 
attempt to sync up with all the other nodes for that shard, and any 
recoveries would proceed against him.

A little more detail on the "leader attempts to sync up":

*2)* When a new node is elected as a leader by ZooKeeper, it first tries 
to do a peer sync against every other live node. So let's say the first 
node in your two-node situation comes back and he is behind the other 
node, but he comes back first and is elected leader. The second node has 
the latest updates, but is second in line to be leader and a few updates 
ahead. The potential leader will try to peer sync with the other node 
and get those missing updates if it's fewer than 100, or fail because the 
other node is ahead by too much.

If the peer sync fails, the potential leader will give up his leader 
role, realizing that there seems to be a better candidate. The other 
node, being next in line to be leader, will now try to peer sync 
with the other nodes in the shard. In this case, that will be a success 
since he is ahead of the first node. He will then ask the other nodes to 
peer sync to him. If they are fewer than 100 docs behind, it will 
succeed. If any sync-back attempts fail, the leader asks them to 
recover and they will replicate. Only after this sync process is 
completed does the leader advertise that he is now the leader in the 
cloud state.

That is the current process - we will continually be hardening and 
improving it I'm sure.
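[Editor's note: the election-and-peer-sync process Mark describes can be sketched as a small simulation. This is purely illustrative; the class and method names are hypothetical, and only the "fewer than 100 updates" threshold and the yield-to-a-better-candidate behavior come from the mail above.]

```java
public class PeerSyncSketch {
    static final int PEER_SYNC_LIMIT = 100; // max update gap a peer sync can bridge (from the mail)

    // Returns the index of the node that ends up leader, trying candidates
    // in election order; a candidate yields when some peer is too far ahead
    // for peer sync to succeed.
    static int electLeader(int[] updateCounts, int[] electionOrder) {
        for (int candidate : electionOrder) {
            boolean syncOk = true;
            for (int peer = 0; peer < updateCounts.length; peer++) {
                if (peer == candidate) continue;
                // Peer sync fails if a peer is ahead by more than the limit.
                if (updateCounts[peer] - updateCounts[candidate] > PEER_SYNC_LIMIT) {
                    syncOk = false;
                    break;
                }
            }
            if (syncOk) return candidate; // candidate pulls missing updates and leads
            // otherwise the candidate gives up the leader role; next in line tries
        }
        return -1; // no viable leader found
    }

    public static void main(String[] args) {
        // Node 0 comes back first (first in line) but is 150 updates behind node 1,
        // so it yields and node 1 becomes leader.
        int[] updates = {1000, 1150};
        System.out.println("leader=" + electLeader(updates, new int[]{0, 1})); // leader=1
    }
}
```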

*Per Steffensen:*
*Ad 1)* Well, I knew that. I meant that the two Solrs were disconnected 
from ZK at the same time, but of course both got their connection 
reestablished - after session timeout (I believe (kinda hope) that a 
session timeout has to have happened before Solr needs to go into 
recovery after a ZK connection loss).

*Ad 2)* When the "behind" node has reconnected and become leader, and the 
one with the latest updates does not come back live right away, isn't the 
new leader (which is behind) allowed to start handling update requests? 
If yes, then it is possible that both shards have documents/updates 
that the other one doesn't, and it is possible to come up with scenarios 
where there is no good algorithm for generating the "correct" merged 
union of the data in both shards. So what to do when the other shard 
(which used to have a later version than the current leader) comes live?
*3)* I believe there is nothing solid to do!
How to avoid that? I was thinking about keeping the latest version for 
every slice in ZK, so that a "behind" shard will know whether it has the 
latest version of a slice, and therefore whether it is allowed to take the 
role as leader. Of course the writing of this "latest version" to ZK and 
the writing of the corresponding update to the leader's transaction-log would 
have to be atomic (like the A in ACID) as much as possible. And it would 
be nice if the writing of the update to the replica's transaction-log were also 
atomic with the leader-writing and the ZK-writing, in order to 
increase the chance that a replica is actually allowed to take over the 
leader role if the leader dies (or both die and the replica comes back 
first, and the "old" leader comes back minutes later). But all that is just 
an idea off the top of my head.
Do you already have a solution implemented, or a solution on the drawing 
board, or how do you/we prevent such a problem? As far as I understand 
"the drill" during leader-election/recovery (whether it's peer-sync or 
file-copy-replication) from the little code-reading I have done and from 
what you explain, there is no current solution. But I might be wrong?
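[Editor's note: the "latest version per slice in ZK" idea above might look roughly like the sketch below. Everything here is an assumption for illustration; `LatestVersionRegistry` is a hypothetical stand-in for a per-slice znode, with the atomic max-update playing the role of a conditional ZK write.]

```java
import java.util.concurrent.atomic.AtomicLong;

public class VersionGateSketch {
    // Hypothetical stand-in for a per-slice znode holding the highest
    // acknowledged update version for that slice.
    static class LatestVersionRegistry {
        private final AtomicLong latest = new AtomicLong(0);

        // Record a new version atomically; only ever advances, so a stale
        // writer cannot move the recorded value backwards.
        void recordUpdate(long version) {
            latest.accumulateAndGet(version, Math::max);
        }

        long latest() { return latest.get(); }
    }

    // A shard may take the leader role only if its local data is at least
    // as new as the version recorded for the slice.
    static boolean mayBecomeLeader(long localVersion, LatestVersionRegistry registry) {
        return localVersion >= registry.latest();
    }

    public static void main(String[] args) {
        LatestVersionRegistry registry = new LatestVersionRegistry();
        registry.recordUpdate(42); // leader writes update 42 and records it

        System.out.println(mayBecomeLeader(41, registry)); // false: behind, must not lead
        System.out.println(mayBecomeLeader(42, registry)); // true: has the latest data
    }
}
```

With such a gate, the "behind" node in the scenario above would refuse leadership and wait, instead of accepting updates that later conflict.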

*Mark Miller:*
*Ad 3)* Well, we can do some practical things right? I don't think we 
need to support a node coming back from the dead a year later and it had 
some updates the cluster doesn't have. A node coming up 2 minutes later 
is something we want to worry about though.
So basically we either need something timing-based or admin-command-based 
that lets you start a cold shard (slice :-) ): each node waits 
around for X amount of time, or until command X is received, and then 
leader election begins.
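[Editor's note: the timing-or-command gate Mark sketches could be condensed into a single predicate like the one below. All names and the structure are assumptions, not the actual Solr implementation.]

```java
public class ColdStartGateSketch {
    // Decide whether leader election may begin for a cold slice:
    // start when an admin says so, when all expected nodes are back,
    // or when we have waited long enough.
    static boolean electionMayStart(int liveNodes, int expectedNodes,
                                    long waitedMs, long maxWaitMs,
                                    boolean adminOverride) {
        if (adminOverride) return true;              // explicit admin command received
        if (liveNodes >= expectedNodes) return true; // everyone is back, minimal data risk
        return waitedMs >= maxWaitMs;                // timed out: elect from who is here
    }

    public static void main(String[] args) {
        System.out.println(electionMayStart(1, 2, 0, 30000, false));     // false: keep waiting
        System.out.println(electionMayStart(2, 2, 0, 30000, false));     // true: all nodes up
        System.out.println(electionMayStart(1, 2, 30000, 30000, false)); // true: wait expired
    }
}
```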

*Jan Høydahl:*
ElasticSearch has some settings to control when recovery starts after a 
cluster restart, see the Guide. This approach looks reasonable. If we know 
that we expect N nodes in our cluster, we can start recovery when we see 
N nodes up. If fewer than N nodes are up, we wait for X time (running on 
local data, not accepting new updates) before recovery and leader 
election start.
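[Editor's note: the ElasticSearch gateway settings Jan refers to look roughly like this in elasticsearch.yml; the setting names are from the ES gateway module of that era, the values are illustrative only.]

```yaml
# Hold off recovery until at least 3 nodes are up...
gateway.recover_after_nodes: 3
# ...or this much time has passed since the cluster formed...
gateway.recover_after_time: 5m
# ...and start immediately once all expected nodes have joined.
gateway.expected_nodes: 5
```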
