lucene-dev mailing list archives

From "Mark Miller (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-8069) Leader Initiated Recovery can put the replica with the latest data into LIR and a shard will have no leader even on restart.
Date Fri, 18 Sep 2015 13:25:04 GMT

    [ https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14866031#comment-14866031 ]

Mark Miller commented on SOLR-8069:
-----------------------------------

bq. thus I prefer the simple logic of "do this action only if our zookeeper session state
is exactly what it was when we decided to do it". Anyhow, this is probably beyond the scope
of this JIRA.

I don't see an easy way to do that in this case. Almost all the solutions that fit with the
code have the exact same holes / races. I think the local leader check around getting the
leader context is the strongest thing I can think of so far other than adding further defensive
checks.

I don't know that much more is needed though. If the context returned is from the leader,
great, its zkparentversion will match. If the context is somehow not the right one, it
won't match. We get a context, and only if it's the context for the leader in ZK do we do
anything, rather than acting just because the context has a node in line. I'd say that is a
pretty strong improvement.

This should only work if the node is a valid leader by its local state and by ZooKeeper.
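
To make the check described above concrete, here is a minimal in-memory sketch in Java. It is not the code from the attached patch: the class and method names (LeaderGuardSketch, ElectionContext, ZkLeaderState, mayPutReplicaInLir) are hypothetical, and ZooKeeper is modeled as a plain snapshot object rather than a live client. It only illustrates the idea that the node acts when it is the leader by its local state and its context's parent version matches what ZooKeeper reports for the leader.

```java
import java.util.Objects;

public class LeaderGuardSketch {

    /** Hypothetical stand-in for the election context a node holds locally. */
    static final class ElectionContext {
        final String coreNodeName;
        final int zkParentVersion; // parent znode version recorded when the context was registered

        ElectionContext(String coreNodeName, int zkParentVersion) {
            this.coreNodeName = Objects.requireNonNull(coreNodeName);
            this.zkParentVersion = zkParentVersion;
        }
    }

    /** Hypothetical snapshot of the leader as currently recorded in ZooKeeper. */
    static final class ZkLeaderState {
        final String leaderCoreNodeName;
        final int parentVersion;

        ZkLeaderState(String leaderCoreNodeName, int parentVersion) {
            this.leaderCoreNodeName = Objects.requireNonNull(leaderCoreNodeName);
            this.parentVersion = parentVersion;
        }
    }

    /**
     * Returns true only if the node is the leader by its local state AND the
     * context it holds matches the leader recorded in ZooKeeper, including the
     * parent version. Any mismatch means "do nothing".
     */
    static boolean mayPutReplicaInLir(boolean leaderByLocalState,
                                      ElectionContext ctx,
                                      ZkLeaderState zk) {
        if (!leaderByLocalState || ctx == null || zk == null) {
            return false;
        }
        return ctx.coreNodeName.equals(zk.leaderCoreNodeName)
                && ctx.zkParentVersion == zk.parentVersion;
    }

    public static void main(String[] args) {
        ElectionContext ctx = new ElectionContext("core_node1", 7);
        // Context matches the leader in ZK: the action is allowed.
        System.out.println(mayPutReplicaInLir(true, ctx, new ZkLeaderState("core_node1", 7)));
        // Parent version moved on (a new election happened): the action is refused.
        System.out.println(mayPutReplicaInLir(true, ctx, new ZkLeaderState("core_node1", 8)));
        // Node no longer believes it is leader locally: the action is refused.
        System.out.println(mayPutReplicaInLir(false, ctx, new ZkLeaderState("core_node1", 7)));
    }
}
```

Note that this sketch compares against a snapshot, so it still leaves a window between the check and the write. Closing that window would require the comparison to be atomic with the write on the ZooKeeper side, which the sketch does not attempt.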

> Leader Initiated Recovery can put the replica with the latest data into LIR and a shard will have no leader even on restart.
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8069
>                 URL: https://issues.apache.org/jira/browse/SOLR-8069
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Mark Miller
>         Attachments: SOLR-8069.patch, SOLR-8069.patch
>
>
> I've seen this twice now. Need to work on a test.
> When some issues hit all the replicas at once, you can end up in a situation where the rightful leader was put or put itself into LIR. Even on restart, this rightful leader won't take leadership and you have to manually clear the LIR nodes.
> It seems that if all the replicas participate in election on startup, LIR should just be cleared.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


