lucene-dev mailing list archives

From "Ishan Chattopadhyaya (JIRA)" <>
Subject [jira] [Updated] (SOLR-7989) Down replica elected leader, stays down after successful election
Date Wed, 11 Nov 2015 05:48:11 GMT


Ishan Chattopadhyaya updated SOLR-7989:
    Attachment: SOLR-7569.patch

bq. Shouldn't we just publish active regardless?
That's what I wanted to do in my initial patch. However, after Noble's comment asking for the
check, I thought it would save one overseer message and be more efficient.

bq. Why do we use the stale clusterstate to see if we are already active and prevent publishing
active if we are not?
What do you think we should do? Do you suggest that we (1) force an update of the cluster
state before the check, so that we don't check against a stale clusterstate, or (2) send the
active state message regardless?

I am attaching the patch for (1); it required a change to LeaderElectionTest.

Doing (2) would require a change to OverseerTest.testOverseerStatsReset (SOLR-8249), and
I don't currently know how to make that test work if the STATE=ACTIVE message is sent regardless.
If that's the way you suggest we go, I could put up a patch that sends the message
without a state check and disables the test for now.
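
To make the difference between the current check, (1), and (2) concrete, here is a toy sketch. It is not Solr code: every class and method name below is hypothetical. It also assumes the check skips the message whenever the locally cached view already reports ACTIVE, which is how I read the "reduce one overseer message" intent.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these types exist in Solr.
public class PublishActiveSketch {
    enum State { DOWN, ACTIVE }

    /** Stand-in for the authoritative state plus the overseer's message queue. */
    static class Cluster {
        State authoritative = State.DOWN;
        final List<String> overseerQueue = new ArrayList<>();
    }

    /** Stand-in for a replica holding a possibly stale cached cluster state. */
    static class Node {
        State cachedView;
        Node(State cachedView) { this.cachedView = cachedView; }
        void refresh(Cluster c) { cachedView = c.authoritative; }
    }

    /** Assumed current behaviour: trust the cached view, skip if it says ACTIVE. */
    static boolean publishIfCachedNotActive(Node n, Cluster c) {
        if (n.cachedView == State.ACTIVE) {
            return false; // message skipped; wrong if the cached view is stale
        }
        c.overseerQueue.add("STATE=ACTIVE");
        return true;
    }

    /** Option (1): force a refresh first, then apply the same check. */
    static boolean refreshThenPublish(Node n, Cluster c) {
        n.refresh(c);
        return publishIfCachedNotActive(n, c);
    }

    /** Option (2): always send, at the cost of a possibly redundant message. */
    static boolean publishRegardless(Node n, Cluster c) {
        c.overseerQueue.add("STATE=ACTIVE");
        return true;
    }

    public static void main(String[] args) {
        // The replica believes it is ACTIVE, but the authoritative state is DOWN.
        Cluster c = new Cluster();
        System.out.println("stale check sent:  " + publishIfCachedNotActive(new Node(State.ACTIVE), c));
        System.out.println("option (1) sent:   " + refreshThenPublish(new Node(State.ACTIVE), c));
        System.out.println("option (2) sent:   " + publishRegardless(new Node(State.ACTIVE), c));
    }
}
```

In this model, only the stale check can leave a down replica without a STATE=ACTIVE message. (1) pays for a refresh, and (2) pays for an extra message when the replica really is active already.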

> Down replica elected leader, stays down after successful election
> -----------------------------------------------------------------
>                 Key: SOLR-7989
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Ishan Chattopadhyaya
>            Assignee: Noble Paul
>             Fix For: 5.4, Trunk
>         Attachments:,, SOLR-7569.patch, SOLR-7989.patch,
SOLR-7989.patch, SOLR-7989.patch, SOLR-7989.patch, SOLR-8233.patch
> It is possible that a down replica gets elected as a leader, and that it stays down after
the election.
> Here's how I hit upon this:
> * There are 3 replicas: leader, notleader0, notleader1
> * Introduced network partition to isolate notleader0, notleader1 from leader (leader
puts these two in LIR via zk).
> * Kill leader, remove partition. Now leader is dead, and both of notleader0 and notleader1
are down. There is no leader.
> * Remove LIR znodes in zk.
> * Wait a while, and there happens a (flawed?) leader election.
> * Finally, the state is such that one of notleader0 or notleader1 (which were down before)
becomes leader, but stays down.
