lucene-dev mailing list archives

From "Shalin Shekhar Mangar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-7021) Leader will not publish core as active without recovering first, but never recovers
Date Fri, 23 Jan 2015 09:46:35 GMT

    [ https://issues.apache.org/jira/browse/SOLR-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289025#comment-14289025 ]

Shalin Shekhar Mangar commented on SOLR-7021:
---------------------------------------------

James, this sounds suspiciously similar to SOLR-6530, which was fixed in 4.10.2. The root cause
is that some node marks a leader node as down via the leader-initiated-recovery logic because
a commit couldn't be sent to it.
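
To make the failure mode concrete, here is a toy model of the interaction described above. It is a sketch under assumptions, not Solr code: the class, the `Registry` type, and every method name are hypothetical stand-ins, and the in-memory map only imitates state that a real cluster keeps in ZooKeeper. It shows why a leader that skips recovery ("I am the leader, no recovery necessary") can never get past a "down" marker left by another node.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model, not Solr code: every name in this file is hypothetical. */
class LirDeadlockSketch {

    enum State { ACTIVE, DOWN, RECOVERING }

    /** Stand-in for the shared coordination store (ZooKeeper in a real cluster). */
    static class Registry {
        // core name -> marker written by another node (leader-initiated recovery)
        private final Map<String, State> lirMarkers = new HashMap<>();
        // core name -> last published state
        private final Map<String, State> published = new HashMap<>();

        void markDownByPeer(String core) { lirMarkers.put(core, State.DOWN); }

        void clearMarker(String core) { lirMarkers.remove(core); }

        State publishedState(String core) { return published.getOrDefault(core, State.DOWN); }

        /** Refuses ACTIVE while a peer's "down" marker is still present. */
        void publish(String core, State requested) {
            if (requested == State.ACTIVE && lirMarkers.get(core) == State.DOWN) {
                published.put(core, State.DOWN);
                throw new IllegalStateException(
                    "Cannot publish state of core '" + core + "' as active without recovering first!");
            }
            published.put(core, requested);
        }
    }

    /** A leader skips recovery, so it goes straight to publishing ACTIVE. */
    static boolean registerAsLeader(Registry registry, String core) {
        try {
            registry.publish(core, State.ACTIVE);
            return true;
        } catch (IllegalStateException e) {
            return false; // nothing in this path ever triggers a recovery
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.markDownByPeer("xyzcore"); // a peer failed to send a commit to the leader

        // Restarting does not help: the marker outlives the process.
        for (int restart = 1; restart <= 3; restart++) {
            System.out.println("restart " + restart + ": active="
                + registerAsLeader(registry, "xyzcore"));
        }

        // In this toy model only removing the marker unblocks the leader.
        registry.clearMarker("xyzcore");
        System.out.println("after clearing marker: active="
            + registerAsLeader(registry, "xyzcore"));
    }
}
```

In the sketch the guard and the skipped recovery are each reasonable on their own; the deadlock only appears when the node that is marked down is also the leader, which matches the one-replica-per-shard setup in the report.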

> Leader will not publish core as active without recovering first, but never recovers
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-7021
>                 URL: https://issues.apache.org/jira/browse/SOLR-7021
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10
>            Reporter: James Hardwick
>            Priority: Critical
>              Labels: recovery, solrcloud, zookeeper
>
> A little background: a single-core SolrCloud cluster across 3 nodes, each node with its own
> shard and each shard with a single replica, hence each replica is itself a leader.
> For reasons we won't get into, we witnessed a shard go down in our cluster. We restarted
> the cluster but our core/shards still did not come back up. After inspecting the logs, we
> found this:
> {code}
> 2015-01-21 15:51:56,494 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - We are http://xxx.xxx.xxx.35:8081/solr/xyzcore/ and leader is http://xxx.xxx.xxx.35:8081/solr/xyzcore/
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - No LogReplay needed for core=xyzcore baseURL=http://xxx.xxx.xxx.35:8081/solr
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - I am the leader, no recovery necessary
> 2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - publishing core=xyzcore state=active collection=xyzcore
> 2015-01-21 15:51:56,497 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - publishing core=xyzcore state=down collection=xyzcore
> 2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - numShards not found on descriptor - reading it from system property
> 2015-01-21 15:51:56,501 [coreZkRegister-1-thread-2] ERROR core.ZkContainer  - :org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore' as active without recovering first!
> 	at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
> {code}
> And at this point the necessary shards never recover correctly, and hence our core never
> returns to a functional state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

