lucene-dev mailing list archives

From "Werner Maier (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3993) SolrCloud leader election on single node stucks the initialization
Date Fri, 16 Nov 2012 14:50:14 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498840#comment-13498840 ]

Werner Maier commented on SOLR-3993:
------------------------------------

Thanks Mark, this works better.
Tested nightly 2012-11-16 on a 1-shard, 3-node cluster (real hardware, each core on a different
server, Jetty container).

I can still reproduce a problem:
1) Simulate a power outage on the cluster (kill -9 on all servers), leaving the ZooKeeper
ensemble running.
2) Start only ONE core that was NOT the former leader.

Result: the core loops in "Running recovery...".

3) Restart that core (clean shutdown with kill instead of kill -9).

Result: all is fine.
(The node connects to ZooKeeper, registers as leader, runs all changes, cleans the ZooKeeper
queue, and finally waits for the other replicas to come up. After that timeout it declares
itself leader and all is fine.)

Second problem (might be the same, or at least similar):
1) Setup: 1 shard, 3 nodes, ZooKeeper ensemble on three servers. Node 1 is leader.
2) This time, killall -9 java (shuts down the ZooKeeper ensemble and the Solr cores, simulating
a power outage on all three servers).
3) Start the Solr cores on servers 2 and 3 (neither of which was the leader). They try to
connect to ZooKeeper but can't.
4) Start ZooKeeper on servers 2 and 3 (server 1 stays down, which simulates a hardware failure
of server #1).

Sometimes both cores loop in recovery.
Sometimes a core stays stuck in "shutting down": "INFO: Client->ZooKeeper status change
trigger but we are already closed".

Restarting the cores helps every time.

Kind regards.



                
> SolrCloud leader election on single node stucks the initialization
> ------------------------------------------------------------------
>
>                 Key: SOLR-3993
>                 URL: https://issues.apache.org/jira/browse/SOLR-3993
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.0
>         Environment: Windows 7, Tomcat 6
>            Reporter: Alexey Kudinov
>            Assignee: Mark Miller
>             Fix For: 4.1, 5.0
>
>
>  setup:
> 1 node, 4 cores, 2 shards.
> 15 documents indexed.
> problem:
> init stage times out.
> probable cause:
> According to the init flow, cores are initialized one by one synchronously.
> Actually, the main thread waits in ShardLeaderElectionContext.waitForReplicasToComeUp until
> the retry threshold is reached, while the replica cores are not yet initialized. In other
> words, there is no chance for the other replicas to come up in the meantime.
> stack trace:
> Thread [main] (Suspended)
>         owns: HashMap<K,V>  (id=3876)
>         owns: StandardContext  (id=3877)
>         owns: HashMap<K,V>  (id=3878)
>         owns: StandardHost  (id=3879)
>         owns: StandardEngine  (id=3880)
>         owns: Service[]  (id=3881)
>         Thread.sleep(long) line: not available [native method]
>         ShardLeaderElectionContext.waitForReplicasToComeUp(boolean, String) line: 298
>         ShardLeaderElectionContext.runLeaderProcess(boolean) line: 143
>         LeaderElector.runIamLeaderProcess(ElectionContext, boolean) line: 152
>         LeaderElector.checkIfIamLeader(int, ElectionContext, boolean) line: 96
>         LeaderElector.joinElection(ElectionContext) line: 262
>         ZkController.joinElection(CoreDescriptor, boolean) line: 733
>         ZkController.register(String, CoreDescriptor, boolean, boolean) line: 566
>         ZkController.register(String, CoreDescriptor) line: 532
>         CoreContainer.registerInZk(SolrCore) line: 709
>         CoreContainer.register(String, SolrCore, boolean) line: 693
>         CoreContainer.load(String, InputSource) line: 535
>         CoreContainer.load(String, File) line: 356
>         CoreContainer$Initializer.initialize() line: 308
>         SolrDispatchFilter.init(FilterConfig) line: 107
>         ApplicationFilterConfig.getFilter() line: 295
>         ApplicationFilterConfig.setFilterDef(FilterDef) line: 422
>         ApplicationFilterConfig.<init>(Context, FilterDef) line: 115
>         StandardContext.filterStart() line: 4072
>         StandardContext.start() line: 4726
>         StandardHost(ContainerBase).addChildInternal(Container) line: 799
>         StandardHost(ContainerBase).addChild(Container) line: 779
>         StandardHost.addChild(Container) line: 601
>         HostConfig.deployDescriptor(String, File, String) line: 675
>         HostConfig.deployDescriptors(File, String[]) line: 601
>         HostConfig.deployApps() line: 502
>         HostConfig.start() line: 1317
>         HostConfig.lifecycleEvent(LifecycleEvent) line: 324
>         LifecycleSupport.fireLifecycleEvent(String, Object) line: 142
>         StandardHost(ContainerBase).start() line: 1065
>         StandardHost.start() line: 840
>         StandardEngine(ContainerBase).start() line: 1057
>         StandardEngine.start() line: 463
>         StandardService.start() line: 525
>         StandardServer.start() line: 754
>         Catalina.start() line: 595
>         NativeMethodAccessorImpl.invoke0(Method, Object, Object[]) line: not available [native method]
>         NativeMethodAccessorImpl.invoke(Object, Object[]) line: not available
>         DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: not available
>         Method.invoke(Object, Object...) line: not available
>         Bootstrap.start() line: 289
>         Bootstrap.main(String[]) line: 414
>        
> After a while, the session times out and the following exception appears:
> Oct 25, 2012 1:16:56 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
> INFO: Waiting until we see more replicas up: total=2 found=0 timeoutin=-95
> Oct 25, 2012 1:16:56 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
> INFO: Was waiting for replicas to come up, but they are taking too long - assuming they won't come back till later
> Oct 25, 2012 1:16:56 PM org.apache.solr.common.SolrException log
> SEVERE: Errir checking for the number of election participants:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/collection1/leader_elect/shard2/election
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1249)
>         at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:227)
>         at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:224)
>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
>         at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:224)
>         at org.apache.solr.cloud.ShardLeaderElectionContext.waitForReplicasToComeUp(ElectionContext.java:276)
>         at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:143)
>         at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:152)
>         at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:96)
>         at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:262)
>         at org.apache.solr.cloud.ZkController.joinElection(ZkController.java:733)
>         at org.apache.solr.cloud.ZkController.register(ZkController.java:566)
>         at org.apache.solr.cloud.ZkController.register(ZkController.java:532)
>         at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:709)
>         at org.apache.solr.core.CoreContainer.register(CoreContainer.java:693)
>         at org.apache.solr.core.CoreContainer.load(CoreContainer.java:535)
>         at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
>         at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308)
>         at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107)
>         at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
>         at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
>         at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:115)
>         at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4072)
>         at org.apache.catalina.core.StandardContext.start(StandardContext.java:4726)
>         at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
>         at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
>         at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601)
>         at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:675)
>         at org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:601)
>         at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:502)
>         at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1317)
>         at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:324)
>         at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
>         at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1065)
>         at org.apache.catalina.core.StandardHost.start(StandardHost.java:840)
>         at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1057)
>         at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:463)
>         at org.apache.catalina.core.StandardService.start(StandardService.java:525)
>         at org.apache.catalina.core.StandardServer.start(StandardServer.java:754)
>         at org.apache.catalina.startup.Catalina.start(Catalina.java:595)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>         at java.lang.reflect.Method.invoke(Unknown Source)
>         at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
>         at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
> Followed by:
> Oct 25, 2012 1:17:27 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> SEVERE: Recovery failed - trying again... core=collection1
> Oct 25, 2012 1:18:32 PM org.apache.solr.common.SolrException log
> SEVERE: Error while trying to recover. core=collection1
> Oct 25, 2012 1:18:32 PM org.apache.solr.common.SolrException log
> SEVERE: Error while trying to recover. core=collection1:org.apache.solr.common.SolrException: No registered leader was found, collection:collection1 slice:shard1
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderProps(ZkStateReader.java:413)
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderProps(ZkStateReader.java:399)
>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:318)
>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:220)
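
To illustrate the probable cause described in the quoted issue (cores registered one by one on
the main thread, each blocking in a wait for the other replicas), here is a minimal Java sketch.
It is a toy model and not Solr code: the class SequentialInitSketch, its methods, and the
timeout values are hypothetical names chosen for illustration. The parallel variant is shown
only for contrast and is not claimed to be the fix applied in SOLR-3993.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model (assumption): a shared counter stands in for the replicas seen in ZooKeeper.
public class SequentialInitSketch {

    // Poll until `expected` replicas have registered or the timeout elapses.
    static boolean waitForReplicas(AtomicInteger registered, int expected, long timeoutMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (registered.get() < expected) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up waiting for the other replicas
            }
            Thread.sleep(10);
        }
        return true;
    }

    // One thread registers the cores in order; each registration blocks in the wait,
    // so the next core cannot register until the previous wait has timed out.
    static int sequentialTimeouts(int cores, long timeoutMs) throws InterruptedException {
        AtomicInteger registered = new AtomicInteger();
        int timeouts = 0;
        for (int i = 0; i < cores; i++) {
            registered.incrementAndGet();
            if (!waitForReplicas(registered, cores, timeoutMs)) {
                timeouts++;
            }
        }
        return timeouts;
    }

    // Each core registers on its own thread, so the waits can observe each other.
    static int parallelTimeouts(int cores, long timeoutMs) throws Exception {
        AtomicInteger registered = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < cores; i++) {
                results.add(pool.submit(() -> {
                    registered.incrementAndGet();
                    return waitForReplicas(registered, cores, timeoutMs);
                }));
            }
            int timeouts = 0;
            for (Future<Boolean> result : results) {
                if (!result.get()) {
                    timeouts++;
                }
            }
            return timeouts;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("sequential timeouts: " + sequentialTimeouts(2, 200));
        System.out.println("parallel timeouts:   " + parallelTimeouts(2, 2000));
    }
}
```

In the sequential variant every core except the last one has to sit out its full timeout,
because the cores it is waiting for are queued behind it on the same thread. That matches the
"found=0" wait on the main thread in the stack trace above, under the assumptions of this model.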

