Date: Thu, 1 Nov 2012 10:37:14 +0000 (UTC)
From: "Markus Jelsma (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Message-ID: <556842652.55381.1351766234418.JavaMail.jiratomcat@arcas>
In-Reply-To: <2039244269.27759.1351178711941.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (SOLR-3993) SolrCloud leader election on single node stucks the initialization

    [ https://issues.apache.org/jira/browse/SOLR-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488596#comment-13488596 ]

Markus Jelsma commented on SOLR-3993:
-------------------------------------

We're seeing this too, using a current trunk on a 10-node test cluster. We can trigger this state by restarting all servlet containers sequentially or at roughly the same time. Most nodes then get stuck in this state, while others sometimes change to a different state and throw all kinds of exceptions (see list).
We can only get out of this state by restarting some servlet containers again. It does _not_ happen at all when ZooKeeper's data directory has been wiped clean; the cluster then starts very nicely.

> SolrCloud leader election on single node stucks the initialization
> ------------------------------------------------------------------
>
> Key: SOLR-3993
> URL: https://issues.apache.org/jira/browse/SOLR-3993
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.0
> Environment: Windows 7, Tomcat 6
> Reporter: Alexey Kudinov
>
> setup:
> 1 node, 4 cores, 2 shards.
> 15 documents indexed.
> problem:
> init stage times out.
> probable cause:
> According to the init flow, cores are initialized one by one, synchronously.
> In practice, the main thread waits in ShardLeaderElectionContext.waitForReplicasToComeUp until the retry threshold is reached, while the replica cores are not yet initialized. In other words, there is no chance for the other replicas to come up in the meantime.
> stack trace:
> Thread [main] (Suspended)
>     owns: HashMap (id=3876)
>     owns: StandardContext (id=3877)
>     owns: HashMap (id=3878)
>     owns: StandardHost (id=3879)
>     owns: StandardEngine (id=3880)
>     owns: Service[] (id=3881)
>     Thread.sleep(long) line: not available [native method]
>     ShardLeaderElectionContext.waitForReplicasToComeUp(boolean, String) line: 298
>     ShardLeaderElectionContext.runLeaderProcess(boolean) line: 143
>     LeaderElector.runIamLeaderProcess(ElectionContext, boolean) line: 152
>     LeaderElector.checkIfIamLeader(int, ElectionContext, boolean) line: 96
>     LeaderElector.joinElection(ElectionContext) line: 262
>     ZkController.joinElection(CoreDescriptor, boolean) line: 733
>     ZkController.register(String, CoreDescriptor, boolean, boolean) line: 566
>     ZkController.register(String, CoreDescriptor) line: 532
>     CoreContainer.registerInZk(SolrCore) line: 709
>     CoreContainer.register(String, SolrCore, boolean) line: 693
>     CoreContainer.load(String, InputSource) line: 535
>     CoreContainer.load(String, File) line: 356
>     CoreContainer$Initializer.initialize() line: 308
>     SolrDispatchFilter.init(FilterConfig) line: 107
>     ApplicationFilterConfig.getFilter() line: 295
>     ApplicationFilterConfig.setFilterDef(FilterDef) line: 422
>     ApplicationFilterConfig.<init>(Context, FilterDef) line: 115
>     StandardContext.filterStart() line: 4072
>     StandardContext.start() line: 4726
>     StandardHost(ContainerBase).addChildInternal(Container) line: 799
>     StandardHost(ContainerBase).addChild(Container) line: 779
>     StandardHost.addChild(Container) line: 601
>     HostConfig.deployDescriptor(String, File, String) line: 675
>     HostConfig.deployDescriptors(File, String[]) line: 601
>     HostConfig.deployApps() line: 502
>     HostConfig.start() line: 1317
>     HostConfig.lifecycleEvent(LifecycleEvent) line: 324
>     LifecycleSupport.fireLifecycleEvent(String, Object) line: 142
>     StandardHost(ContainerBase).start() line: 1065
>     StandardHost.start() line: 840
>     StandardEngine(ContainerBase).start() line: 1057
>     StandardEngine.start() line: 463
>     StandardService.start() line: 525
>     StandardServer.start() line: 754
>     Catalina.start() line: 595
>     NativeMethodAccessorImpl.invoke0(Method, Object, Object[]) line: not available [native method]
>     NativeMethodAccessorImpl.invoke(Object, Object[]) line: not available
>     DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: not available
>     Method.invoke(Object, Object...) line: not available
>     Bootstrap.start() line: 289
>     Bootstrap.main(String[]) line: 414
>
> After a while, the session times out and the following exception appears:
> Oct 25, 2012 1:16:56 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
> INFO: Waiting until we see more replicas up: total=2 found=0 timeoutin=-95
> Oct 25, 2012 1:16:56 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
> INFO: Was waiting for replicas to come up, but they are taking too long - assuming they won't come back till later
> Oct 25, 2012 1:16:56 PM org.apache.solr.common.SolrException log
> SEVERE: Errir checking for the number of election participants:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/collection1/leader_elect/shard2/election
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1249)
>     at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:227)
>     at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:224)
>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
>     at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:224)
>     at org.apache.solr.cloud.ShardLeaderElectionContext.waitForReplicasToComeUp(ElectionContext.java:276)
>     at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:143)
>     at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:152)
>     at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:96)
>     at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:262)
>     at org.apache.solr.cloud.ZkController.joinElection(ZkController.java:733)
>     at org.apache.solr.cloud.ZkController.register(ZkController.java:566)
>     at org.apache.solr.cloud.ZkController.register(ZkController.java:532)
>     at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:709)
>     at org.apache.solr.core.CoreContainer.register(CoreContainer.java:693)
>     at org.apache.solr.core.CoreContainer.load(CoreContainer.java:535)
>     at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
>     at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308)
>     at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107)
>     at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
>     at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
>     at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:115)
>     at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4072)
>     at org.apache.catalina.core.StandardContext.start(StandardContext.java:4726)
>     at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
>     at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
>     at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601)
>     at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:675)
>     at org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:601)
>     at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:502)
>     at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1317)
>     at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:324)
>     at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
>     at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1065)
>     at org.apache.catalina.core.StandardHost.start(StandardHost.java:840)
>     at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1057)
>     at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:463)
>     at org.apache.catalina.core.StandardService.start(StandardService.java:525)
>     at org.apache.catalina.core.StandardServer.start(StandardServer.java:754)
>     at org.apache.catalina.startup.Catalina.start(Catalina.java:595)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>     at java.lang.reflect.Method.invoke(Unknown Source)
>     at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
>     at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
> Followed by:
> Oct 25, 2012 1:17:27 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> SEVERE: Recovery failed - trying again... core=collection1
> Oct 25, 2012 1:18:32 PM org.apache.solr.common.SolrException log
> SEVERE: Error while trying to recover. core=collection1
> Oct 25, 2012 1:18:32 PM org.apache.solr.common.SolrException log
> SEVERE: Error while trying to recover. core=collection1:org.apache.solr.common.SolrException: No registered leader was found, collection:collection1 slice:shard1
>     at org.apache.solr.common.cloud.ZkStateReader.getLeaderProps(ZkStateReader.java:413)
>     at org.apache.solr.common.cloud.ZkStateReader.getLeaderProps(ZkStateReader.java:399)
>     at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:318)
>     at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:220)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
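
The probable cause quoted in the report (cores on one node are registered one at a time, and the first core of each shard waits for replicas that cannot start until it returns) can be illustrated with a small simulation. This is only a rough model, not Solr's implementation: FakeCluster, wait_for_replicas, the two start functions, the core names and their ordering, and the 30-second timeout are all hypothetical and chosen purely for illustration. The clock is simulated, so nothing actually sleeps.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """Hypothetical stand-in for the election node in ZooKeeper plus a simulated clock."""
    now: int = 0  # simulated seconds since startup
    participants: dict = field(default_factory=dict)  # shard name -> set of core names

    def join(self, shard, core):
        self.participants.setdefault(shard, set()).add(core)


def wait_for_replicas(cluster, shard, expected, timeout, poll=1):
    """Poll until `expected` participants are visible or `timeout` seconds pass.

    Returns the number of simulated seconds spent waiting.
    """
    start = cluster.now
    while len(cluster.participants.get(shard, ())) < expected:
        if cluster.now - start >= timeout:
            break  # give up, as in "they are taking too long"
        cluster.now += poll  # simulated sleep; no other core can join meanwhile
    return cluster.now - start


def start_sequentially(cores, replicas_per_shard, timeout):
    """Each core joins and then blocks the single startup thread while it waits."""
    cluster = FakeCluster()
    waited = {}
    for core, shard in cores:
        cluster.join(shard, core)
        waited[core] = wait_for_replicas(cluster, shard, replicas_per_shard, timeout)
    return waited


def start_after_all_joined(cores, replicas_per_shard, timeout):
    """Alternative ordering: every core joins first, and only then does anyone wait."""
    cluster = FakeCluster()
    for core, shard in cores:
        cluster.join(shard, core)
    return {core: wait_for_replicas(cluster, shard, replicas_per_shard, timeout)
            for core, shard in cores}


# 1 node, 4 cores, 2 shards, as in the report; the ordering is an assumption.
CORES = [("core1", "shard1"), ("core2", "shard2"),
         ("core3", "shard1"), ("core4", "shard2")]

if __name__ == "__main__":
    print("sequential:      ", start_sequentially(CORES, 2, timeout=30))
    print("all joined first:", start_after_all_joined(CORES, 2, timeout=30))
```

In this model the first core of each shard always burns the full timeout under sequential startup, because its replica is queued behind it on the same thread. A long enough stall of that kind could plausibly outlast the ZooKeeper session, which would be consistent with the SessionExpiredException in the quoted log, though the model does not attempt to reproduce that part.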