From: Mark Miller <markrmiller@gmail.com>
Date: Mon, 09 Mar 2015 01:40:26 +0000
Subject: Re: 4.10.4 - nodes up, shard without leader
To: solr-user@lucene.apache.org

Interesting bug. First there is the already closed transaction log. That by
itself deserves a look. I'm not even positive we should be replaying the log
when reconnecting after a ZK disconnect, but even if we do, this should never
happen.

Beyond that there seems to be some race. Because of the log trouble, we try to
cancel the election - but for some reason we don't find the ephemeral election
node yet, and so just assume it's fine, no node there to remove (well, we WARN,
because it is a little unexpected). Then that ephemeral node materializes, I
guess, and the new leader doesn't register because the old leader won't give up
the throne. We don't try to force the new leader, because that may just hide
bugs and cause data loss - so no leader is elected.

I'd guess there are two JIRA issues to resolve here.
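Roughly, the sequence I mean - as a minimal sketch against a plain ZooKeeper
client, with illustrative class and method names rather than Solr's actual
internals:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ElectionRaceSketch {

    // Step 1: on reconnect we try to cancel the election by deleting our
    // ephemeral election node. If it isn't visible yet, we only WARN and
    // move on ("cancelElection did not find election node to remove").
    static void cancelElection(ZooKeeper zk, String electionNode)
            throws InterruptedException, KeeperException {
        try {
            zk.delete(electionNode, -1);
        } catch (KeeperException.NoNodeException e) {
            System.err.println("WARN: did not find election node to remove: "
                    + electionNode);
        }
    }

    // Step 2: the stale ephemeral node materializes (or survives), so the
    // next candidate's attempt to create the leader registration node fails.
    static void registerAsLeader(ZooKeeper zk, String leaderPath)
            throws InterruptedException, KeeperException {
        zk.create(leaderPath, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL); // throws NodeExistsException here
    }
}

If step 1 misses the node and step 2 later collides with it, you get exactly
the WARN followed by the NodeExists failure in the logs below.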
- Mark

On Sun, Mar 8, 2015 at 8:37 AM Markus Jelsma wrote:

> Hello - I stumbled upon an issue I've never seen before: a shard with all
> nodes up and running, but no leader. This is on 4.10.4. One of the two
> nodes emits the following error log entries:
>
> 2015-03-08 05:25:49,095 WARN [solr.cloud.ElectionContext] - [Thread-136] -
> : cancelElection did not find election node to remove
> /overseer_elect/election/93434598784958483-178.21.116.225:8080_solr-n_0000000246
> 2015-03-08 05:25:49,121 WARN [solr.cloud.ElectionContext] - [Thread-136] -
> : cancelElection did not find election node to remove
> /collections/oi/leader_elect/shard3/election/93434598784958483-178.21.116.225:8080_solr_oi_h-n_0000000043
> 2015-03-08 05:25:49,220 ERROR [solr.update.UpdateLog] - [Thread-136] - :
> Error inspecting tlog tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0000000000000001394 refcount=2}
> java.nio.channels.ClosedChannelException
>         at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99)
>         at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:679)
>         at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:784)
>         at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
>         at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125)
>         at java.io.InputStream.read(InputStream.java:101)
>         at org.apache.solr.update.TransactionLog.endsWithCommit(TransactionLog.java:218)
>         at org.apache.solr.update.UpdateLog.recoverFromLog(UpdateLog.java:800)
>         at org.apache.solr.cloud.ZkController.register(ZkController.java:841)
>         at org.apache.solr.cloud.ZkController$1.command(ZkController.java:277)
>         at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
> 2015-03-08 05:25:49,225 ERROR [solr.update.UpdateLog] - [Thread-136] - :
> Error inspecting tlog tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0000000000000001471 refcount=2}
> java.nio.channels.ClosedChannelException
>         [identical stack trace to the one above]
> 2015-03-08 12:21:04,438 WARN [solr.cloud.RecoveryStrategy] -
> [zkCallback-2-thread-28] - : Stopping recovery for core=oi_h
> coreNodeName=178.21.116.225:8080_solr_oi_h
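The ClosedChannelException above is just what FileChannel gives you when the
log was closed before replay tried to read it. This standalone sketch fails
the same way (not Solr code; the tlog path is reused from the log purely for
flavor, any readable file works):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ClosedTlogSketch {
    public static void main(String[] args) throws IOException {
        FileChannel ch = FileChannel.open(
                Paths.get("/opt/solr/cores/oi_c/data/tlog/tlog.0000000000000001394"),
                StandardOpenOption.READ);
        ch.close(); // something else closed the log before replay started
        ch.read(ByteBuffer.allocate(4096)); // throws ClosedChannelException
    }
}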
> The other node makes a mess in the logs:
>
> 2015-03-08 05:25:46,020 WARN [solr.cloud.RecoveryStrategy] -
> [zkCallback-2-thread-20] - : Stopping recovery for core=oi_c
> coreNodeName=194.145.201.190:8080_solr_oi_c
> 2015-03-08 05:26:08,670 ERROR [solr.cloud.ShardLeaderElectionContext] -
> [zkCallback-2-thread-19] - : There was a problem trying to register as the
> leader:org.apache.solr.common.SolrException: Could not register as the
> leader because creating the ephemeral registration node in ZooKeeper failed
>         at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:146)
>         at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:317)
>         at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
>         at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
>         at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
>         at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358)
>         at org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:210)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.solr.common.SolrException:
> org.apache.zookeeper.KeeperException$NodeExistsException:
> KeeperErrorCode = NodeExists for /collections/openindex/leaders/shard3
>         at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:40)
>         at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:134)
>         ... 11 more
> Caused by: org.apache.zookeeper.KeeperException$NodeExistsException:
> KeeperErrorCode = NodeExists for /collections/oi/leaders/shard3
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>         at org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:463)
>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:74)
>         at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:460)
>         at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:417)
>         at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:404)
>         at org.apache.solr.cloud.ShardLeaderElectionContextBase$1.execute(ElectionContext.java:138)
>         at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:34)
>         ... 12 more
>
> 2015-03-08 05:26:08,682 WARN [solr.cloud.ElectionContext] -
> [zkCallback-2-thread-19] - : cancelElection did not find election node to
> remove /collections/openindex/leader_elect/shard3/election/93434598784958480-194.145.201.190:8080_solr_oi_c-n_0000000044
> 2015-03-08 05:26:12,692 ERROR [solr.cloud.RecoveryStrategy] -
> [RecoveryThread] - : Error while trying to recover.
> core=oi_c:org.apache.solr.common.SolrException: No registered leader was
> found after waiting for 4000ms, collection: oi slice: shard3
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:568)
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:551)
>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:330)
>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235)
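For context, getLeaderRetry is essentially a bounded poll for the leader
znode. A rough sketch of why it gives up after 4000ms (illustrative only, not
ZkStateReader's actual implementation):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class LeaderWaitSketch {
    // Polls the leader znode until it appears or the timeout elapses,
    // mirroring "No registered leader was found after waiting for 4000ms".
    static byte[] getLeaderRetry(ZooKeeper zk, String leaderPath, long timeoutMs)
            throws KeeperException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (zk.exists(leaderPath, false) != null) {
                return zk.getData(leaderPath, false, null);
            }
            Thread.sleep(250); // back off between polls
        }
        throw new IllegalStateException("No registered leader was found after waiting for " + timeoutMs + "ms");
    }
}

Since no node ever wins the election here, every recovery attempt exhausts
this wait and fails, which is the loop in the next excerpt.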
> ... and then a continuous mess like the below:
>
> 2015-03-08 05:26:12,692 ERROR [solr.cloud.RecoveryStrategy] -
> [RecoveryThread] - : Recovery failed - trying again... (0) core=oi_c
> 2015-03-08 05:26:26,703 ERROR [solr.cloud.RecoveryStrategy] -
> [RecoveryThread] - : Error while trying to recover.
> core=oi_c:org.apache.solr.common.SolrException: No registered leader was
> found after waiting for 4000ms, collection: oi slice: shard3
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:568)
>         at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:551)
>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:330)
>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235)
>
> I haven't seen this before. The cluster was unavailable for indexing;
> search still worked as expected. Is there an open issue I can't seem to
> find now?
>
> Markus