From: Erick Erickson
Date: Tue, 26 Apr 2016 18:20:02 -0700
Subject: Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.
To: solr-user

One of the reasons this happens is if you have very long GC cycles, longer
than the ZooKeeper "keep alive" timeout. During a full GC pause, Solr is
unresponsive, and if the ZK ping times out, ZK assumes the machine is gone
and you get into this recovery state.

So I'd collect GC logs and see if you have any stop-the-world GC pauses
that take longer than the ZK timeout. See Mark Miller's primer on GC here:
https://lucidworks.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/

Best,
Erick
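For a concrete starting point, a rough sketch along these lines would flag the stop-the-world pauses Erick describes. It assumes JDK 7/8 GC logging with -XX:+PrintGCApplicationStoppedTime enabled and the default 15-second zkClientTimeout; both are assumptions, so adjust them to your setup:

import re
import sys

# Assumed ZooKeeper session timeout (zkClientTimeout); adjust to match your solr.xml.
ZK_TIMEOUT_SECS = 15.0

# Matches JDK 7/8 output from -XX:+PrintGCApplicationStoppedTime, e.g.
# "Total time for which application threads were stopped: 17.0301233 seconds"
PAUSE = re.compile(r"Total time for which application threads were stopped: ([0-9.]+) seconds")

def long_pauses(gc_log_path, threshold=ZK_TIMEOUT_SECS):
    """Yield (line_number, pause_seconds) for stop-the-world pauses over the threshold."""
    with open(gc_log_path) as gc_log:
        for lineno, line in enumerate(gc_log, 1):
            match = PAUSE.search(line)
            if match and float(match.group(1)) >= threshold:
                yield lineno, float(match.group(1))

if __name__ == "__main__":
    for lineno, secs in long_pauses(sys.argv[1]):
        print("line %d: %.2f s stop-the-world pause (>= %.1f s ZK timeout)"
              % (lineno, secs, ZK_TIMEOUT_SECS))

Any pause it reports at or above the timeout would line up with the Disconnected / "Connection expired" events in the solr.log excerpt quoted below.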
On Tue, Apr 26, 2016 at 2:13 PM, Li Ding wrote:
> Thank you all for your help!
>
> The ZooKeeper log rolled over; this is from Solr.log.
>
> Looks like the Solr and ZK connection is gone for some reason:
>
> INFO - 2016-04-21 12:37:57.536; org.apache.solr.common.cloud.ConnectionManager; Watcher org.apache.solr.common.cloud.ConnectionManager@19789a96 name:ZooKeeperConnection Watcher:{ZK HOSTS here} got event WatchedEvent state:Disconnected type:None path:null path:null type:None
> INFO - 2016-04-21 12:37:57.536; org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
> INFO - 2016-04-21 12:38:24.248; org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired - starting a new one...
> INFO - 2016-04-21 12:38:24.262; org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect to ZooKeeper
> INFO - 2016-04-21 12:38:24.269; org.apache.solr.common.cloud.ConnectionManager; Connected:true
>
> Then it publishes that all cores on the host are down. I just list three cores here:
>
> INFO - 2016-04-21 12:38:24.269; org.apache.solr.cloud.ZkController; publishing core=product1_shard1_replica1 state=down
> INFO - 2016-04-21 12:38:24.271; org.apache.solr.cloud.ZkController; publishing core=collection1 state=down
> INFO - 2016-04-21 12:38:24.272; org.apache.solr.cloud.ZkController; numShards not found on descriptor - reading it from system property
> INFO - 2016-04-21 12:38:24.289; org.apache.solr.cloud.ZkController; publishing core=product2_shard5_replica1 state=down
> INFO - 2016-04-21 12:38:24.292; org.apache.solr.cloud.ZkController; publishing core=product2_shard13_replica1 state=down
>
> product1 has only one shard with one replica, and it was able to become active successfully:
>
> INFO - 2016-04-21 12:38:26.383; org.apache.solr.cloud.ZkController; Register replica - core:product1_shard1_replica1 address:http://{internalIp}:8983/solr collection:product1 shard:shard1
> WARN - 2016-04-21 12:38:26.385; org.apache.solr.cloud.ElectionContext; cancelElection did not find election node to remove
> INFO - 2016-04-21 12:38:26.393; org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process for shard shard1
> INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to continue.
> INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader - try and sync
> INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync replicas to http://{internalIp}:8983/solr/product1_shard1_replica1/
> INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync Success - now sync replicas to me
> INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; http://{internalIp}:8983/solr/product1_shard1_replica1/ has no replicas
> INFO - 2016-04-21 12:38:26.399; org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: http://{internalIp}:8983/solr/product1_shard1_replica1/ shard1
> INFO - 2016-04-21 12:38:26.399; org.apache.solr.common.cloud.SolrZkClient; makePath: /collections/product1/leaders/shard1
> INFO - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; We are http://{internalIp}:8983/solr/product1_shard1_replica1/ and leader is http://{internalIp}:8983/solr/product1_shard1_replica1/
> INFO - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; No LogReplay needed for core=product1_replica1 baseURL=http://{internalIp}:8983/solr
> INFO - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; I am the leader, no recovery necessary
> INFO - 2016-04-21 12:38:26.413; org.apache.solr.cloud.ZkController; publishing core=product1_shard1_replica1 state=active
>
> product2 has 15 shards with one replica each, but only two of those shards live on this machine. This is one of the failed shards; I never saw the message saying the core product2_shard5_replica1 became active:
>
> INFO - 2016-04-21 12:38:26.616; org.apache.solr.cloud.ZkController; Register replica - product2_shard5_replica1 address:http://{internalIp}:8983/solr collection:product2 shard:shard5
> WARN - 2016-04-21 12:38:26.618; org.apache.solr.cloud.ElectionContext; cancelElection did not find election node to remove
> INFO - 2016-04-21 12:38:26.625; org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process for shard shard5
> INFO - 2016-04-21 12:38:26.631; org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to continue.
> INFO - 2016-04-21 12:38:26.631; org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader - try and sync
> INFO - 2016-04-21 12:38:26.631; org.apache.solr.cloud.SyncStrategy; Sync replicas to http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/
> INFO - 2016-04-21 12:38:26.631; org.apache.solr.cloud.SyncStrategy; Sync Success - now sync replicas to me
> INFO - 2016-04-21 12:38:26.632; org.apache.solr.cloud.SyncStrategy; http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/ has no replicas
> INFO - 2016-04-21 12:38:26.632; org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/ shard5
> INFO - 2016-04-21 12:38:26.632; org.apache.solr.common.cloud.SolrZkClient; makePath: /collections/product2_shard5_replica1/leaders/shard5
> INFO - 2016-04-21 12:38:26.645; org.apache.solr.cloud.ZkController; We are http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/ and leader is http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/
> INFO - 2016-04-21 12:38:26.646; org.apache.solr.common.cloud.ZkStateReader; Updating cloud state from ZooKeeper...
>
> Before I restarted this server, a bunch of queries failed for the collection product2, but I don't think that would affect the core status.
>
> Do you have any idea why this particular core was not published as active? From the log, most of the steps completed except the very last one, publishing the state to ZK.
>
> Thanks,
>
> Li
>
> On Thu, Apr 21, 2016 at 7:08 AM, Rajesh Hazari wrote:
>
>> Hi Li,
>>
>> Do you see timeouts like "CLUSTERSTATUS the collection time out:180s"?
>> If that's the case, this may be related to
>> https://issues.apache.org/jira/browse/SOLR-7940,
>> and I would say either use the patch file or upgrade.
>>
>> Thanks,
>> Rajesh
>> 8328789519
>> If I don't answer your call please leave a voicemail with your contact
>> info, will return your call ASAP.
>>
>> On Thu, Apr 21, 2016 at 6:02 AM, YouPeng Yang wrote:
>>
>> > Hi,
>> > We have used Solr 4.6 for 2 years. If you post more logs, maybe we can
>> > fix it.
>> >
>> > 2016-04-21 6:50 GMT+08:00 Li Ding :
>> >
>> > > Hi All,
>> > >
>> > > We are using SolrCloud 4.6.1. We have observed the following behaviors
>> > > recently. A Solr node in a SolrCloud cluster is up, but some of the
>> > > cores on the node are marked as down in ZooKeeper. If the cores are
>> > > part of a multi-sharded collection with one replica, queries to that
>> > > collection will fail. However, when this happened, if we issued
>> > > queries to the core directly, it returned 200 and correct info. But
>> > > once Solr gets into this state, the core will be marked down forever
>> > > unless we restart Solr.
>> > >
>> > > Has anyone seen this behavior before? Is there any way for it to get
>> > > out of this state on its own?
>> > >
>> > > Thanks,
>> > >
>> > > Li
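To illustrate the mismatch described in the original question (a core that answers queries directly but stays marked down in ZooKeeper), a rough sketch like the one below compares what ZooKeeper publishes for each replica with what the core itself reports. It assumes the kazoo client library, the single /clusterstate.json layout used by Solr 4.x, and that the default /admin/ping handler is configured; the ZK hosts and collection name are placeholders, not values from this thread:

import json
from urllib.request import urlopen

from kazoo.client import KazooClient  # assumed dependency: pip install kazoo

ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"  # placeholder ZooKeeper ensemble
COLLECTION = "product2"                  # placeholder collection name

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()
try:
    # Solr 4.x keeps the state of all collections in a single /clusterstate.json znode.
    data, _stat = zk.get("/clusterstate.json")
    state = json.loads(data.decode("utf-8"))
finally:
    zk.stop()

for shard, shard_info in state[COLLECTION]["shards"].items():
    for name, replica in shard_info["replicas"].items():
        core_url = replica["base_url"].rstrip("/") + "/" + replica["core"]
        try:
            # Assumes the default /admin/ping handler is enabled in solrconfig.xml.
            raw = urlopen(core_url + "/admin/ping?wt=json", timeout=10).read()
            ping_status = json.loads(raw.decode("utf-8")).get("status", "unknown")
        except Exception as exc:
            ping_status = "unreachable (%s)" % exc
        print("%s %s: ZK state=%s, direct ping=%s"
              % (shard, name, replica.get("state"), ping_status))

A replica that pings OK but shows state=down in ZooKeeper points at the publish step never completing, as in the log excerpt above, rather than at the core itself being unhealthy.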