Subject: Re: SolrCloud liveness problems
From: Mark Miller <markrmiller@gmail.com>
Date: Tue, 17 Sep 2013 14:43:26 -0400
To: solr-user@lucene.apache.org

On Sep 17, 2013, at 12:00 PM, Vladimir Veljkovic wrote:

> Hello there,
>
> we have the following setup:
>
> SolrCloud 4.4.0 (3 nodes, physical machines)
> ZooKeeper 3.4.5 (3 nodes, physical machines)
>
> We have a number of rather small collections (~10K or ~100K documents) that we would like to load onto all Solr instances (numShards=1, replication_factor=3) and access through the local network interface, as the load balancing is done in the layers above.
>
> We can live (and in the test phase we actually do) with updating the entire collections whenever we need to, switching the collection aliases and removing the old collections.
>
> We stumbled across the following problem: as soon as all three Solr nodes become the leader of at least one collection, restarting any node makes it completely unresponsive (timeout), both through the admin interface and for replication. If we restart all Solr nodes, the cluster ends up in some kind of deadlock, and the only remedy we have found is a clean Solr installation, removing the ZooKeeper data and re-posting the collections.
>
> Apparently, the leader waits for the replicas to come up, and they try to synchronize but time out on HTTP requests, so everything ends up in some kind of deadlock, maybe related to:
>
> https://issues.apache.org/jira/browse/SOLR-5240

Yup, that sounds exactly like what you would expect with SOLR-5240. A fix for that is coming in 4.5, which is probably a week or so away.

> Eventually (after a few minutes), the leader takes over and marks the collections "active", but it remains blocked on the HTTP interface, so the other nodes cannot synchronize.
>
> In further tests, we loaded 4 collections with numShards=1 and replication_factor=2. By chance, one node became the leader of all 4 collections. Restarting a node which was not the leader worked without problems, but when we restarted the leader, the following happened:
> - the leader shut down, and the other nodes became leaders of 2 collections each
> - the leader started up, 3 collections on it became "active", one collection remained "down", and the node became unresponsive and timed out on HTTP requests.

Hard to say - I'll experiment with 4.5 and see if I can duplicate this.

- Mark

> As this behavior is completely unexpected for a cluster solution, I wonder if somebody else has experienced the same problems or whether we are doing something entirely wrong.
>
> Best regards
>
> --
>
> Vladimir Veljkovic
> Senior Java Developer
>
> Boxalino AG
>
> vladimir.veljkovic@boxalino.com
> www.boxalino.com
>
> Tuning Kit for your Online Shop
>
> Product Search - Recommendations - Landing Pages - Data Intelligence - Mobile Commerce
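For anyone reading along: the create / re-index / switch-alias / delete cycle Vladimir describes maps onto the Collections API roughly as sketched below. This is only an illustration, not taken from the thread: the host, collection names and alias name are made up, and it assumes a config set is already uploaded to ZooKeeper.

```python
# Sketch of the collection-rollover workflow against the Solr Collections API.
# Host, collection names and alias ("products*") are hypothetical examples.
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr"

def collections_api(**params):
    """Call the Collections API and return the raw response body."""
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{SOLR}/admin/collections?{query}") as resp:
        return resp.read().decode("utf-8")

# 1. Create the next generation of the collection, replicated to all three nodes
#    (assumes the config set already lives in ZooKeeper).
collections_api(action="CREATE", name="products_v2",
                numShards=1, replicationFactor=3)

# 2. Re-index the documents into products_v2 (not shown).

# 3. Repoint the alias that the application actually queries.
collections_api(action="CREATEALIAS", name="products", collections="products_v2")

# 4. Drop the previous generation once the alias has switched.
collections_api(action="DELETE", name="products_v1")
```

Since clients only ever query the alias, the switch in step 3 is what lets the old collection be dropped without touching the application.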