Subject: Re: SolrCloud liveness problems
From: Mark Miller <markrmiller@gmail.com>
Date: Tue, 17 Sep 2013 14:43:26 -0400
To: solr-user@lucene.apache.org

On Sep 17, 2013, at 12:00 PM, Vladimir Veljkovic wrote:

> Hello there,
>
> we have the following setup:
>
> SolrCloud 4.4.0 (3 nodes, physical machines)
> ZooKeeper 3.4.5 (3 nodes, physical machines)
>
> We have a number of rather small collections (~10K or ~100K documents) that we would like to load onto all Solr instances (numShards=1, replication_factor=3) and access through the local network interface, as the load balancing is done in the layers above.
>
> We can live (and in the test phase we actually do) with updating the entire collections whenever we need to, switching the collection aliases and removing the old collections.
>
> We stumbled across the following problem: as soon as all three Solr nodes become the leader of at least one collection, restarting any node makes it completely unresponsive (timeout), both through the admin interface and for replication. If we restart all Solr nodes, the cluster ends up in some kind of deadlock, and the only remedy we have found is a clean Solr installation, removing the ZooKeeper data and re-posting the collections.
>
> Apparently, the leader waits for the replicas to come up, and they try to synchronize but time out on HTTP requests, so everything ends up in some kind of deadlock, maybe related to:
>
> https://issues.apache.org/jira/browse/SOLR-5240

Yup, that sounds exactly like what you would expect with SOLR-5240. A fix for that is coming in 4.5, which is probably a week or so away.

> Eventually (after a few minutes), the leader takes over and marks the collections "active", but it remains blocked on the HTTP interface, so the other nodes cannot synchronize.
>
> In further tests, we loaded 4 collections with numShards=1 and replication_factor=2. By chance, one node became the leader of all 4 collections. Restarting a node which was not the leader worked without problems, but when we restarted the leader, the following happened:
> - the leader shut down, and the other nodes became leaders of 2 collections each
> - the leader started up, 3 collections on it became "active", one collection remained "down", and the node became unresponsive and timed out on HTTP requests.

Hard to say - I'll experiment with 4.5 and see if I can duplicate this.

- Mark

> As this behavior is completely unexpected for a cluster solution, I wonder if somebody else has experienced the same problems or whether we are doing something entirely wrong.
>
> Best regards
>
> --
>
> Vladimir Veljkovic
> Senior Java Developer
>
> Boxalino AG
>
> vladimir.veljkovic@boxalino.com
> www.boxalino.com
>
> Tuning Kit for your Online Shop
>
> Product Search - Recommendations - Landing Pages - Data Intelligence - Mobile Commerce
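For anyone reading along: the create / re-index / switch-alias / delete cycle Vladimir describes maps onto the Collections API roughly as sketched below. This is only an illustration, not taken from the thread: the host, collection names and alias name are made up, and it assumes a config set is already uploaded to ZooKeeper.

```python
# Sketch of the collection-rollover workflow against the Solr Collections API.
# Host, collection names and alias ("products*") are hypothetical examples.
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr"

def collections_api(**params):
    """Call the Collections API and return the raw response body."""
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{SOLR}/admin/collections?{query}") as resp:
        return resp.read().decode("utf-8")

# 1. Create the next generation of the collection, replicated to all three nodes
#    (assumes the config set already lives in ZooKeeper).
collections_api(action="CREATE", name="products_v2",
                numShards=1, replicationFactor=3)

# 2. Re-index the documents into products_v2 (not shown).

# 3. Repoint the alias that the application actually queries.
collections_api(action="CREATEALIAS", name="products", collections="products_v2")

# 4. Drop the previous generation once the alias has switched.
collections_api(action="DELETE", name="products_v1")
```

Since clients only ever query the alias, the switch in step 3 is what lets the old collection be dropped without touching the application.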