lucene-solr-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: SolrCloud liveness problems
Date Wed, 18 Sep 2013 03:50:30 GMT
SOLR-5243 and SOLR-5240 will likely improve the situation. Both fixes are in 4.5 - the first
RC for 4.5 will likely come tomorrow.

Thanks to yonik for sussing these out.

- Mark

On Sep 17, 2013, at 2:43 PM, Mark Miller <markrmiller@gmail.com> wrote:

> 
> On Sep 17, 2013, at 12:00 PM, Vladimir Veljkovic <vladimir.veljkovic@boxalino.com>
wrote:
> 
>> Hello there,
>> 
>> we have the following setup:
>> 
>> SolrCloud 4.4.0 (3 nodes, physical machines)
>> Zookeeper 3.4.5 (3 nodes, physical machines)
>> 
>> We have a number of rather small collections (~10K to ~100K documents) that we
would like to load onto all Solr instances (numShards=1, replication_factor=3) and access
through the local network interface, as load balancing is done in layers above.
>> 
>> We can live (and actually do, in the test phase) with rebuilding entire collections
whenever we need to, switching collection aliases, and removing the old collections.
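For reference, the create / re-alias / delete cycle described above can be sketched with the Solr Collections API. This is a hedged sketch, not the poster's actual setup: the host, port, and collection/alias names below are hypothetical.

```shell
# Sketch of the collection-swap workflow via the Solr Collections API.
# Host, port, and collection/alias names are hypothetical examples.
SOLR="http://localhost:8983/solr"
ALIAS="products"     # stable alias the application queries
OLD="products_v1"    # collection currently behind the alias
NEW="products_v2"    # freshly built replacement collection

# 1. Create the new collection alongside the old one.
CREATE_URL="$SOLR/admin/collections?action=CREATE&name=$NEW&numShards=1&replicationFactor=3"

# 2. After indexing into $NEW, repoint the alias in one step.
ALIAS_URL="$SOLR/admin/collections?action=CREATEALIAS&name=$ALIAS&collections=$NEW"

# 3. Once nothing queries $OLD anymore, drop it.
DELETE_URL="$SOLR/admin/collections?action=DELETE&name=$OLD"

# Against a live cluster you would run, in order:
#   curl "$CREATE_URL" && curl "$ALIAS_URL" && curl "$DELETE_URL"
echo "$ALIAS_URL"
```

Because CREATEALIAS replaces the alias target in a single call, readers can swap the collection the application sees without any client-side changes.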
>> 
>> We stumbled across the following problem: as soon as all three Solr nodes become the leader
of at least one collection, restarting any node makes it completely unresponsive (timeouts),
both through the admin interface and for replication. If we restart all Solr nodes, the cluster
ends up in some kind of deadlock, and the only remedy we found is a clean Solr installation, removing
the ZooKeeper data, and re-posting the collections.
>> 
>> Apparently, the leader waits for the replicas to come up, and they try to synchronize
but time out on HTTP requests, so everything ends up in some kind of deadlock, possibly related
to:
>> 
>> https://issues.apache.org/jira/browse/SOLR-5240
> 
> Yup, that sounds exactly like what you would expect with SOLR-5240. A fix for that is coming
in 4.5, which is probably a week or so away.
> 
>> 
>> Eventually (after a few minutes), the leader takes over and marks the collections "active", but
it remains blocked on the HTTP interface, so the other nodes cannot synchronize.
>> 
>> In further tests, we loaded 4 collections with numShards=1 and replication_factor=2.
By chance, one node became the leader for all 4 collections. Restarting a node that was
not the leader worked without problems, but when we restarted the leader:
>> - the leader shut down, and the other nodes became leaders of 2 collections each
>> - the leader started up, 3 collections on it became "active", one collection remained "down",
and the node became unresponsive and timed out on HTTP requests.
> 
> Hard to say - I'll experiment with 4.5 and see if I can duplicate this.
> 
> - Mark
> 
>> 
>> As this behavior is completely unexpected for a clustered solution, I wonder whether somebody
else has experienced the same problems or we are doing something entirely wrong.
>> 
>> Best regards
>> 
>> -- 
>> 
>> Vladimir Veljkovic
>> Senior Java Developer
>> 
>> Boxalino AG
>> 
>> vladimir.veljkovic@boxalino.com 
>> www.boxalino.com 
>> 
>> 
>> Tuning Kit for your Online Shop
>> 
>> Product Search - Recommendations - Landing Pages - Data intelligence - Mobile Commerce

>> 
>> 
> 

