lucene-solr-user mailing list archives

From Erick Erickson <>
Subject Re: Cascading failures with replicas
Date Sat, 18 Mar 2017 19:46:51 GMT
Bug #2: Solr shouldn't be adding replicas by itself unless you
specified autoAddReplicas=true when you created the collection. It
defaults to "false", so I'm not sure what's going on here.
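
To rule out the default being overridden somewhere, the flag can be passed explicitly when the collection is created. A minimal sketch that only builds the Collections API CREATE URL; the host, port, and collection name are placeholders I made up, and the 4/4 values mirror the cluster described in the quoted message:

```python
from urllib.parse import urlencode

# Hypothetical host, port, and collection name; substitute your own.
SOLR_BASE = "http://localhost:8983/solr"

def create_collection_url(name, num_shards, replication_factor, auto_add_replicas=False):
    """Build a Collections API CREATE URL with autoAddReplicas set explicitly."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "autoAddReplicas": "true" if auto_add_replicas else "false",
    }
    return f"{SOLR_BASE}/admin/collections?{urlencode(params)}"

url = create_collection_url("mycollection", 4, 4)
print(url)
```

This does not send the request; it only shows which parameter to look for in whatever script or tool created the collection.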

Bug #3: The internal load balancers are round-robin, so this is
expected. Not optimal, I'll grant, but expected.
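
A toy illustration of why per-core round-robin doubles the load on a host that carries two cores. This is a simplified model, not Solr's load balancer code, and the host and core names are invented:

```python
from collections import Counter
from itertools import cycle, islice

def requests_per_host(replicas, n_requests):
    """Distribute n_requests round-robin over (host, core) pairs and count per host."""
    rotation = cycle(replicas)
    return Counter(host for host, _core in islice(rotation, n_requests))

# Hypothetical layout: host-c ended up with two cores after a replica was added.
replicas = [
    ("host-a", "core1"),
    ("host-b", "core2"),
    ("host-c", "core3"),
    ("host-c", "core4"),
]

counts = requests_per_host(replicas, 1000)
print(counts)
```

Each core receives the same share of requests, so host-c handles twice as many as host-a or host-b.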

Bug #4: What shard placement rules are you using? There are a series
of rules for replica placement and one of the criteria (IIRC) is
exactly to try to distribute replicas to different hosts. Although
there was some glitchiness about whether two JVMs on the same _host_
were considered "the same host" or not.
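
If I recall the rule-based placement docs correctly, a rule along the lines of shard:*,replica:<2,node:* is meant to keep fewer than two replicas of any shard on one node; please check that against the reference guide for your version. Independently of the rules, here is a small sketch for spotting the problem in an existing layout. The (shard, host) pairs are hypothetical input that you would extract from the cluster state yourself:

```python
from collections import defaultdict

def colocated_replicas(placement):
    """Return {(shard, host): count} for every shard with more than one replica on a host."""
    seen = defaultdict(int)
    for shard, host in placement:
        seen[(shard, host)] += 1
    return {key: n for key, n in seen.items() if n > 1}

# Hypothetical layout: shard1 has two replicas on host-a.
placement = [
    ("shard1", "host-a"),
    ("shard1", "host-a"),
    ("shard1", "host-b"),
    ("shard2", "host-c"),
    ("shard2", "host-d"),
]

violations = colocated_replicas(placement)
print(violations)
```

Keying on the host name rather than the node name sidesteps the question of whether two JVMs on one machine count as the same host.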

Bug #1 has been more or less of a pain for quite a while; work is ongoing there.


On Fri, Mar 17, 2017 at 5:40 PM, Walter Underwood <> wrote:
> I’m running a 4x4 cluster (4 shards, replication factor 4) on 16 hosts. I shut down
> Solr on one host because it got into some kind of bad, can’t-recover state where it was
> causing timeouts across the whole cluster (bug #1).
>
> I ran a load benchmark near the capacity of the cluster. This had run fine in test, this
> was the prod cluster.
>
> Solr Cloud added a replica to replace the down node. The node with two cores got double
> the traffic and started slowly flapping in and out of service. The 95th percentile response
> spiked from 3 seconds to 100 seconds. At some point, another replica was made, with two replicas
> from the same shard on the same instance. Naturally, that was overloaded, and I killed the
> benchmark out of charity.
>
> Bug #2 is creating a new replica when one host is down. This should be an option and
> default to “false”, because it causes the cascade.
>
> Bug #3 is sending equal traffic to each core without considering the host. Each host
> should get equal traffic, not each core.
>
> Bug #4 is putting two replicas from the same shard on one instance. That is just asking
> for trouble.
>
> When it works, this cluster is awesome.
>
> wunder
> Walter Underwood
>  (my blog)
