lucene-solr-user mailing list archives

From Robert Brown <...@intelcompute.com>
Subject Re: Querying only replica's
Date Sun, 10 Jan 2016 19:00:14 GMT
I'm thinking more about how the external load-balancer will know if a 
node is down, so that it can take the node out of the pool of active 
servers before even attempting to send it a query.

I could ping, though that only tells me the IP is alive.  I could configure 
the load-balancer to actually try a query, but that may carry a performance 
hit, even if a tiny one.

Is there another recommended way of configuring external load-balancers 
to know when a node is not accepting queries?
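
One possible middle ground between an ICMP ping and a real query is an HTTP check against Solr's ping request handler. The sketch below assumes the handler is reachable at `/admin/ping` under the collection path and that it answers with a JSON body containing `"status": "OK"`; the base URL, collection name, and timeout are placeholders, not values from this thread.

```python
import json
import urllib.error
import urllib.request


def ping_url(base_url, collection):
    """Build the ping-handler URL for one collection on one node (assumed path)."""
    return f"{base_url.rstrip('/')}/{collection}/admin/ping?wt=json"


def is_healthy(body):
    """Return True only if the response body is JSON reporting status OK."""
    try:
        return json.loads(body).get("status") == "OK"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object: treat as unhealthy.
        return False


def check_node(base_url, collection, timeout=2.0):
    """Probe one node; any network error, bad body, or non-OK answer counts as down."""
    try:
        url = ping_url(base_url, collection)
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and is_healthy(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

A load-balancer health check would do the equivalent of `check_node("http://solr1:8983/solr", "mycollection")` on an interval; both the host and the collection name here are hypothetical.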



On 10/01/16 18:25, Erick Erickson wrote:
> For health checks, you can go ahead and get the real IP addresses and
> ping them directly if you care to.... Or just let Zookeeper do that
> for you. One of the tasks of Zookeeper is pinging all the machines
> with all the replicas and, if any of them are unreachable, telling the
> rest of the cluster that that machine is down.
>
> Best,
> Erick
>
> On Sun, Jan 10, 2016 at 5:19 AM, Robert Brown <rob@intelcompute.com> wrote:
>> Thanks Erick,
>>
>> For the health-checks on the load-balancer side, would you recommend a
>> simple query, or is there a reliable ping or similar for this scenario?
>>
>> Cheers,
>> Rob
>>
>>
>> On 09/01/16 23:44, Erick Erickson wrote:
>>> bq: is it best/good to get the CLUSTERSTATUS via the collection API
>>> and explicitly send queries to a replica to ensure I don't send
>>> queries to the leaders of my collection
>>>
>>> In a word _no_. SolrCloud is vastly different than the old
>>> master/slave. In SolrCloud, each and every node (leader and replicas)
>>> indexes all the docs and serves queries. The additional burden the leader
>>> has is actually very small. There's absolutely no reason to _not_ use
>>> the leader to serve queries.
>>>
>>> As far as sending updates, there would be a _little_ benefit to
>>> sending the updates directly to the leader, but _far_ more benefit in
>>> using SolrJ. If you use SolrJ (and CloudSolrClient), then the
>>> documents are split up on the _client_ and only the docs for a
>>> particular shard are automatically sent to the leader for that shard.
>>> Using SolrJ you can essentially scale indexing linearly with the
>>> number of shards you have. Just using HTTP does not scale linearly.
>>> Your particular app may not care, but in high-throughput situations
>>> this can be significant.
>>>
>>> So rather than spend time and effort sending updates directly to a
>>> leader and have the leader then forward the docs to the correct shard,
>>> I recommend investing the time in using SolrJ for updates rather than
>>> sending updates to the leader over HTTP. Or just ignore the problem
>>> and devote your efforts to something that is more valuable.
>>>
>>> So in short:
>>> 1> just stick a load balancer in front of _all_ your Solr nodes for
>>> queries. And note that there's an internal load balancer already in
>>> Solr that routes things around anyway, although putting a load
>>> balancer in front of your entire cluster makes it so there's not a
>>> single point of failure.
>>> 2> Depending on your throughput needs, either
>>> 2a> use SolrJ to index
>>> 2b> don't worry about it and send updates through the load balancer as
>>> well. There'll be an extra hop if you send updates to a replica, but
>>> if that's significant you should be using SolrJ
>>>
>>> As for 5.5, it's not at all clear that there _will_ be a 5.5. 5.4 was
>>> just released in early December. There's usually a several month lag
>>> between point releases and there's some agitation to start the 6.0
>>> release process, so it's up in the air.
>>>
>>>
>>> On Sat, Jan 9, 2016 at 12:04 PM, Robert Brown <rob@intelcompute.com>
>>> wrote:
>>>> Hi,
>>>>
>>>> (btw, when is 5.5 due?  I see the docs reference it, but not the download
>>>> page)
>>>>
>>>> Anyway, I index and query Solr over HTTP (no SolrJ, etc.) - is it best/good
>>>> to get the CLUSTERSTATUS via the collection API and explicitly send queries
>>>> to a replica to ensure I don't send queries to the leaders of my collection,
>>>> to improve performance?  Likewise with sending updates directly to a
>>>> Leader?
>>>>
>>>> My leaders will receive full updates of the entire collection once a day, so
>>>> I would assume if the leader is handling queries too, performance would be
>>>> hit?
>>>>
>>>> Is the CLUSTERSTATUS API the only way to do this btw without SolrJ, etc.?  I
>>>> wasn't sure if ZooKeeper would be able to tell me also.
>>>>
>>>> Do I also need to do anything to ensure the leaders are never sent queries
>>>> from the replicas?
>>>>
>>>> Does this all sound sane?
>>>>
>>>> One of my collections is 3 shards, with 2 replicas each (9 total nodes),
>>>> 70m docs in total.
>>>>
>>>> Thanks,
>>>> Rob
>>>>
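
For readers who still want to inspect cluster state over plain HTTP, the following is a minimal sketch of reading a CLUSTERSTATUS response (Collections API, `action=CLUSTERSTATUS`, JSON output). It assumes the usual nesting of `cluster`, `collections`, `shards`, and `replicas`, with `state`, `node_name`, `base_url`, and `core` fields on each replica. The `SAMPLE` data is hand-written for illustration and does not come from a real cluster. In line with the advice in this thread, leaders are not filtered out.

```python
def active_replica_urls(cluster_status, collection):
    """Return {shard_name: [core URLs]} for replicas that are active on a live node."""
    cluster = cluster_status["cluster"]
    live = set(cluster.get("live_nodes", []))
    shards = cluster["collections"][collection]["shards"]
    result = {}
    for shard_name, shard in shards.items():
        urls = []
        for replica in shard.get("replicas", {}).values():
            # A replica can still be listed as "active" after its node has died,
            # so cross-check against live_nodes as well.
            if replica.get("state") == "active" and replica.get("node_name") in live:
                urls.append(f"{replica['base_url']}/{replica['core']}")
        result[shard_name] = urls
    return result


# Hand-written, illustrative response fragment (hypothetical hosts and cores).
SAMPLE = {
    "cluster": {
        "live_nodes": ["host1:8983_solr", "host2:8983_solr"],
        "collections": {
            "products": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {
                                "state": "active", "leader": "true",
                                "node_name": "host1:8983_solr",
                                "base_url": "http://host1:8983/solr",
                                "core": "products_shard1_replica1",
                            },
                            "core_node2": {
                                "state": "down",
                                "node_name": "host2:8983_solr",
                                "base_url": "http://host2:8983/solr",
                                "core": "products_shard1_replica2",
                            },
                            "core_node3": {
                                "state": "active",
                                "node_name": "host3:8983_solr",  # not in live_nodes
                                "base_url": "http://host3:8983/solr",
                                "core": "products_shard1_replica3",
                            },
                        }
                    }
                }
            }
        },
    }
}

print(active_replica_urls(SAMPLE, "products"))
```

With the sample above, only the replica on `host1` is kept: one of the others is marked down and the third sits on a node missing from `live_nodes`.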

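The client-side splitting that Erick attributes to SolrJ can be pictured with a toy example: group documents by target shard before sending, so each batch goes to one shard's leader without a forwarding hop. This is only an illustration of the idea. `zlib.crc32` with a modulo is a stand-in chosen for brevity and is NOT the hash or range scheme Solr itself uses, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id, num_shards):
    """Map a document id to a shard index (stand-in hash, not Solr's router)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def split_by_shard(docs, num_shards):
    """Group documents into one batch per shard, keyed by shard index."""
    batches = defaultdict(list)
    for doc in docs:
        batches[shard_for(doc["id"], num_shards)].append(doc)
    return dict(batches)


docs = [{"id": f"doc-{i}"} for i in range(10)]
for shard, batch in sorted(split_by_shard(docs, 3).items()):
    print(shard, [d["id"] for d in batch])
```

Each batch could then be posted straight to the matching shard leader, which is why indexing throughput can grow with the number of shards in this model.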
