lucene-solr-user mailing list archives

From John Bickerstaff <j...@johnbickerstaff.com>
Subject Re: Verifying - SOLR Cloud replaces load balancer?
Date Tue, 19 Apr 2016 17:26:24 GMT
@Charlie

It's easy to do and wow does it save time and database resources...

I've built a Spring Boot microservices architecture that also registers in
Zookeeper.  One microservice pulls from the original data source and
pushes to Kafka.  The second microservice pulls from Kafka into Solr.

Because they're registered in Zookeeper, the microservices can be brought
up anywhere in the infrastructure I'm building and "rebuild" Solr indices
from scratch.

I.e., if you lose Solr completely, just bring up a new VM copy with an empty
index, start your microservice, and rebuild the index from scratch.
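
For anyone curious what the rebuild step amounts to, here is a rough Python
sketch (the real services are Spring Boot, so this is illustration only). It
assumes the records have already been read from the Kafka topic as dicts and
that Solr's JSON update handler is reachable at a hypothetical URL.
SOLR_UPDATE_URL, batches, post_to_solr, rebuild_index, and the batch size of
500 are all made-up names and values. Commits are left to autoCommit or a
separate request.

```python
import json
import urllib.request
from itertools import islice
from typing import Callable, Iterable, Iterator, List

# Hypothetical endpoint: adjust host, port, and collection name.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"


def batches(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of up to `size` records until the input is exhausted."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def post_to_solr(docs: List[dict], url: str = SOLR_UPDATE_URL) -> None:
    """POST one batch of documents, as a JSON array, to the update handler."""
    req = urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()  # drain the response; HTTP errors raise an exception


def rebuild_index(
    records: Iterable[dict],
    send: Callable[[List[dict]], None] = post_to_solr,
    batch_size: int = 500,
) -> int:
    """Replay every record into Solr in batches; return the number sent."""
    total = 0
    for chunk in batches(records, batch_size):
        send(chunk)
        total += len(chunk)
    return total
```

In practice `records` would be an iterator over deserialized messages from a
Kafka consumer reading the topic from the beginning. Passing a different
`send` makes it easy to dry-run the replay without a Solr instance.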

We're dropping it all into AWS eventually.

It's sweet.  The original "run" to consolidate the data from various
databases takes over an hour -- IF the load on production is light. Running
out of Kafka takes less than 10 minutes and totally avoids loading
production databases.

If you're interested, ping me -- I'm happy to share what I've got...

On Tue, Apr 19, 2016 at 2:08 AM, Charlie Hull <charlie@flax.co.uk> wrote:

> On 18/04/2016 18:22, John Bickerstaff wrote:
>
>> So - my IT guy makes the case that we don't really need Zookeeper / Solr
>> Cloud...
>>
>> He may be right - we're serving static data (changes to the collection
>> occur only 2 or 3 times a year and are minor)
>>
>> We probably could have 3 or 4 Solr nodes running in non-Cloud mode -- each
>> configured the same way, behind a load balancer and do fine.
>>
>> I've got a Kafka server set up with the solr docs as topics.  It takes
>> about 10 minutes to reload a "blank" Solr Server from the Kafka topic...
>> If I target 3-4 SOLR servers from my microservice instead of one, it
>> wouldn't take much longer than 10 minutes to concurrently reload all 3 or 4
>> Solr servers from scratch...
>>
>
> This is something we've been discussing as a concept - to offload all the
> scaling stuff to Kafka (which is very good at that sort of thing) and
> simply hang Solr instances onto a Kafka topic. We've not taken it any
> further than a concept at this point but interesting to hear about others
> doing so!
>
> Charlie
>
>
>
>> I'm biased in terms of using the most recent functionality, but I'm aware
>> that bias is not necessarily based on facts and want to do my due
>> diligence...
>>
>> Aside from the obvious benefits of spreading work across nodes (which may
>> not be a big deal in our application and which my IT guy proposes is more
>> transparently handled with a load balancer he understands) are there any
>> other considerations that would drive a choice for Solr Cloud (zookeeper
>> etc)?
>>
>>
>>
>> On Mon, Apr 18, 2016 at 9:26 AM, Tom Evans <tevans.uk@googlemail.com>
>> wrote:
>>
>>> On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff
>>> <john@johnbickerstaff.com> wrote:
>>>
>>>> Thanks all - very helpful.
>>>>
>>>> @Shawn - your reply implies that even if I'm hitting the URL for a single
>>>> endpoint via HTTP - the "balancing" will still occur across the Solr Cloud
>>>> (I understand the caveat about that single endpoint being a potential point
>>>> of failure).  I just want to verify that I'm interpreting your response
>>>> correctly...
>>>>
>>>> (I have been asked to provide IT with a comprehensive list of options prior
>>>> to a design discussion - which is why I'm trying to get clear about the
>>>> various options)
>>>>
>>>> In a nutshell, I think I understand the following:
>>>>
>>>> a. Even if hitting a single URL, the Solr Cloud will "balance" across all
>>>> available nodes for searching
>>>>            Caveat: That single URL represents a potential single point of
>>>> failure and this should be taken into account
>>>>
>>>> b. SolrJ's CloudSolrClient API provides the ability to distribute load --
>>>> based on Zookeeper's "knowledge" of all available Solr instances.
>>>>            Note: This is more robust than "a" due to the fact that it
>>>> eliminates the "single point of failure"
>>>>
>>>> c.  Use of a load balancer hitting all known Solr instances will be fine -
>>>> although the search requests may not run on the Solr instance the load
>>>> balancer targeted - due to "a" above.
>>>>
>>>> Corrections or refinements welcomed...
>>>>
>>>
>>> With option a), although queries will be distributed across the
>>> cluster, all queries will be going through that single node. Not only
>>> is that a single point of failure, but you risk saturating the
>>> inter-node network, possibly resulting in lower QPS and higher
>>> latency on your queries.
>>>
>>> With option b), as well as SolrJ, recent versions of pysolr have a
>>> ZK-aware SolrCloud client that behaves in a similar way.
>>>
>>> With option c), you can use the preferLocalShards parameter so that
>>> shards that are local to the queried node are used in preference to
>>> remote shards. Depending on your shard/cluster topology, this can
>>> increase performance if you are returning large amounts of data - many
>>> or large fields or many documents.
>>>
>>> Cheers
>>>
>>> Tom
>>>
>>>
>>
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>
