lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Date Tue, 07 Oct 2014 02:52:51 GMT
I think there were some holes that would allow replicas and leaders to
get out of sync; those have been patched up in the last 3 releases.

There shouldn't be anything you need to do to keep these in sync, so
if you can capture what happened when things got out of sync, we'll
fix it. But a lot has changed in the last several months, so the first
thing I'd do, if possible, is upgrade to 4.10.1.


Best,
Erick

On Mon, Oct 6, 2014 at 2:41 PM, S.L <simpleliving016@gmail.com> wrote:
> Hi Erick,
>
> Before I tried your suggestion of issuing a commit=true update, I realized that for each
> shard there was at least one node whose index directory was named like index.<timestamp>.
>
> I went ahead and deleted that index directory, which restarted the core; the index
> directory is now synced with the other node and is properly named 'index' without any
> timestamp attached to it. This now gives me consistent results for distrib=true through
> the load balancer, and distrib=false returns expected results for a given shard.
>
> The underlying issue appears to be that in every shard the leader and the replica
> (follower) were out of sync.
>
> How can I prevent this from happening again?
>
> Thanks for your help!
>
> Sent from my HTC
>
> ----- Reply message -----
> From: "Erick Erickson" <erickerickson@gmail.com>
> To: <solr-user@lucene.apache.org>
> Subject: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
> Date: Fri, Oct 3, 2014 12:56 AM
>
> Hmmmm. Assuming that you aren't re-indexing the doc you're searching for...
>
> Try issuing http://blah blah:8983/solr/collection/update?commit=true.
> That'll force all the docs to be searchable. Does <1> still hold for
> the document in question? Because this is exactly backwards from what
> I'd expect. If anything, I'd expect the replica (I try to call it the
> "follower" when a distinction needs to be made, since the leader is a
> "replica" too...) to be out of sync. This is still a Bad Thing, but
> the leader gets first crack at indexing things.
>
> bq: only the replica of the shard that has this key returns the result,
> and the leader does not
>
> Just to be sure we're talking about the same thing. When you say
> "leader", you mean the shard leader, right? The filled-in circle on
> the graph view from the admin/cloud page.
>
> And let's see your soft and hard commit settings please.
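For reference, the soft and hard commit settings Erick is asking about live in the collection's solrconfig.xml; a typical configuration looks like the fragment below. The interval values here are purely illustrative, not recommendations:

```xml
<!-- Hard commit: flush pending docs to stable storage; with
     openSearcher=false this does NOT make them searchable yet. -->
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commit: open a new searcher so recent docs become visible. -->
<autoSoftCommit>
  <maxTime>60000</maxTime>
</autoSoftCommit>
```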
>
> Best,
> Erick
>
> On Thu, Oct 2, 2014 at 9:48 PM, S.L <simpleliving016@gmail.com> wrote:
>> Erick,
>>
>> 0> Load balancer is out of the picture.
>>
>> 1> When I query with *distrib=false*, I get consistent results as
>> expected for the shards that don't have the key, i.e. I don't get
>> results back from those shards. However, I just realized that with
>> *distrib=false* present in the query for the shard that is supposed to
>> contain the key, only the replica of that shard returns the result and
>> the leader does not. It looks like the replica and the leader do not
>> have the same data, and the replica is the one that contains the key
>> for that shard.
>>
>> 2> By indexing I mean this collection is being populated by a web crawler.
>>
>> So it looks like 1> above is pointing to the leader and replica being
>> out of sync for at least one shard.
>>
>>
>>
>> On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson <erickerickson@gmail.com>
>> wrote:
>>
>>> bq: Also, the collection is being actively indexed as I query this,
>>> could that be an issue too?
>>>
>>> Not if the documents you're searching aren't being added as you search
>>> (and all your autocommit intervals have expired).
>>>
>>> I would turn off indexing for testing, it's just one more variable
>>> that can get in the way of understanding this.
>>>
>>> Do note that if the problem were endemic to Solr, there would probably
>>> be a _lot_ more noise out there.
>>>
>>> So to recap:
>>> 0> we can take the load balancer out of the picture all together.
>>>
>>> 1> when you query each shard individually with &distrib=false, every
>>> replica in a particular shard returns the same count.
>>>
>>> 2> when you query without &distrib=false you get varying counts.
>>>
>>> This is very strange and not at all expected. Let's try it again
>>> without indexing going on....
>>>
>>> And what do you mean by "indexing" anyway? How are documents being fed
>>> to your system?
>>>
>>> Best,
>>> Erick@PuzzledAsWell
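The replica-by-replica check from the recap above can be sketched as follows. Host names, the collection name, and the numFound values are made up for illustration; the assertion is simply that every replica of a shard should report the same count when queried with distrib=false:

```python
import json
from urllib.request import urlopen

def replica_count(host, collection, query, port=8983):
    """Ask one node directly (distrib=false) how many docs match."""
    url = (f"http://{host}:{port}/solr/{collection}/select"
           f"?q={query}&rows=0&wt=json&distrib=false")
    with urlopen(url) as resp:
        return json.load(resp)["response"]["numFound"]

def shard_is_consistent(counts):
    """True when every replica of the shard reports the same numFound."""
    return len(set(counts)) <= 1

# e.g. counts = [replica_count(h, "mycollection", "*:*") for h in shard_hosts]
print(shard_is_consistent([1042, 1042]))  # True: replicas agree
print(shard_is_consistent([1042, 987]))   # False: leader/replica out of sync
```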
>>>
>>> On Thu, Oct 2, 2014 at 7:32 PM, S.L <simpleliving016@gmail.com> wrote:
>>> > Erick,
>>> >
>>> > I would like to add that the interesting behavior, i.e. point #2 that
>>> > I mentioned in my earlier reply, happens in all the shards. If this
>>> > were a distributed search issue, it should not have manifested itself
>>> > in the shard that contains the key I am searching for; it looks like
>>> > the search is just failing as a whole, intermittently.
>>> >
>>> > Also, the collection is being actively indexed as I query this; could
>>> > that be an issue too?
>>> >
>>> > Thanks.
>>> >
>>> > On Thu, Oct 2, 2014 at 10:24 PM, S.L <simpleliving016@gmail.com> wrote:
>>> >
>>> >> Erick,
>>> >>
>>> >> Thanks for your reply, I tried your suggestions.
>>> >>
>>> >> 1. When not using the load balancer, if I have *distrib=false* I get
>>> >> consistent results across the replicas.
>>> >>
>>> >> 2. However, here's the interesting part: while not using the load
>>> >> balancer, if I *don't have distrib=false*, then when I query a
>>> >> particular node I get the same behaviour as if I were using the load
>>> >> balancer, meaning the distributed search from a node works
>>> >> intermittently. Does this give any clue?
>>> >>
>>> >>
>>> >>
>>> >> On Thu, Oct 2, 2014 at 7:47 PM, Erick Erickson <erickerickson@gmail.com>
>>> >> wrote:
>>> >>
>>> >>> Hmmm, nothing quite makes sense here....
>>> >>>
>>> >>> Here are some experiments:
>>> >>> 1> avoid the load balancer and issue queries like
>>> >>> http://solr_server:8983/solr/collection/select?q=whatever&distrib=false
>>> >>>
>>> >>> The &distrib=false bit will keep SolrCloud from trying to send the
>>> >>> queries anywhere; they'll be served only from the node you address
>>> >>> them to. That'll help check whether the nodes are consistent. You
>>> >>> should be getting back the same results from each replica in a
>>> >>> shard (i.e. 2 of your 6 machines).
>>> >>>
>>> >>> Next, try your failing query the same way.
>>> >>>
>>> >>> Next, try your failing query from a browser, pointing it at successive
>>> >>> nodes.
>>> >>>
>>> >>> Where is the first place problems show up?
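The per-node addressing pattern above can be sketched like this; the node names are placeholders, and the helper simply builds the URL you would paste into a browser or curl:

```python
def direct_query_url(node, collection, q, port=8983, distrib=False):
    """Build a query URL aimed at one node; distrib=False keeps SolrCloud
    from fanning the request out to the other shards."""
    base = f"http://{node}:{port}/solr/{collection}/select?q={q}"
    return base if distrib else base + "&distrib=false"

# Hypothetical node names for a 6-node cluster check:
for node in ("solr1", "solr2", "solr3"):
    print(direct_query_url(node, "collection", "*:*"))
```

Comparing the responses from each URL is what tells you whether the problem is per-node data or the distributed fan-out.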
>>> >>>
>>> >>> My _guess_ is that your load balancer isn't quite doing what you
>>> >>> think, or your cluster isn't set up the way you think it is, but
>>> >>> those are guesses.
>>> >>>
>>> >>> Best,
>>> >>> Erick
>>> >>>
>>> >>> On Thu, Oct 2, 2014 at 2:51 PM, S.L <simpleliving016@gmail.com> wrote:
>>> >>> > Hi All,
>>> >>> >
>>> >>> > I am trying to query a 6-node Solr 4.7 cluster with 3 shards and
>>> >>> > a replication factor of 2.
>>> >>> >
>>> >>> > I have fronted these 6 Solr nodes with a load balancer. What I
>>> >>> > notice is that every time I do a search of the form
>>> >>> > q=*:*&fq=(id:9e78c064-919f-4ef3-b236-dc66351b4acf) it gives me a
>>> >>> > result only once in every 3 tries, telling me that the load
>>> >>> > balancer is distributing the requests between the 3 shards and
>>> >>> > SolrCloud only returns a result if the request goes to the core
>>> >>> > that has that id.
>>> >>> >
>>> >>> > However, if I do a simple search like q=*:*, I consistently get
>>> >>> > the right aggregated results back for all the documents across
>>> >>> > all the shards for every request through the load balancer. Can
>>> >>> > someone please let me know what this is symptomatic of?
>>> >>> >
>>> >>> > Somehow SolrCloud seems to be doing search query distribution and
>>> >>> > aggregation for queries of type *:* only.
>>> >>> >
>>> >>> > Thanks.
>>> >>>
>>> >>
>>> >>
>>>
