lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches
Date Mon, 26 Feb 2018 19:13:24 GMT
Did you try enabling distributed IDF (statsCache)? See:
https://lucene.apache.org/solr/guide/6_6/distributed-requests.html

It may not totally fix the issue, but it's worth trying. It does
come with a performance penalty, of course.

Best,
Erick
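
The effect Shawn describes in the quoted message below can be sketched in a
few lines of Python. This is only an illustration under stated assumptions,
not Solr code: the documents, the term statistics, and the helper names
(idf, ranked_ids, page) are all hypothetical, and the score is a simplified
tf * idf sum rather than full BM25. It assumes that deleted documents which
have not yet been merged away still count toward a replica's docFreq and
maxDoc, so two NRT replicas holding the same live documents can rank them
differently.

```python
import math


def idf(doc_freq, max_doc):
    """BM25-style inverse document frequency."""
    return math.log(1 + (max_doc - doc_freq + 0.5) / (doc_freq + 0.5))


# Live documents, identical on every replica: id -> term frequencies.
LIVE_DOCS = {
    "d1": {"solr": 2},
    "d2": {"cloud": 2},
    "d3": {"solr": 1, "cloud": 1},
    "d4": {"solr": 1},
}


def ranked_ids(doc_freqs, max_doc, query=("solr", "cloud")):
    """Rank live docs by score desc, id asc, using one replica's term stats."""
    def score(tfs):
        return sum(tfs.get(t, 0) * idf(doc_freqs[t], max_doc) for t in query)
    return sorted(LIVE_DOCS, key=lambda d: (-score(LIVE_DOCS[d]), d))


def page(ranking, start, rows):
    """Return one page of a ranking, like start/rows in a query."""
    return ranking[start:start + rows]


# Hypothetical per-replica stats; they differ only because each replica
# still carries a different set of deleted documents.
replica_a = ranked_ids({"solr": 50, "cloud": 400}, max_doc=1000)
replica_b = ranked_ids({"solr": 400, "cloud": 50}, max_doc=1000)

# Page 1 served by replica A, page 2 served by replica B.
mixed = page(replica_a, 0, 2) + page(replica_b, 2, 2)
# Both pages served from identical statistics.
same = page(replica_a, 0, 2) + page(replica_a, 2, 2)

print(mixed)  # d1 appears twice and d2 never appears
print(same)   # every document appears exactly once
```

The id tie breaker does not help here, because the scores themselves
differ between replicas. When both pages are computed from the same
statistics, as Shawn says is the case for TLOG or PULL replicas, each
document shows up exactly once.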

On Mon, Feb 26, 2018 at 11:00 AM, Webster Homer <webster.homer@sial.com> wrote:
> Thanks Shawn, I had settled on this as a solution.
>
> All our use cases for Solr return results in order of relevancy to
> the query, so having a deterministic sort would defeat that purpose. Since
> we wanted to be able to return all the results for a query, I originally
> looked at using the Streaming API, but that doesn't support returning
> results sorted by relevancy.
>
> I disagree with you about NRT replicas though. They may function as
> designed, but since they cannot guarantee consistent results their design
> is buggy, at least it is for a search engine.
>
>
> On Mon, Feb 26, 2018 at 12:20 PM, Shawn Heisey <apache@elyograg.org> wrote:
>
>> On 2/26/2018 10:26 AM, Webster Homer wrote:
>> > We need the results by relevancy so the application sorts the results by
>> > score desc, and the unique id ascending as the tie breaker
>>
>> This is the reason for the discrepancy, and why the different replica
>> types don't have the same issue.
>>
>> Each NRT replica can have different deleted documents than the others,
>> just due to the way that NRT replicas work.  Deleted documents affect
>> relevancy scoring.  When one replica has, say, 5000 deleted documents and
>> another has 200, or has 5000 but they're different docs, a relevancy
>> sort can end up different.  So when Solr goes to one replica for page 1
>> and another for page 2 (which is expected due to SolrCloud's internal
>> load balancing), you may end up with duplicate documents or documents
>> missing.  Because deleted documents are not counted or returned,
>> numFound will be consistent, as long as the index doesn't change between
>> the queries for pages.
>>
>> If you were using a deterministic sort rather than relevancy, this
>> wouldn't be happening, because deleted documents have no influence on
>> that kind of sort.
>>
>> With TLOG or PULL, the replicas are absolutely identical, so there is no
>> difference, unless the index is changing as you page through the results.
>>
>> I think changing replica types is the only solution here.  NRT replicas
>> are working as they were designed -- there's no bug, even though
>> problems like this do sometimes turn up.
>>
>> Thanks,
>> Shawn
>>
>>
