lucene-solr-user mailing list archives

From Valery Giner <valgi...@research.att.com>
Subject Re: Distributed query: strange behavior.
Date Tue, 28 May 2013 15:34:31 GMT
Erick,

Thank you for the explanation.

My problem was that if docs with the same unique id are allowed to be
present in multiple shards in a "normal" situation, it becomes
impossible to estimate the number of shards needed for an index with a
"really large" number of docs.

Thanks,
Val

On 05/26/2013 11:16 AM, Erick Erickson wrote:
> Valery:
>
> I share your puzzlement. _If_ you are letting Solr do the document
> routing, and not doing any of the custom routing, then the same unique
> key should be going to the same shard and replacing the previous doc
> with that key.
>
> But if you're using custom routing, or if you've been experimenting with
> different configurations and didn't start over, or in general if your
> configuration is in an "interesting" state, this could happen.
>
> So in the normal case if you have a document with the same key indexed
> in multiple shards, that would indicate a bug. But there are many
> ways, especially when experimenting, that you could have this happen
> which are _not_ a bug. I'm guessing that Luis may be trying the custom
> routing option?
>
> Best
> Erick
>
> On Fri, May 24, 2013 at 9:09 AM, Valery Giner <valginer@research.att.com> wrote:
>> Shawn,
>>
>> How is it possible for more than one document with the same unique key to
>> appear in the index, even in different shards?
>> Isn't it a bug by definition?
>> What am I missing here?
>>
>> Thanks,
>> Val
>>
>>
>> On 05/23/2013 09:55 AM, Shawn Heisey wrote:
>>> On 5/23/2013 1:51 AM, Luis Cappa Banda wrote:
>>>> I've queried each Solr shard server one by one and the total number of
>>>> documents is correct. However, when I change the rows parameter from 10
>>>> to 100, the total numFound of documents changes:
>>> I've seen this problem on the list before, and each time the cause has
>>> been determined to be documents with the same uniqueKey value appearing
>>> in more than one shard.
>>>
>>> What I think happens here:
>>>
>>> With rows=10, you get the top ten docs from each of the three shards,
>>> and each shard sends its numFound for that query to the core that's
>>> coordinating the search.  The coordinator adds up numFound, looks
>>> through those thirty docs, and arranges them according to the requested
>>> sort order, returning only the top 10.  In this case, there happen to be
>>> no duplicates.
>>>
>>> With rows=100, you get a total of 300 docs.  This time, duplicates are
>>> found and removed by the coordinator.  I think that the coordinator
>>> adjusts the total numFound by the number of duplicate documents it
>>> removed, in an attempt to be more accurate.
>>>
>>> I don't know if adjusting numFound when duplicates are found in a
>>> sharded query is the right thing to do, I'll leave that for smarter
>>> people.  Perhaps Solr should return a message with the results saying
>>> that duplicates were found, and if a config option is not enabled, the
>>> server should throw an exception and return a 4xx HTTP error code.  One
>>> idea for a config parameter name would be allowShardDuplicates, but
>>> something better can probably be found.
>>>
>>> Thanks,
>>> Shawn
>>>

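Shawn's hypothesis about why numFound changes with rows can be illustrated with a small simulation. This is only a sketch of the behavior he describes ("I think that the coordinator adjusts the total numFound"), not Solr's actual merge code; the function names, the dict-based response shape, and the sample documents are all made up for illustration.

```python
def query_shard(shard_docs, rows):
    """Hypothetical shard: reports its full match count plus its top `rows` docs."""
    ranked = sorted(shard_docs, key=lambda d: d["score"], reverse=True)
    return {"numFound": len(shard_docs), "docs": ranked[:rows]}


def merge_shard_responses(shard_responses, rows):
    """Hypothetical coordinator, following the behavior guessed at in the thread."""
    # Start from the sum of every shard's own numFound.
    num_found = sum(resp["numFound"] for resp in shard_responses)
    seen = {}
    for resp in shard_responses:
        for doc in resp["docs"]:
            if doc["id"] in seen:
                # A duplicate uniqueKey is only noticed if it falls inside
                # the window of docs the shards actually returned.
                num_found -= 1
                continue
            seen[doc["id"]] = doc
    top = sorted(seen.values(), key=lambda d: d["score"], reverse=True)[:rows]
    return num_found, top


def distributed_query(shards, rows):
    return merge_shard_responses([query_shard(s, rows) for s in shards], rows)


# Two shards that both hold a low-scoring doc with the same id ("dup").
shard_a = [{"id": "a1", "score": 9.0}, {"id": "a2", "score": 8.0},
           {"id": "dup", "score": 1.0}]
shard_b = [{"id": "b1", "score": 7.0}, {"id": "b2", "score": 6.0},
           {"id": "dup", "score": 1.0}]

small_total, _ = distributed_query([shard_a, shard_b], rows=2)
large_total, _ = distributed_query([shard_a, shard_b], rows=3)
print(small_total, large_total)
```

With rows=2 the duplicate never reaches the coordinator, so the reported total is the plain sum of the shard counts; with rows=3 the duplicate is seen and the total drops by one, which matches the symptom Luis reported.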

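Erick's point about routing can be sketched the same way: if the shard is always derived from the unique key, re-indexing a key replaces the earlier doc, while choosing the shard by hand can leave copies in several shards. The sketch below assumes a simple hash-modulo scheme with CRC32 as a stand-in; it is not Solr's actual router, and `ShardedIndex`, `shard_for`, and the `shard` argument are hypothetical names.

```python
import zlib


def shard_for(unique_key, num_shards):
    # Stand-in hash (CRC32); assumed for illustration, not Solr's real routing.
    return zlib.crc32(unique_key.encode("utf-8")) % num_shards


class ShardedIndex:
    """Toy index: one dict per shard, keyed by the doc's unique key."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def add(self, doc, shard=None):
        # Default routing derives the shard from the key; passing `shard`
        # models custom routing chosen by the client.
        if shard is None:
            shard = shard_for(doc["id"], len(self.shards))
        self.shards[shard][doc["id"]] = doc

    def copies_of(self, unique_key):
        return sum(1 for s in self.shards if unique_key in s)


index = ShardedIndex(num_shards=3)

# Default routing: the second add lands on the same shard and overwrites.
index.add({"id": "doc-1", "title": "first version"})
index.add({"id": "doc-1", "title": "second version"})
copies_default = index.copies_of("doc-1")

# Custom routing: the client sends the same key to a different shard.
other_shard = (shard_for("doc-1", 3) + 1) % 3
index.add({"id": "doc-1", "title": "third version"}, shard=other_shard)
copies_custom = index.copies_of("doc-1")

print(copies_default, copies_custom)
```

Changing the number of shards without re-indexing has a similar effect in this toy model, because the same key can then map to a different shard than the one holding the old copy.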