lucene-dev mailing list archives

From "Per Steffensen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3765) Wrong handling of documents with same id in cross collection searches
Date Thu, 30 Aug 2012 19:18:08 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445210#comment-13445210
] 

Per Steffensen commented on SOLR-3765:
--------------------------------------

No problem. Glad to help. 

We will not be working on a fix. We will do a workaround in our own application so that we
do not have id clashes across collections. We need to control ids very strictly in order for
our fail-on-unique-key-constraint-violation to serve its purpose correctly. Basically we just
prefix our ids with the name of the collection. That still gives us a unique-key clash within
a collection, but documents that have the same id apart from the collection-name part are no
longer prevented from being returned/counted.
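
A minimal sketch of this kind of id prefixing, written in Python for illustration only. It is
not the application code referred to above: the helper names `collection_scoped_id` and
`split_scoped_id` and the `:` separator are assumptions made for the example.

```python
from typing import Tuple

# Hypothetical separator; assumed never to occur in a collection name.
SEPARATOR = ":"


def collection_scoped_id(collection: str, doc_id: str) -> str:
    """Prefix the document id with its collection name.

    Ids stay unique within a collection (same prefix, same original id),
    while equal original ids in different collections no longer collide.
    """
    if SEPARATOR in collection:
        raise ValueError(f"collection name must not contain {SEPARATOR!r}")
    return f"{collection}{SEPARATOR}{doc_id}"


def split_scoped_id(scoped_id: str) -> Tuple[str, str]:
    """Recover (collection, original id) from a prefixed id."""
    collection, _, doc_id = scoped_id.partition(SEPARATOR)
    return collection, doc_id
```

With this scheme, the original id "42" indexed in collections x and y becomes "x:42" and
"y:42", so a cross-collection search no longer sees the two documents as the same one.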
                
> Wrong handling of documents with same id in cross collection searches
> ---------------------------------------------------------------------
>
>                 Key: SOLR-3765
>                 URL: https://issues.apache.org/jira/browse/SOLR-3765
>             Project: Solr
>          Issue Type: Bug
>          Components: search, SolrCloud
>    Affects Versions: 4.0
>         Environment: Self-built version of Solr from the 4.x branch (revision )
>            Reporter: Per Steffensen
>              Labels: collections, inconsistency, numFound, search
>
> Dialog with myself from the solr-users mailing list:
> Per Steffensen wrote:
> {quote} 
> Hi
> Due to what we have seen in recent tests, I am in doubt about how Solr search is actually
> supposed to behave.
> * Searching with "distrib=true&q=*:*&rows=10&collection=x,y,z&sort=timestamp asc"
> ** Is Solr supposed to return the 10 documents with the lowest timestamp across all documents
> in all slices of collections x, y and z, or is it supposed to just pick 10 random documents
> from those slices and sort only those 10 randomly selected documents?
> ** Put another way: is this search supposed to be consistent, returning exactly the same set
> of documents when performed several times (with no documents updated between consecutive
> searches)?
> {quote}
> Fortunately I believe the answer is that it ought to "return the 10 documents with the
> lowest timestamp across all documents in all slices of collection x, y and z". The reason
> I asked was that I got different responses for consecutive similar requests. Now I believe
> it can be explained by the bug described below. I guess that when you do cross-collection/shard
> searches, the "request-handling" Solr forwards the query to all involved shards simultaneously
> and merges the sub-results into the final result as they are returned from the shards. Because
> of the "consider documents with the same id as the same document even though they come from
> different collections" bug, it is more or less random (depending on which shards respond
> first/last), for a given id, which collection the document with that specific id is taken from.
> And if documents with the same id from different collections have different timestamps, it is
> random where that document ends up in the final sorted result.
> So I believe this inconsistency can be explained by the bug described below.
> {quote}
> * A search returns a "numFound" field telling how many documents in total match the
> search criteria, even though not all of those documents are returned by the search. It is a
> crazy question to ask, but I will ask it anyway because we actually see a problem with this.
> Isn't it correct that two searches which differ only in the "rows" number (documents to be
> returned) should always return the same value for "numFound"?
> {quote}
> Well I found out myself what the problem is (or seems to be) - see:
> http://lucene.472066.n3.nabble.com/Changing-value-of-start-parameter-affects-numFound-td2460645.html
> http://lucene.472066.n3.nabble.com/numFound-inconsistent-for-different-rows-param-td3997269.html
> http://lucene.472066.n3.nabble.com/Solr-v3-5-0-numFound-changes-when-paging-through-results-on-8-shard-cluster-td3990400.html
> Until 4.0 this "bug" could be "ignored" because it was OK for a cross-shard search to
> consider documents with identical ids as duplicates and therefore return/count only one
> of them. It is still OK in 4.0 within the same collection, but across collections identical
> ids should not be considered duplicates and should not reduce the number of documents
> returned/counted. So I believe this "feature" has now become a bug in 4.0 when it comes to
> cross-collection searches.
> {quote}
> Thanks!
> Regards, Steff
> {quote}
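
The two symptoms described in the issue (result order depending on which shard answers first,
and numFound depending on rows) can be reproduced with a small simulation. The sketch below is
a simplified model in Python, not Solr's actual merge code. The `shard` and `merge` functions,
the sample documents, and the assumption that a duplicate is only subtracted from numFound when
both copies appear in the per-shard top lists are all illustrative.

```python
from typing import Any, Dict, List, Tuple

Doc = Dict[str, Any]


def shard(docs: List[Doc], rows: int) -> Tuple[int, List[Doc]]:
    """What one shard returns: its total hit count and its top `rows` docs by timestamp."""
    ordered = sorted(docs, key=lambda d: d["timestamp"])
    return len(docs), ordered[:rows]


def merge(responses: List[Tuple[int, List[Doc]]], rows: int) -> Tuple[int, List[Doc]]:
    """Merge shard responses in arrival order, deduplicating by id alone.

    The first document seen for an id wins; every later document with the
    same id is dropped and lowers numFound by one.
    """
    num_found = 0
    by_id: Dict[str, Doc] = {}
    for shard_num_found, top_docs in responses:
        num_found += shard_num_found
        for doc in top_docs:
            if doc["id"] in by_id:
                num_found -= 1  # only noticed if both copies were returned
            else:
                by_id[doc["id"]] = doc
    merged = sorted(by_id.values(), key=lambda d: d["timestamp"])
    return num_found, merged[:rows]


# Two collections that both contain a document with id "1".
x = [{"id": "1", "timestamp": 10, "collection": "x"},
     {"id": "2", "timestamp": 20, "collection": "x"}]
y = [{"id": "1", "timestamp": 30, "collection": "y"},
     {"id": "3", "timestamp": 15, "collection": "y"}]

x_first = merge([shard(x, 10), shard(y, 10)], rows=10)
y_first = merge([shard(y, 10), shard(x, 10)], rows=10)
one_row = merge([shard(x, 1), shard(y, 1)], rows=1)
```

In this model, `x_first` lists the ids in the order 1, 3, 2 while `y_first` lists them as
3, 2, 1, because id "1" is taken from whichever collection responded first. numFound is 3 with
rows=10 but 4 with rows=1, because with rows=1 the second copy of id "1" never reaches the
merge step and so is never subtracted.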


