lucene-dev mailing list archives

From "Erick Erickson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-9583) When the same <uniqueKey> exists across multiple collections that are searched with an alias, the document returned in the results list is indeterminate
Date Fri, 30 Sep 2016 16:20:20 GMT

    [ https://issues.apache.org/jira/browse/SOLR-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536384#comment-15536384 ]

Erick Erickson commented on SOLR-9583:
--------------------------------------

[~dsmiley]

I disagree and think there's a bug here. I can be persuaded that there are two issues, though;
maybe we can split this JIRA.

Bug:
In the situation I described above, we return one doc or the other, and currently it's indeterminate
which one comes back. In fact, the one that comes back will change for the _exact_ same query
without the underlying collections changing at all just by resubmitting the query (I turned
the queryResultCache off and can reproduce at will). This is even true in a one-shard, leader-only
pair of collections. You'll have to argue really hard to persuade me that this is correct
behavior. It's certainly not satisfactory to say to a user "we have no idea which one will
be returned and there's nothing you can do about it, don't even try".

bq: ...it's asking for trouble. Solr isn't supposed to be used this way.

I don't understand this. We allow collection aliasing. There are no rules whatsoever requiring
that multiple collections have disjoint <uniqueKey>s. Arbitrarily returning only one is hard
to justify.

Wish:
We add the ability to return all docs with the same ID when multiple collections have docs
with the same ID under control of some flag.


[~noble.paul]

Not quite sure I understand the question. We "dedupe" currently, but it's arbitrary. I doubt
it was designed; rather, it "just happens" as a side-effect of merging the lists. My suspicion
is that when we merge the results, the final result changes based on the order in which the
collection returns are processed. But before diving into the code I wanted to get some idea
of what we think _should_ happen.

We at least should dedupe in a predictable fashion. What the algorithm should be is up for
discussion. Perhaps "doc from last collection listed in the alias wins" (yuck, frankly, but
at least I can explain it to someone). Or maybe "break ties by comparing the collection name"
(also yuck). Or we have to use the sort criteria. Or.... I don't want to get complicated here,
just predictable.
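
To make the "doc from last collection listed in the alias wins" option concrete, here is
a minimal sketch in plain Java. It is not Solr's merge code: AliasDedupeSketch, Hit, and
dedupeLastListedWins are hypothetical names invented for illustration, and the sketch assumes
each id occurs at most once per collection. It only shows that the winner can be made
independent of the order in which the per-collection responses arrive.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AliasDedupeSketch {

    /** One search hit tagged with its source collection (hypothetical type, not a Solr class). */
    static final class Hit {
        final String collection;
        final String id;
        final float score;

        Hit(String collection, String id, float score) {
            this.collection = collection;
            this.id = id;
            this.score = score;
        }
    }

    /**
     * Keeps one hit per id. When two hits share an id, the one whose collection
     * appears later in aliasOrder wins, whatever order the hits arrived in.
     * A collection missing from aliasOrder has index -1 and loses to any listed one.
     */
    static List<Hit> dedupeLastListedWins(List<String> aliasOrder, List<Hit> hits) {
        Map<String, Hit> winners = new HashMap<>();
        for (Hit hit : hits) {
            winners.merge(hit.id, hit, (kept, candidate) ->
                aliasOrder.indexOf(candidate.collection) > aliasOrder.indexOf(kept.collection)
                    ? candidate : kept);
        }
        // Order the survivors by score (descending), then id, so the output
        // list is also independent of arrival order.
        List<Hit> result = new ArrayList<>(winners.values());
        result.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : a.id.compareTo(b.id);
        });
        return result;
    }
}
```

With an alias listing C1 then C2 and the same id present in both, this returns the C2 hit
whether the C1 or the C2 response is processed first. Swapping the comparison for a
collection-name comparison would give the "break ties by comparing the collection name" variant.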

If we decide to return multiple docs with the same ID from separate collections then there's
the whole question of how to sort them, but I'll leave that for another day. Maybe we just
use whatever we use to dedupe as the sort in this case.
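
If both docs are returned, reusing the dedupe rule as a sort tie-break could look like the
sketch below. Again this is illustrative only: AliasKeepAllSketch, Hit, and
sortKeepingDuplicates are hypothetical names, not Solr APIs, and score stands in for
whatever the real sort criteria are. Hits are ordered by score, then id, and hits that
still tie are ordered so the collection listed last in the alias comes first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class AliasKeepAllSketch {

    /** One search hit tagged with its source collection (hypothetical type, not a Solr class). */
    static final class Hit {
        final String collection;
        final String id;
        final float score;

        Hit(String collection, String id, float score) {
            this.collection = collection;
            this.id = id;
            this.score = score;
        }
    }

    /**
     * Returns every hit, duplicates included, in an order that does not depend
     * on arrival order: score descending, then id, then position in the alias
     * with the last-listed collection first.
     */
    static List<Hit> sortKeepingDuplicates(List<String> aliasOrder, List<Hit> hits) {
        Comparator<Hit> byScoreDesc = (a, b) -> Float.compare(b.score, a.score);
        Comparator<Hit> lastListedFirst =
            Comparator.comparingInt((Hit h) -> aliasOrder.indexOf(h.collection)).reversed();
        Comparator<Hit> order = byScoreDesc
            .thenComparing(h -> h.id)
            .thenComparing(lastListedFirst);

        List<Hit> result = new ArrayList<>(hits);
        result.sort(order);
        return result;
    }
}
```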

> When the same <uniqueKey> exists across multiple collections that are searched with an alias, the document returned in the results list is indeterminate
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-9583
>                 URL: https://issues.apache.org/jira/browse/SOLR-9583
>             Project: Solr
>          Issue Type: Wish
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Erick Erickson
>
> Not quite sure whether to call this a bug or improvement...
> Currently if I have two collections C1 and C2 and an alias that points to both _and_ I have a document in both collections with the _same_ <uniqueKey>, the returned list sometimes has the doc from C1 and sometimes from C2.
> If I add shards.info=true I see the document found in each collection, but only one in the document list. Which one appears changes if I resubmit the identical query.
> This seems incorrect, perhaps a side effect of piggy-backing the collection aliasing on searching multiple shards? (Thanks Shalin for that bit of background).
> I can see both use-cases:
> 1> aliasing multiple collections validly assumes that <uniqueKey>s should be unique across them all and only one doc should be returned. Even in this case, which doc should be returned should be deterministic.
> 2> these are arbitrary collections without any a priori relationship and identical <uniqueKey>s do NOT identify the "same" document, so both should be returned.
> So I propose we do two things:
> a> provide a param for the CREATEALIAS command that controls whether docs with the same <uniqueKey> from different collections should both be returned. If they both should, there's still the question of in what order.
> b> provide a deterministic way dups from different collections are resolved. What that algorithm is I'm not quite sure. The order the collections were specified in the CREATEALIAS command? Some field in the documents? Other??? What happens if this option is not specified on the CREATEALIAS command?
> Implicit in the above is my assumption that it's perfectly valid to have different aliases in the same cluster behave differently if specified.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
