lucene-solr-user mailing list archives

From Jessica Mallet <mewmewb...@gmail.com>
Subject Re: Num docs, block join, and dupes?
Date Tue, 10 Mar 2015 18:31:52 GMT
We've seen this as well. Before we understood the cause, it seemed very
bizarre that hitting different nodes, or using a different rows=N, would
yield a different numFound (since the proxying node only de-dupes the
documents that are returned in the response).
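
To make the rows=N effect concrete, here is a minimal Python sketch of that
behavior. It is a simplified model, not Solr's actual merge code: the
function name merged_num_found is hypothetical, and it assumes the proxying
node sums the per-shard numFound values and subtracts one for each duplicate
id it sees among the documents the shards actually return. The ids are the
ones from Tim's example quoted below.

```python
def merged_num_found(shards, rows):
    """Simplified model (an assumption, not Solr source): sum the per-shard
    numFound, then subtract one per duplicate id seen among returned docs."""
    num_found = sum(len(doc_ids) for doc_ids in shards)
    seen = set()
    for doc_ids in shards:
        # Each shard only sends back its top `rows` documents.
        for doc_id in doc_ids[:rows]:
            if doc_id in seen:
                num_found -= 1  # duplicate is visible, so it gets subtracted
            else:
                seen.add(doc_id)
    return num_found


# Document ids per shard, taken from the example in the quoted message.
shard1 = ["1-1", "1-2", "1"]
shard2 = ["2-1-1", "2-1-1", "2-1-1", "2-1", "2"]

for rows in (1, 2, 10):
    print(rows, merged_num_found([shard1, shard2], rows))
```

Under this model rows=10 gives 6, matching the distributed response below,
while smaller values of rows leave some duplicates unseen and give a larger
numFound.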

I think "consistency" and "correctness" should be clearly delineated. Of
course we'd rather have consistently correct results, but failing that, I'd
rather have consistently incorrect results than inconsistent ones, because
inconsistency makes the problem even harder to debug, as was the case here.

I think either the node hosting the shard should also do the de-duping, or
no one should. It's strange that the proxying node does a sketchy de-dupe
that is limited to the returned result set.
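
As a rough illustration of why either option would be consistent, the
Python sketch below counts documents under both policies. It is purely
hypothetical (total_num_found is not a Solr API), and it assumes that
duplicate ids only occur within a single shard, as in Tim's example.

```python
def total_num_found(shards, dedupe_on_shard):
    """Hypothetical policy model: each shard reports its own count and the
    proxying node only sums them, so `rows` never enters the calculation."""
    total = 0
    for doc_ids in shards:
        # De-duping on the shard counts distinct ids; otherwise count all docs.
        total += len(set(doc_ids)) if dedupe_on_shard else len(doc_ids)
    return total


# Document ids per shard, taken from the example in the quoted message.
shards = [
    ["1-1", "1-2", "1"],
    ["2-1-1", "2-1-1", "2-1-1", "2-1", "2"],
]

print(total_num_found(shards, dedupe_on_shard=True))   # shards de-dupe
print(total_num_found(shards, dedupe_on_shard=False))  # no one de-dupes
```

Either policy yields one fixed number for this data (6 or 8) regardless of
which node is queried or what rows=N is.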

On Tue, Mar 10, 2015 at 9:09 AM, Timothy Potter <thelabdude@gmail.com>
wrote:
>
> Before I open a JIRA, I wanted to put this out to solicit feedback on what
> I'm seeing and what Solr should be doing. So I've indexed the following 8
> docs into a 2-shard collection (Solr 4.8'ish - internal custom branch
> roughly based on 4.8) ... notice that the 3 grand-children of 2-1 have
> dup'd keys:
>
> [
>   {
>     "id":"1",
>     "name":"parent",
>     "_childDocuments_":[
>       {
>         "id":"1-1",
>         "name":"child"
>       },
>       {
>         "id":"1-2",
>         "name":"child"
>       }
>     ]
>   },
>   {
>     "id":"2",
>     "name":"parent",
>     "_childDocuments_":[
>       {
>         "id":"2-1",
>         "name":"child",
>         "_childDocuments_":[
>           {
>             "id":"2-1-1",
>             "name":"grandchild"
>           },
>           {
>             "id":"2-1-1",
>             "name":"grandchild2"
>           },
>           {
>             "id":"2-1-1",
>             "name":"grandchild3"
>           }
>         ]
>       }
>     ]
>   }
> ]
>
> When I query this collection, using:
>
>
> http://localhost:8984/solr/blockjoin2_shard2_replica1/select?q=*%3A*&wt=json&indent=true&shards.info=true&rows=10
>
> I get:
>
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":9,
>     "params":{
>       "indent":"true",
>       "q":"*:*",
>       "shards.info":"true",
>       "wt":"json",
>       "rows":"10"}},
>   "shards.info":{
>     "http://localhost:8984/solr/blockjoin2_shard1_replica1/|http://localhost:8985/solr/blockjoin2_shard1_replica2/":{
>       "numFound":3,
>       "maxScore":1.0,
>       "shardAddress":"http://localhost:8984/solr/blockjoin2_shard1_replica1",
>       "time":4},
>     "http://localhost:8984/solr/blockjoin2_shard2_replica1/|http://localhost:8985/solr/blockjoin2_shard2_replica2/":{
>       "numFound":5,
>       "maxScore":1.0,
>       "shardAddress":"http://localhost:8985/solr/blockjoin2_shard2_replica2",
>       "time":4}},
>   "response":{"numFound":6,"start":0,"maxScore":1.0,"docs":[
>       {
>         "id":"1-1",
>         "name":"child"},
>       {
>         "id":"1-2",
>         "name":"child"},
>       {
>         "id":"1",
>         "name":"parent",
>         "_version_":1495272401329455104},
>       {
>         "id":"2-1-1",
>         "name":"grandchild"},
>       {
>         "id":"2-1",
>         "name":"child"},
>       {
>         "id":"2",
>         "name":"parent",
>         "_version_":1495272401361960960}]
>   }}
>
>
> So Solr has de-duped the results.
>
> If I execute this query against the shard that has the dupes
> (distrib=false):
>
>
> http://localhost:8984/solr/blockjoin2_shard2_replica1/select?q=*%3A*&wt=json&indent=true&shards.info=true&rows=10&distrib=false
>
> Then the dupes are returned:
>
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":0,
>     "params":{
>       "indent":"true",
>       "q":"*:*",
>       "shards.info":"true",
>       "distrib":"false",
>       "wt":"json",
>       "rows":"10"}},
>   "response":{"numFound":5,"start":0,"docs":[
>       {
>         "id":"2-1-1",
>         "name":"grandchild"},
>       {
>         "id":"2-1-1",
>         "name":"grandchild2"},
>       {
>         "id":"2-1-1",
>         "name":"grandchild3"},
>       {
>         "id":"2-1",
>         "name":"child"},
>       {
>         "id":"2",
>         "name":"parent",
>         "_version_":1495272401361960960}]
>   }}
>
> So I guess my question is why doesn't the non-distrib query do
> de-duping? Mainly confirming this is how it's supposed to work and
> this behavior doesn't strike anyone else as odd ;-)
>
> Cheers,
>
> Tim
