MIME-Version: 1.0
Date: Tue, 10 Mar 2015 11:31:52 -0700
Subject: Re: Num docs, block join, and dupes?
From: Jessica Mallet
To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=089e0160adb656ab920510f35d5b

--089e0160adb656ab920510f35d5b
Content-Type: text/plain; charset=UTF-8

We've seen this as well. Before we understood the cause, it seemed very bizarre that hitting different nodes would yield a different numFound, as would using a different rows=N (since the proxying node only de-dupes the documents that are returned in the response).

I think "consistency" and "correctness" should be clearly delineated. Of course we'd rather have consistently correct results, but failing that, I'd rather have consistently incorrect results than inconsistent ones, because otherwise it's even harder to debug, as was the case here.

I think either the node hosting the shard should also do the de-duping, or no one should. It's strange that the proxying node decides to do some sketchy de-dupe over a limited result set.

On Tue, Mar 10, 2015 at 9:09 AM, Timothy Potter wrote:
>
> Before I open a JIRA, I wanted to put this out to solicit feedback on what
> I'm seeing and what Solr should be doing. So I've indexed the following 8
> docs into a 2-shard collection (Solr 4.8'ish - internal custom branch
> roughly based on 4.8) ...
> notice that the 3 grand-children of 2-1 have dup'd keys:
>
> [
>   {
>     "id":"1",
>     "name":"parent",
>     "_childDocuments_":[
>       {
>         "id":"1-1",
>         "name":"child"
>       },
>       {
>         "id":"1-2",
>         "name":"child"
>       }
>     ]
>   },
>   {
>     "id":"2",
>     "name":"parent",
>     "_childDocuments_":[
>       {
>         "id":"2-1",
>         "name":"child",
>         "_childDocuments_":[
>           {
>             "id":"2-1-1",
>             "name":"grandchild"
>           },
>           {
>             "id":"2-1-1",
>             "name":"grandchild2"
>           },
>           {
>             "id":"2-1-1",
>             "name":"grandchild3"
>           }
>         ]
>       }
>     ]
>   }
> ]
>
> When I query this collection, using:
>
> http://localhost:8984/solr/blockjoin2_shard2_replica1/select?q=*%3A*&wt=json&indent=true&shards.info=true&rows=10
>
> I get:
>
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":9,
>     "params":{
>       "indent":"true",
>       "q":"*:*",
>       "shards.info":"true",
>       "wt":"json",
>       "rows":"10"}},
>   "shards.info":{
>     "http://localhost:8984/solr/blockjoin2_shard1_replica1/|http://localhost:8985/solr/blockjoin2_shard1_replica2/":{
>       "numFound":3,
>       "maxScore":1.0,
>       "shardAddress":"http://localhost:8984/solr/blockjoin2_shard1_replica1",
>       "time":4},
>     "http://localhost:8984/solr/blockjoin2_shard2_replica1/|http://localhost:8985/solr/blockjoin2_shard2_replica2/":{
>       "numFound":5,
>       "maxScore":1.0,
>       "shardAddress":"http://localhost:8985/solr/blockjoin2_shard2_replica2",
>       "time":4}},
>   "response":{"numFound":6,"start":0,"maxScore":1.0,"docs":[
>       {
>         "id":"1-1",
>         "name":"child"},
>       {
>         "id":"1-2",
>         "name":"child"},
>       {
>         "id":"1",
>         "name":"parent",
>         "_version_":1495272401329455104},
>       {
>         "id":"2-1-1",
>         "name":"grandchild"},
>       {
>         "id":"2-1",
>         "name":"child"},
>       {
>         "id":"2",
>         "name":"parent",
>         "_version_":1495272401361960960}]
>   }}
>
>
> So Solr has de-duped the results.
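
To make the arithmetic concrete: the two shards report numFound 3 and 5 (8 in total), yet the merged response says 6. Here is a rough Python model of that behavior. It is only a sketch under stated assumptions, not Solr's code: it assumes the merging node sums the per-shard numFound values and subtracts one for every repeated id it sees among the documents the shards actually return, which is what the numbers here suggest. The names `shard_query` and `merged_query` are made up for illustration.

```python
# Simplified, hypothetical model of a distributed merge; NOT Solr's implementation.
# Documents as they sit on each shard in the example (shard 2 holds the dupes).
SHARD1 = [
    {"id": "1-1", "name": "child"},
    {"id": "1-2", "name": "child"},
    {"id": "1", "name": "parent"},
]
SHARD2 = [
    {"id": "2-1-1", "name": "grandchild"},
    {"id": "2-1-1", "name": "grandchild2"},
    {"id": "2-1-1", "name": "grandchild3"},
    {"id": "2-1", "name": "child"},
    {"id": "2", "name": "parent"},
]


def shard_query(docs, rows):
    """Shard-local query (distrib=false style): no de-duping at all."""
    return {"numFound": len(docs), "docs": docs[:rows]}


def merged_query(shards, rows):
    """Merge per-shard responses, de-duping only the docs that were returned."""
    num_found = 0
    seen_ids = set()
    merged = []
    for docs in shards:
        response = shard_query(docs, rows)
        num_found += response["numFound"]  # start from the shard's own count
        for doc in response["docs"]:
            if doc["id"] in seen_ids:
                num_found -= 1  # assumed: each dropped duplicate lowers the total
                continue
            seen_ids.add(doc["id"])
            merged.append(doc)
    return {"numFound": num_found, "docs": merged[:rows]}


if __name__ == "__main__":
    for rows in (1, 2, 10):
        print(rows, merged_query([SHARD1, SHARD2], rows)["numFound"])
```

Under these assumptions, rows=10 gives numFound 6 as in the response above, while a smaller rows value exposes fewer duplicates to the merging node and so yields a larger numFound. A shard-local call never drops anything, so it keeps reporting 5 for the second shard.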
>
> If I execute this query against the shard that has the dupes
> (distrib=false):
>
> http://localhost:8984/solr/blockjoin2_shard2_replica1/select?q=*%3A*&wt=json&indent=true&shards.info=true&rows=10&distrib=false
>
> Then the dupes are returned:
>
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":0,
>     "params":{
>       "indent":"true",
>       "q":"*:*",
>       "shards.info":"true",
>       "distrib":"false",
>       "wt":"json",
>       "rows":"10"}},
>   "response":{"numFound":5,"start":0,"docs":[
>       {
>         "id":"2-1-1",
>         "name":"grandchild"},
>       {
>         "id":"2-1-1",
>         "name":"grandchild2"},
>       {
>         "id":"2-1-1",
>         "name":"grandchild3"},
>       {
>         "id":"2-1",
>         "name":"child"},
>       {
>         "id":"2",
>         "name":"parent",
>         "_version_":1495272401361960960}]
>   }}
>
> So I guess my question is why doesn't the non-distrib query do
> de-duping? Mainly confirming this is how it's supposed to work and
> this behavior doesn't strike anyone else as odd ;-)
>
> Cheers,
>
> Tim

--089e0160adb656ab920510f35d5b--