Message-ID: <51A4CE87.8090504@research.att.com>
Date: Tue, 28 May 2013 11:34:31 -0400
From: Valery Giner
To: solr-user@lucene.apache.org
Subject: Re:
Distributed query: strange behavior.
References: <519E1FC5.9070200@elyograg.org> <519F66A1.40109@research.att.com>

Erick,

Thank you for the explanation.

My problem was that allowing docs with the same unique IDs to be
present in multiple shards in a "normal" situation makes it impossible
to estimate the number of shards needed for an index with a "really
large" number of docs.

Thanks,
Val

On 05/26/2013 11:16 AM, Erick Erickson wrote:
> Valery:
>
> I share your puzzlement. _If_ you are letting Solr do the document
> routing, and not doing any of the custom routing, then the same unique
> key should be going to the same shard and replacing the previous doc
> with that key.
>
> But, if you're using custom routing, if you've been experimenting with
> different configurations and didn't start over, in general if your
> configuration is in an "interesting" state this could happen.
>
> So in the normal case if you have a document with the same key indexed
> in multiple shards, that would indicate a bug. But there are many
> ways, especially when experimenting, that you could have this happen
> which are _not_ a bug. I'm guessing that Luis may be trying the custom
> routing option maybe?
>
> Best
> Erick
>
> On Fri, May 24, 2013 at 9:09 AM, Valery Giner wrote:
>> Shawn,
>>
>> How is it possible for more than one document with the same unique key to
>> appear in the index, even in different shards?
>> Isn't it a bug by definition?
>> What am I missing here?
>>
>> Thanks,
>> Val
>>
>>
>> On 05/23/2013 09:55 AM, Shawn Heisey wrote:
>>> On 5/23/2013 1:51 AM, Luis Cappa Banda wrote:
>>>> I've queried each Solr shard server one by one and the total number of
>>>> documents is correct.
>>>> However, when I change the rows parameter from 10 to 100,
>>>> the total numFound of documents changes:
>>> I've seen this problem on the list before, and each time the cause has
>>> been determined to be documents with the same uniqueKey value
>>> appearing in more than one shard.
>>>
>>> What I think happens here:
>>>
>>> With rows=10, you get the top ten docs from each of the three shards,
>>> and each shard sends its numFound for that query to the core that's
>>> coordinating the search. The coordinator adds up numFound, looks
>>> through those thirty docs, and arranges them according to the requested
>>> sort order, returning only the top 10. In this case, there happen to be
>>> no duplicates.
>>>
>>> With rows=100, you get a total of 300 docs. This time, duplicates are
>>> found and removed by the coordinator. I think that the coordinator
>>> adjusts the total numFound by the number of duplicate documents it
>>> removed, in an attempt to be more accurate.
>>>
>>> I don't know if adjusting numFound when duplicates are found in a
>>> sharded query is the right thing to do; I'll leave that for smarter
>>> people. Perhaps Solr should return a message with the results saying
>>> that duplicates were found, and if a config option is not enabled, the
>>> server should throw an exception and return a 4xx HTTP error code. One
>>> idea for a config parameter name would be allowShardDuplicates, but
>>> something better can probably be found.
>>>
>>> Thanks,
>>> Shawn
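
The behavior Shawn describes in the quoted message can be illustrated with a toy simulation. This is only a sketch of his guess ("I think that the coordinator adjusts the total numFound"), not Solr code: the names shard_query, coordinate and SHARDS are hypothetical, each document is reduced to a (uniqueKey, score) pair, and the sample data is made up so that one key, "dup", lives in two shards with a low score.

```python
from typing import List, Tuple

Doc = Tuple[str, float]  # (uniqueKey, score); a stand-in for a real document


def shard_query(shard: List[Doc], rows: int) -> Tuple[int, List[Doc]]:
    """Hypothetical shard: report its own numFound and its top `rows` docs."""
    ranked = sorted(shard, key=lambda d: d[1], reverse=True)
    return len(shard), ranked[:rows]


def coordinate(shards: List[List[Doc]], rows: int) -> Tuple[int, List[Doc]]:
    """Hypothetical coordinator, following the guess in the quoted text:
    sum the per-shard numFound values, merge the returned docs, drop docs
    whose uniqueKey was already seen, and lower numFound once per dropped
    duplicate."""
    num_found = 0
    seen = set()
    merged: List[Doc] = []
    for shard in shards:
        found, top = shard_query(shard, rows)
        num_found += found
        for key, score in top:
            if key in seen:
                # Only duplicates that were actually returned can be noticed.
                num_found -= 1
                continue
            seen.add(key)
            merged.append((key, score))
    merged.sort(key=lambda d: d[1], reverse=True)
    return num_found, merged[:rows]


# Made-up data: "dup" is indexed in two shards and ranks last in both.
SHARDS = [
    [("a1", 9.0), ("a2", 8.0), ("dup", 1.0)],
    [("b1", 7.0), ("b2", 6.0), ("dup", 1.0)],
    [("c1", 5.0), ("c2", 4.0), ("c3", 3.0)],
]
```

With this data, coordinate(SHARDS, rows=2) never sees the duplicate because it falls outside each shard's top two, so numFound is the plain sum, 9. With rows=100 every doc is returned, the second "dup" is dropped, and numFound becomes 8. The count changes with rows only because a larger page exposes more of the duplicates, which matches the symptom Luis reported.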