From: Erick Erickson
To: java-user@lucene.apache.org
Reply-To: java-user@lucene.apache.org
Date: Mon, 2 Nov 2009 08:17:58 -0500
Subject: Re: Different score for the same documents
Message-ID: <359a92830911020517oc2a2474u97e46909d819877c@mail.gmail.com>
In-Reply-To: <717FD000-0BBA-4343-B3DD-FB86DBE24517@gmail.com>
References: <04120229-EF7A-4AFB-BACE-CDC31F100D55@gmail.com> <359a92830911020459k220025dds8b34cd03bd5a1cfd@mail.gmail.com> <717FD000-0BBA-4343-B3DD-FB86DBE24517@gmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=005045018102d46d0104776334d3

--005045018102d46d0104776334d3
Content-Type: text/plain; charset=ISO-8859-1

That's exactly the question. If all 16 documents have exactly the same
score, then the internal tie-breaking is your answer. They would also all
have strictly increasing doc IDs.

I'd check the scores before accepting this explanation, though, because I
find it unlikely that all 16 docs have identical scores. Still, it's worth
checking before looking for more complex answers.

Best
Erick

On Mon, Nov 2, 2009 at 8:11 AM, kenji tsuruoka wrote:

> Thank you Erick.
>
> What you mentioned is right.
> The two identical documents were shown at the 3rd and 18th positions.
>
> So do you mean that the documents between the 3rd and the 18th (at least)
> in the Lucene results have the same score?
>
> Cheers,
> K
>
>
> On Nov 2, 2009, at 9:59 PM, Erick Erickson wrote:
>
>> What were their scores? I'm assuming that by "rank" you mean
>> the order in which the documents were returned, not the raw Lucene
>> score.
>>
>> Lucene uses the insertion order to break ties. That is, two documents
>> with the same score will appear in the order of their (internal)
>> Lucene doc ID.
>>
>> So is it possible that *all* of the documents that appear between these
>> two have the exact same score for that query? That seems a bit
>> unlikely, but it's worth checking before going much further.
>>
>> Best
>> Erick
>>
>> On Mon, Nov 2, 2009 at 7:45 AM, kenji tsuruoka wrote:
>>
>>> Dear Lucene users,
>>>
>>> Hi.
>>> I have tried to index and search MEDLINE abstracts with Lucene.
>>>
>>> There were some problems in the search results:
>>> Lucene assigned different ranks to exactly the same documents.
>>>
>>> At first I didn't know that the input documents for the index
>>> contained duplicates.
>>> I have solved the problem by making all input documents unique for the
>>> index.
>>>
>>> But I want to know how and why this situation happened.
>>>
>>> The duplicate document is as follows:
>>>
>>> _pubmed_id=13029105:1952Nov15
>>> _ArticleTitle_
>>> Experimental diabetes and clinical diabetes.
>>> _pubmed_id_end_
>>>
>>> There are TWO exactly identical documents in the index,
>>> and their rankings by Lucene are 3 and 18.
>>>
>>> I know that text in XML/HTML data should be extracted before indexing.
>>> Anyway, I haven't done this work yet.
>>>
>>> Please let me know why the same documents were given different
>>> ranks.
>>>
>>> Best,
>>> K
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org

--005045018102d46d0104776334d3--
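
The tie-breaking rule discussed in the thread can be illustrated with a
small standalone sketch. This is not Lucene code and does not use the Lucene
API; the Hit type, the helper names, and the sample scores and doc IDs are
all hypothetical. It only assumes the behavior described above: results are
ordered by score, and equal scores are ordered by increasing doc ID.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    doc_id: int   # stands in for the internal doc ID (made-up values)
    score: float  # stands in for the raw score (made-up values)


def rank(hits: List[Hit]) -> List[Hit]:
    """Order by score descending, breaking ties by ascending doc ID."""
    return sorted(hits, key=lambda h: (-h.score, h.doc_id))


def same_score_between(ranked: List[Hit], first: int, last: int) -> bool:
    """True if every hit from rank `first` to rank `last` (1-based,
    inclusive) has the same score, i.e. the span is one big tie."""
    window = ranked[first - 1:last]
    return all(h.score == window[0].score for h in window)


# Hypothetical hits: doc 9 scores highest, docs 2, 4 and 7 are tied.
hits = [Hit(7, 1.5), Hit(2, 1.5), Hit(9, 2.0), Hit(4, 1.5)]
ranked = rank(hits)

for position, hit in enumerate(ranked, start=1):
    print(position, hit.doc_id, hit.score)

# Ranks 2..4 are a pure tie, so their order is decided by doc ID alone.
print(same_score_between(ranked, 2, 4))
# Ranks 1..4 include a different score, so tie-breaking cannot explain it.
print(same_score_between(ranked, 1, 4))
```

Applied to the case in the thread, the equivalent check would be whether the
hits at ranks 3 through 18 all carry the same score. If they do not, the
duplicates were scored differently and tie-breaking is not the explanation.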