lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Different score for the same documents
Date Mon, 02 Nov 2009 13:17:58 GMT
That's exactly the question. If all 16 documents have exactly the same
score, then
the internal tie-breaking is your answer. They would also all have strictly
increasing doc IDs.

But I'd check to see the scores before accepting this explanation because
I find it unlikely that all 16 docs have identical scores. But it's worth
checking
out before looking for more complex answers.....

Best
Erick

On Mon, Nov 2, 2009 at 8:11 AM, kenji tsuruoka <kenji.tsuruoka@gmail.com>wrote:

> Thank you Erick.
>
> What you mentioned is right.
> The two same documents were shown at the 3rd and 18th.
>
> So do you mean documents between the 3rd and the 18th (at least) in the
> Lucene results have the same score?
>
> Cheers,
> K
>
>
> On Nov 2, 2009, at 9:59 PM, Erick Erickson wrote:
>
>  What were their scores? I'm assuming that by "rank" you mean
>> the order in which the documents were returned, not the raw Lucene
>> score.
>>
>> Lucene uses the insertion order to break ties. That is, two documents
>> with the same score will the appear in the order of their (internal)
>> Lucene doc ID.
>>
>> So is it possible that *all* of the documents that appear between these
>> two have the exact same score for that query? That seems a bit
>> unlikely, but it's worth checking before going much further.....
>>
>> Best
>> Erick
>>
>> On Mon, Nov 2, 2009 at 7:45 AM, kenji tsuruoka <kenji.tsuruoka@gmail.com
>> >wrote:
>>
>>  Dear. Lucene users.
>>>
>>> Hi.
>>> I have tried to index and search MEDLINE abstracts by LUCENE.
>>>
>>> And there were some problems in the search results.
>>> That is Lucene has assigned different ranks for the exactly same
>>> documents.
>>>
>>> I didn't know the input documents for the index contain duplicate
>>> documents
>>> at the first time.
>>> I have solve the problem by making all input documents UNIQUE for the
>>> index.
>>>
>>> But I want to know how and why the situation was happened.
>>>
>>> The duplicate document is as follows:
>>>
>>> _pubmed_id=13029105:1952Nov15
>>> _ArticleTitle_
>>> <s n="1">Experimental diabetes and clinical diabetes.</s>
>>> _pubmed_id_end_
>>>
>>> There are TWO exactly same documents in "index".
>>> And their rankings by Lucene are 3 and 18.
>>>
>>> I have known texts in XML/HTML data should be extracted before indexing.
>>> Anyway, I haven't done this work now.
>>>
>>> Please let me know the reason why the same documents were shown different
>>> ranks.
>>>
>>> Best,
>>> K
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message