lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: short documents = help me tweak Similarity??
Date Thu, 05 Apr 2007 18:13:08 GMT
It is the right forum, silence just means either no one knows the  
answer or no one who knows the answer has read it...  Such is the  
nature of the community.

Have you looked at overriding similarity with your own  
implementation?  Have you done explain() calls on the docs to see  
where the scores are coming from?  You may be seeing other factors at  
play.

You might also try searching the archives for length normalization.   
I seem to recall someone talking about the opposite problem, calling  
it "fair" similarity, so maybe you could use that as a basis for your  
implementation (by doing the opposite).

-Grant


On Apr 5, 2007, at 1:45 PM, John Kleven wrote:

> Sorry to re-post -- is this the correct forum for questions like  
> this?  I
> think that writing a new encode/decode operation should help  
> alleviate my
> problem, but thought that this must be fairly widespread issue for  
> people
> using lucene for "non-web-page" searches (i.e., shorter documents)
>
> Thanks again,
> John
>
> On 4/2/07, John Kleven <johnkleven@gmail.com> wrote:
>>
>> My documents are cars...
>> i.e.,
>> Nissan Altima Sports Package
>> Nissan Altima Standard
>>
>> The problem I have is when i search "Nissan Altima", I want to get  
>> the 2nd
>> hit back first, i.e. "Nissan Altima Standard", because it is shorter.
>> However, this doesn't happen.  They are both scored the exact same.
>>
>> I know that the lengthNorm in Similarity is using 1/sqrt 
>> (numTerms), and
>> you would think that would be enuff to make sure the order is  
>> correct.
>> However, it is not, and I assume this is because of the encode/decode
>> functions that pack this value into a single byte do not have the
>> granularity to represent differences between numbers like 1/sqrt 
>> (3) vs
>> 1/sqrt(4)??
>>
>> Is the suggested approach here to re-write the encode/decode  
>> operations,
>> or is there any easier way?
>>
>> Thanks kindly -
>> John

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ 
LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message