lucene-java-user mailing list archives

From "daniel rosher" <daniel.ros...@hotonline.com>
Subject Lucene scoring and short fields
Date Thu, 07 Feb 2008 10:15:28 GMT
Hi All,

Lucene scoring can favour shorter fields in documents. In the past we've
had to pad out 'unreasonably' short fields to a set minimum length (with
what are basically nonsense words). I'm wondering how others have dealt
with this issue.

Another option is to have a custom Similarity class with an altered
lengthNorm method?

Cheers,
Dan
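
The lengthNorm alternative mentioned above can be sketched roughly as follows. This is a standalone illustration that does not depend on Lucene: the class name, method names and the MIN_TOKENS value are all hypothetical. It keeps the 1/sqrt(numTokens) shape quoted later in this message, but treats any field shorter than the floor as if it had MIN_TOKENS tokens, so very short fields no longer receive an outsized norm and no padding words are needed.

```java
// Standalone sketch (no Lucene dependency) of a length norm with a floor.
// All names and the MIN_TOKENS value are assumptions made for illustration.
public class FlooredLengthNorm {

    // Hypothetical floor: fields with fewer tokens are scored as if they had this many.
    static final int MIN_TOKENS = 50;

    // Same shape as the formula quoted from DefaultSimilarity: 1/sqrt(numTokens).
    static float defaultLengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    // Clamp the token count from below before applying the default formula.
    static float flooredLengthNorm(int numTokens) {
        return defaultLengthNorm(Math.max(numTokens, MIN_TOKENS));
    }

    public static void main(String[] args) {
        // Compare the two norms for a very short, a floor-length and a long field.
        for (int n : new int[] {5, 50, 300}) {
            System.out.printf("tokens=%d default=%.4f floored=%.4f%n",
                    n, defaultLengthNorm(n), flooredLengthNorm(n));
        }
    }
}
```

In Lucene itself the same idea would presumably go into a DefaultSimilarity subclass that overrides lengthNorm. Since the norm is computed when the document is added to the index, existing documents would need to be re-indexed for a change like this to take effect.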

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

From:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html
score(q,d)   =  coord(q,d)  ·  queryNorm(q)  ·  SUM(  tf(t in d)  ·
idf(t)^2  ·  t.getBoost()  ·  norm(t,d)  )

Given a one-term query, with the term found in two documents, doc{a} and
doc{b} (with no boost on the field, document or query term), coord,
queryNorm and idf are the same for both documents, so:

score(q,d)   =~  SUM (  tf(t in d)  ·  norm(t,d)  )

and for one term:
score(q,d)   =~   tf(t in d)  ·  norm(t,d)  

also:
norm(t,d) =~ lengthNorm(field) 

lengthNorm(field) :
computed when the document is added to the index in accordance with the
number of tokens of this field in the document, so that shorter fields
contribute more to the score

in DefaultSimilarity.java

lengthNorm(field)  = 1/sqrt(num_terms_in_field)

doc{a} field{a}: num_terms_in_field = 100, term appears 10 times in
field{a}, doc{a}
score =~ 10/sqrt(100) = 1

doc{b} field{a}: num_terms_in_field = 300, term appears 10 times in
field{a}, doc{b}
score =~ 10/sqrt(300) = 0.577350269
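
The arithmetic above can be reproduced with a small standalone sketch. It assumes the simplified model used in this message (score =~ raw term frequency · 1/sqrt(num_terms_in_field)); the class and method names are hypothetical and nothing here calls Lucene. Note that DefaultSimilarity is understood to take the square root of the raw frequency in tf(), which would scale both scores by the same factor and leave their ratio unchanged.

```java
// Standalone sketch of the simplified one-term score used in the message.
// Class and method names are assumptions made for illustration only.
public class ShortFieldScore {

    // lengthNorm as quoted in the message: 1/sqrt(num_terms_in_field).
    static double lengthNorm(int numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    // Simplified score: raw term frequency times the length norm.
    // coord, queryNorm, idf and boosts are left out because they are
    // identical for both documents in the example.
    static double approxScore(int termFreq, int numTermsInField) {
        return termFreq * lengthNorm(numTermsInField);
    }

    public static void main(String[] args) {
        // doc{a}: 100 terms in the field, 10 occurrences of the query term.
        System.out.println("doc{a}: " + approxScore(10, 100));
        // doc{b}: 300 terms in the field, 10 occurrences of the query term.
        System.out.println("doc{b}: " + approxScore(10, 300));
    }
}
```

With the same term frequency, the only difference between the two scores is the length norm, so the document with the shorter field ranks higher.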

Daniel Rosher
Developer
http://www.hotonline.com/