lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Scott Smith" <>
Subject de-boosting fields
Date Sat, 09 Dec 2006 01:25:45 GMT
I have a collection of documents for which I've always returned the
results sorted on the date/time of the document (using a sort object in
the search method on my Searcher).  It works great.


Suddenly, I have a requirement to return the documents in relevancy
order.  So, that's easy (I thought); simply call search() without a sort
object.  Unfortunately, the results I got were not what I expected.  So,
I added some code to have lucene explain how it was getting the score
and then things became clearer.  


Each document has all of the words in the document indexed in a field
called "Body" (vanilla unstored, indexed field).  However, there is also
some category information which is kept in a keyword field called
"Category".  A document may belong to a large number of categories


When I search, I generate a query which says "give me all of the
documents, in relevancy order, which contain one or more of the
following words: word1, word2, word 3-and it also must be in at least
one of the following categories: category1, category2, ..., categoryN.


What I found was that lucene was using the category information as part
of what it uses to compute the relevancy score (in hindsight, not too
surprising).  The problem is that the numbers from the category hits in
"Category" overwhelm the numbers from the word hits in the "Body".  So,
my most relevant document may only have a single word hit and a document
way down in the list (in terms of relevancy) might have a number of word
hits.  For example, in one search, the top scoring document scored
.2650.  Of that, the category information contributed .2635 to that
score-meaning the word hits only contributed .0015 to the relevancy.
This is the opposite of what I want.


I'd be happy to simply eliminate the category information from the score
computation all together (base relevancy scores only on the words which
hit in the "Body" field).  Another solution would be to change the boost
on the category information to some small number (zero?) or raise the
Body field boost to a much larger number or both.


What is the best way to do this?  Is changing the boost the right
answer?  Can a field's boost be zero?  Is there a way to write a custom
scorer that gets inserted somewhere?  Any suggestions would be





  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message