lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: de-boosting fields
Date Sat, 09 Dec 2006 21:00:42 GMT
I meant: search this mailing list archive.

Erick

On 12/9/06, Scott Smith <ssmith@mainstreamdata.com> wrote:
>
> I've googled for custom scorers and haven't found anything.  If anyone can
> point me to some posts, that would be appreciated.
>
> Sounds like setting the boost to zero works (see Daniel Naber's post), but
> that seems like overkill.
>
> I'll take a look at filters as well.
>
> ________________________________
>
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Fri 12/8/2006 7:06 PM
> To: java-user@lucene.apache.org
> Subject: Re: de-boosting fields
>
>
>
> I've certainly seen references to writing custom scorers, so it's
> possible. You might find valuable hints by searching the mail archive.
> I'll leave it to the more expert folks to suggest which is your best
> option.
>
> Although (and I'm talking beyond my competence here), it *may* work for
> you to assemble a Filter for the category part of your query and use
> that instead of including the category in your query. As I understand
> it, filters don't contribute (or all contribute identically) to the
> score, leaving the search you're doing on body to determine your
> relevance, which seems like what you're after. Filters even work with
> something called a ConstantScoreQuery as I remember, which is a hint <G>.
>
> But again, don't be surprised if one of the more expert folks comes up
> with a *much* better idea <G>
>
> Best
> Erick
>
>
>
> On 12/8/06, Scott Smith <ssmith@mainstreamdata.com> wrote:
> >
> > I have a collection of documents for which I've always returned the
> > results sorted on the date/time of the document (using a sort object in
> > the search method on my Searcher).  It works great.
> >
> >
> >
> > Suddenly, I have a requirement to return the documents in relevancy
> > order.  So, that's easy (I thought); simply call search() without a sort
> > object.  Unfortunately, the results I got were not what I expected.  So,
> > I added some code to have Lucene explain how it was getting the score
> > and then things became clearer.
> >
> >
> >
> > Each document has all of the words in the document indexed in a field
> > called "Body" (vanilla unstored, indexed field).  However, there is also
> > some category information which is kept in a keyword field called
> > "Category".  A document may belong to a large number of categories
> > (10-70).
> >
> >
> >
> > When I search, I generate a query which says "give me all of the
> > documents, in relevancy order, which contain one or more of the
> > following words: word1, word2, word3, and which are also in at least
> > one of the following categories: category1, category2, ..., categoryN."
> >
> >
> >
> > What I found was that Lucene was using the category information as part
> > of what it uses to compute the relevancy score (in hindsight, not too
> > surprising).  The problem is that the numbers from the category hits in
> > "Category" overwhelm the numbers from the word hits in "Body".  So,
> > my top-scoring document may only have a single word hit, while a
> > document way down in the list (in terms of score) might have a number
> > of word hits.  For example, in one search, the top scoring document
> > scored .2650.  Of that, the category information contributed .2635,
> > meaning the word hits only contributed .0015 to the relevancy.
> > This is the opposite of what I want.
> >
> >
> >
> > I'd be happy to simply eliminate the category information from the score
> > computation altogether (base relevancy scores only on the words which
> > hit in the "Body" field).  Another solution would be to change the boost
> > on the category information to some small number (zero?) or raise the
> > Body field boost to a much larger number or both.
> >
> >
> >
> > What is the best way to do this?  Is changing the boost the right
> > answer?  Can a field's boost be zero?  Is there a way to write a custom
> > scorer that gets inserted somewhere?  Any suggestions would be
> > appreciated.
> >
> >
> >
> > Scott
>
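The arithmetic in Scott's example, and the effect of the two suggestions in the thread (a zero boost on the category clause, or moving the category into a Filter), can be sketched with a toy model. This is plain Java and does not call Lucene; it assumes the final score is simply a boost-weighted sum of a "Body" contribution and a "Category" contribution, which ignores the coordination and normalization factors that real Lucene scoring applies. The class `DeboostSketch`, its methods, and documents B and C are invented for illustration; only document A's split (.0015 and .2635) comes from the message above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy scoring model for illustration only; this is not Lucene code. */
final class DeboostSketch {

    /** A hypothetical document with its raw per-field score contributions. */
    static final class Doc {
        final String id;
        final double bodyScore;     // from word hits in "Body"
        final double categoryScore; // from hits in "Category"; 0 means no category matched

        Doc(String id, double bodyScore, double categoryScore) {
            this.id = id;
            this.bodyScore = bodyScore;
            this.categoryScore = categoryScore;
        }
    }

    /** Assumed model: the final score is a boost-weighted sum of the two contributions. */
    static double score(Doc d, double bodyBoost, double categoryBoost) {
        return bodyBoost * d.bodyScore + categoryBoost * d.categoryScore;
    }

    /** Keeps documents that match both parts of the query, then ranks them best first. */
    static List<String> rank(List<Doc> docs, double bodyBoost, double categoryBoost) {
        List<Doc> matching = new ArrayList<>();
        for (Doc d : docs) {
            // The category still decides membership even when it adds nothing to the score.
            if (d.bodyScore > 0 && d.categoryScore > 0) {
                matching.add(d);
            }
        }
        matching.sort(Comparator.comparingDouble(
                (Doc d) -> score(d, bodyBoost, categoryBoost)).reversed());
        List<String> ids = new ArrayList<>();
        for (Doc d : matching) {
            ids.add(d.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("A", 0.0015, 0.2635)); // the split Scott reported
        docs.add(new Doc("B", 0.0060, 0.0500)); // invented: more word hits, fewer category hits
        docs.add(new Doc("C", 0.0100, 0.0));    // invented: good word hits, wrong category

        System.out.println(rank(docs, 1.0, 1.0)); // category dominates: [A, B]
        System.out.println(rank(docs, 1.0, 0.0)); // body only: [B, A]
    }
}
```

In this model a zero category boost and a category filter behave the same way: the category still decides which documents qualify (C never appears), but only the "Body" contribution orders them. Whether real Lucene scores match this closely depends on the version and the Similarity in use, so the explain output remains the thing to check.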
