lucene-java-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Term Boost Threshold
Date Fri, 13 Nov 2009 22:16:13 GMT
Hi Max,

  You want a query like

("San Francisco" OR "California") AND ("John Smith" OR "John Smith Manufacturing")

  essentially?  You can give Lucene exactly this query and it will require
that either "John Smith" or "John Smith Manufacturing" be present, but will
score results which have these and one or more of San Fran or CA higher.  And
in fact it will score highest the results which match all terms.
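
  To illustrate the required/optional distinction, and the point about not
caring about frequency, here is a minimal sketch in plain Java.  It is not
Lucene API: the class RequiredOptionalDemo and its methods are hypothetical
names, and phrase matching is approximated with a case-insensitive substring
check.  One caveat, as far as I know: the classic query parser treats AND as
making both sides required, so if the location phrases should only raise the
score, the form +("John Smith" "John Smith Manufacturing") "San Francisco"
"California" (with the default OR operator) is probably closer to what is
wanted.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; none of this is Lucene API.
final class RequiredOptionalDemo {

    // Counts how many of the given phrases occur at least once in the text.
    // Only presence matters, so repeated occurrences add nothing.
    static int countPresent(String text, List<String> phrases) {
        String lower = text.toLowerCase();
        int present = 0;
        for (String phrase : phrases) {
            if (lower.contains(phrase.toLowerCase())) {
                present++;
            }
        }
        return present;
    }

    // Returns -1 when no required phrase is present (reject the document);
    // otherwise the number of distinct phrases matched, required or optional.
    static int score(String text, List<String> required, List<String> optional) {
        int requiredHits = countPresent(text, required);
        if (requiredHits == 0) {
            return -1;
        }
        return requiredHits + countPresent(text, optional);
    }

    public static void main(String[] args) {
        List<String> required = Arrays.asList("John Smith", "John Smith Manufacturing");
        List<String> optional = Arrays.asList("San Francisco", "California");
        System.out.println(score("John Smith lives in Ohio.", required, optional));
        System.out.println(score("San Francisco, San Francisco, California", required, optional));
        System.out.println(score("John Smith Manufacturing, San Francisco, California", required, optional));
    }
}
```

  With the sample inputs in main, a document containing only the location
phrases is rejected however often they repeat, while a single "John Smith"
is accepted.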

  Does that help?

  -jake

On Fri, Nov 13, 2009 at 2:09 PM, Max Lynch <ihasmax@gmail.com> wrote:

> Hi,
> I am trying to move from a system where I counted the frequency of terms by
> hand in a highlighter to determine if a result was useful to me.  In an
> earlier post on this list someone suggested I could boost the terms that are
> useful to me and only accept hits above a certain threshold.  However, in my
> tests, I can't seem to find a deterministic way of calculating a threshold.
>
> Here is an example of what I mean:
> My query: "John Smith" "John Smith Manufacturing" "San Francisco" "California"
>
> Results are only useful to me if they contain the first term "John Smith"
> and/or the second term "John Smith Manufacturing" or any combination with
> the other San Fran and California terms.  However, results with just "San
> Francisco" or "California" can be ignored.
>
> I tried something like
> "John Smith"^200 "John Smith Manufacturing"^100 "San Francisco"^2 "California"^1
>
> But I can't seem to find a good method of calculating a cut-off score and
> filtering out the results that are only San Fran or California using the
> term boosting and resulting score.  I also don't care about frequency,
> meaning that I want the result even if John Smith occurs once, and I don't
> want a document with "San Francisco" a million times to score higher than
> the single result for John Smith.
>
> Sorry if that's confusing.
>
> Any ideas?
>
> Thanks,
> Max
>
