lucene-java-user mailing list archives

From Max Lynch <ihas...@gmail.com>
Subject Term Boost Threshold
Date Fri, 13 Nov 2009 22:09:30 GMT
Hi,
I am trying to move from a system where I counted the frequency of terms by
hand in a highlighter to determine if a result was useful to me.  In an
earlier post on this list someone suggested I could boost the terms that are
useful to me and only accept hits above a certain threshold.  However, in my
tests, I can't seem to find a deterministic way of calculating a threshold.

Here is an example of what I mean:
My query: "John Smith" "John Smith Manufacturing" "San Francisco"
"California"

Results are only useful to me if they contain the first phrase, "John Smith",
and/or the second phrase, "John Smith Manufacturing", with or without the
"San Francisco" and "California" phrases.  Results that contain only "San
Francisco" or "California" can be ignored.

I tried something like "John Smith"^200 "John Smith Manufacturing"^100 "San
Francisco"^2 "California"^1

But I can't seem to find a good method of calculating a cut-off score from
the term boosts and the resulting score, so that I can filter out the
results that match only "San Francisco" or "California".  I also don't care
about frequency: I want the result even if "John Smith" occurs only once,
and I don't want a document that contains "San Francisco" a million times
to score higher than a document with a single occurrence of "John Smith".
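
To make the goal concrete, here is a rough sketch of the arithmetic I have in
mind. It is plain Java, not Lucene code, and it assumes something Lucene's
default scoring does not do: every matched phrase contributes exactly its
boost, once, no matter how often it occurs. The class name
BoostThresholdSketch and its methods are made up for illustration. Under that
assumption the cut-off is simply the smallest boost among the phrases I care
about, provided the remaining boosts add up to less than that.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Lucene code. Assumes each matched phrase adds
// exactly its boost to the score, independent of how often it occurs.
class BoostThresholdSketch {
    // Boosts taken from the example query above.
    static final Map<String, Integer> BOOSTS = new LinkedHashMap<>();
    static {
        BOOSTS.put("John Smith", 200);
        BOOSTS.put("John Smith Manufacturing", 100);
        BOOSTS.put("San Francisco", 2);
        BOOSTS.put("California", 1);
    }

    // Phrases that make a hit useful on their own.
    static final Set<String> REQUIRED =
            Set.of("John Smith", "John Smith Manufacturing");

    // Score is the sum of the boosts of the matched phrases, each counted once.
    static int score(Set<String> matched) {
        int total = 0;
        for (Map.Entry<String, Integer> e : BOOSTS.entrySet()) {
            if (matched.contains(e.getKey())) {
                total += e.getValue();
            }
        }
        return total;
    }

    // Cut-off is the smallest required boost. It only separates the two groups
    // if all optional boosts together stay below it, so that is checked here.
    static int threshold() {
        int minRequired = Integer.MAX_VALUE;
        int optionalSum = 0;
        for (Map.Entry<String, Integer> e : BOOSTS.entrySet()) {
            if (REQUIRED.contains(e.getKey())) {
                minRequired = Math.min(minRequired, e.getValue());
            } else {
                optionalSum += e.getValue();
            }
        }
        if (optionalSum >= minRequired) {
            throw new IllegalStateException("optional boosts reach the cut-off");
        }
        return minRequired;
    }

    static boolean accept(Set<String> matched) {
        return score(matched) >= threshold();
    }

    public static void main(String[] args) {
        // Only the location phrases match: score 3, below the cut-off of 100.
        System.out.println(accept(Set.of("San Francisco", "California"))); // false
        // One useful phrase matches: score 100, which meets the cut-off.
        System.out.println(accept(Set.of("John Smith Manufacturing")));    // true
    }
}
```

With real Lucene scores this breaks down, because term frequency and other
statistics are folded into the score. That is the part I can't work out.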

Sorry if that's confusing.

Any ideas?

Thanks,
Max
