lucene-java-user mailing list archives

From Jeremy Volkman <jvolk...@gmail.com>
Subject Base score to use for custom query?
Date Sat, 24 Apr 2010 03:40:42 GMT
I have a situation similar to the following that I'm trying to solve:

I have a field in my document that contains a range of numbers. Say, for
example, the universe of numbers is the range of integers from 0-100. My
field represents a subrange of those numbers in a token stream. So, for
example, if one document contains 20-30, its token stream contains the
terms [20, 21, 22, ..., 29]. Now I can quickly find all documents that
contain some number.
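
A minimal sketch of that expansion, in plain Java with no Lucene types. The class and method names are made up for illustration, and the end of the range is assumed to be exclusive so that it matches the 20-30 example above:

```java
import java.util.ArrayList;
import java.util.List;

class RangeTerms {
    // Expands [start, end) into one term per integer,
    // e.g. 20-30 becomes "20", "21", ..., "29".
    static List<String> expand(int start, int end) {
        List<String> terms = new ArrayList<String>();
        for (int n = start; n < end; n++) {
            terms.add(Integer.toString(n));
        }
        return terms;
    }
}
```

Each string would then be emitted as one token for the field, so a lookup for a single number is an ordinary term lookup.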

The next part of the problem is searching for all documents that intersect
with some subrange of numbers. Somewhat like a range query, but not exactly.
Say I want to search for all documents that touch the range [10, 30]. My
original implementation was to simply create a BooleanQuery full of
TermQuery clauses, one for each term in the range I was searching for. While this
returned the proper results, it did so with skewed scores. I'd prefer
documents containing numbers towards the beginning of my search range to be
scored higher than docs towards the end. So, if I had two documents, one
with 10-20, and one with 20-30, and I searched for [19,30], both documents
would be returned, but the second would be much more highly scored due to
its higher number of matched terms.
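
To make the skew concrete, here is a rough sketch that only counts how many query terms each document matches. It ignores everything else that goes into a real Lucene score, and it assumes document ranges are end-exclusive (as in the token stream example) while the query range is inclusive. The names are hypothetical:

```java
class OverlapCount {
    // Number of integers shared by a document range [docStart, docEnd)
    // and an inclusive query range [qStart, qEnd].
    static int matchedTerms(int docStart, int docEnd, int qStart, int qEnd) {
        int lo = Math.max(docStart, qStart);
        int hi = Math.min(docEnd - 1, qEnd);
        return Math.max(0, hi - lo + 1);
    }
}
```

Under those assumptions, `matchedTerms(10, 20, 19, 30)` is 1 and `matchedTerms(20, 30, 19, 30)` is 10, so the second document satisfies far more clauses of the BooleanQuery than the first.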

So, my plan is to write a custom query that matches documents in
my range along these lines:

for (Term term : queryRange) {
    // enumerate every document containing this term
    TermDocs td = searcher.termDocs(term);
    while (td.next()) {
        ...
    }
}

And for each document, set the score to some value proportional to the
matching term's distance from the beginning of the queried range.

My question is: what score should I start at, and what score should I end
at? If I assume that all documents matching the first term in my queried
range have score scoreMax, and all documents matching the last term have
scoreMin, and all documents matching in-between terms have a score between
scoreMax and scoreMin proportional to where they fall within the range, what
should scoreMax and scoreMin be?

My current thought is to start with the value passed to my Weight's
normalize() method, and work down to 0.0.
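
The interpolation described above could be sketched roughly as follows. This is plain Java rather than a Lucene Scorer, the class and method names are invented for illustration, and scoreMax and scoreMin are left as parameters because choosing them is the open question:

```java
class RangeScore {
    // Linearly interpolates from scoreMax (term == rangeStart)
    // down to scoreMin (term == rangeEnd).
    static float score(int term, int rangeStart, int rangeEnd,
                       float scoreMax, float scoreMin) {
        if (rangeEnd == rangeStart) {
            return scoreMax; // single-term range: nothing to interpolate
        }
        float fraction = (float) (term - rangeStart) / (rangeEnd - rangeStart);
        return scoreMax - fraction * (scoreMax - scoreMin);
    }
}
```

For a queried range of [10, 30] with scoreMax 1.0 and scoreMin 0.0, a document whose first matching term is 10 would get 1.0, one matching at 20 would get 0.5, and one matching only at 30 would get 0.0.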

Thanks,
Jeremy
