lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <>
Subject Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
Date Fri, 14 Nov 2003 19:09:32 GMT
On Friday, November 14, 2003, at 02:02  PM, Chong, Herb wrote:
> if i just run this query against a million document newswire index, i 
> know i am going to get lots of hits. the phrase "capital gains tax" 
> hits a lot fewer documents, but is overrestrictive. the fact that the 
> three terms occur next to each other in the query means that documents 
> with the three terms far apart should not get nearly as much weight in 
> the ranking scheme. a sentence ending with two terms "capital gains" 
> followed by a sentence starting with the term "tax" should not be a 
> highly ranked match. that means you need sentence boundaries in the 
> index. the indexing and the query analysis scheme has to understand 
> the linguistic concept of a phrase, and phrases do not cross sentence 
> boundaries.

With Lucene's analysis process, you can assign a position increment to 
tokens.  The default value is 1, meaning its the next position.  Phrase 
queries default to a slop of 0, meaning they must be in successive 
positions.  When analyzing and you encounter a sentence boundary, you 
could set the position increment of the next word (the first word of 
the next sentence) to a high number (to account for users searching 
with potential slop, or just something greater than one if you never 
use sloppy phrase searches).

Does this get you closer to what you're after?

As for how to weight queries by the distance from terms.... I'll have 
to think on that some, but I suspect something reasonable could be done 
with a custom Similarity or a custom type of Query.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message