lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chong, Herb" <>
Subject RE: inter-term correlation [was Re: Vector Space Model in Lucene?]
Date Fri, 14 Nov 2003 19:29:15 GMT
something in the index needs to mark sentence boundaries so that words close together in the
query and found close together in the text get penalized for being separated by a boundary.
proximity isn't enough because in more complex queries, some of the intermediate words might
be omitted. in other words, A near B near C works only when A, B, and C occur close together
and fails when B is omitted, but in English, A or B sometimes do get omitted after the first
mention of the phrase A B C.

as far as the ranking calculation goes, the distance itself isn't a direct factor. put it
this way, imagine you have a sliding window on the query of some number of terms. suppose
you have a 5 term query. an exact match of the 5 terms on the window ought to match higher
than 4 out of 5. similarly, matching 4 terms out of 5 with the 5th term nearby but in another
sentence should not rank as high. 5, BTW, is the magic number for English and other languages
that use the same rules for multiword term composition.

read this - read the section on Lexical
Affinities. also find this paper: Y. Maarek and F. Smadja. Full text indexing based on lexical
relations: An application: Software libraries. In Proceedings of the 12th International ACM
SIGIR Conference on Research and Development in Information Retrieval, pages 198--206, 1989.

-----Original Message-----
From: Erik Hatcher []
Sent: Friday, November 14, 2003 2:10 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

With Lucene's analysis process, you can assign a position increment to 
tokens.  The default value is 1, meaning its the next position.  Phrase 
queries default to a slop of 0, meaning they must be in successive 
positions.  When analyzing and you encounter a sentence boundary, you 
could set the position increment of the next word (the first word of 
the next sentence) to a high number (to account for users searching 
with potential slop, or just something greater than one if you never 
use sloppy phrase searches).

Does this get you closer to what you're after?

As for how to weight queries by the distance from terms.... I'll have 
to think on that some, but I suspect something reasonable could be done 
with a custom Similarity or a custom type of Query.


To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message