lucene-java-user mailing list archives

From Doug Cutting
Subject Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
Date Fri, 14 Nov 2003 20:14:57 GMT
Chong, Herb wrote:
> since i am working now on financial news, here is an example:
>
> capital gains tax
>
> if i just run this query against a million document newswire index, i know
> i am going to get lots of hits. the phrase "capital gains tax" hits a lot
> fewer documents, but is overrestrictive. the fact that the three terms occur
> next to each other in the query means that documents with the three terms
> far apart should not get nearly as much weight in the ranking scheme. a
> sentence ending with two terms "capital gains" followed by a sentence
> starting with the term "tax" should not be a highly ranked match. that means
> you need sentence boundaries in the index. the indexing and the query
> analysis scheme has to understand the linguistic concept of a phrase, and
> phrases do not cross sentence boundaries.

Have sentence boundaries actually proven to be that useful in this sort
of thing?  For example, if the text were something like:

   "Such sales would be considered long term capital gains.  Tax on 
these is 20%."

Then penalizing for the sentence boundary wouldn't be valid.
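
To make the example concrete, here is a minimal sketch in Python. It is not Lucene code: the function names, the naive sentence splitter, and the boundary_penalty parameter are all assumptions made up for illustration. It scores a document by the smallest window that contains every query term, and optionally scales the score down when that window crosses a sentence boundary.

```python
import re
from itertools import product

# The example text from the message above.
EXAMPLE = ("Such sales would be considered long term capital gains.  "
           "Tax on these is 20%.")


def tokenize(text):
    """Return (term, position, sentence_index) triples.

    Sentences are split naively on '.', '!' and '?'; this is an assumption
    for the sketch and would mis-split abbreviations or decimals.
    """
    tokens = []
    pos = 0
    for sent_idx, sentence in enumerate(re.split(r"[.!?]+", text)):
        for term in re.findall(r"[a-z0-9%]+", sentence.lower()):
            tokens.append((term, pos, sent_idx))
            pos += 1
    return tokens


def best_window(tokens, query_terms):
    """Find the smallest span of positions containing every query term.

    Returns (span, crosses_boundary), or None if a term is missing.
    On equal spans, a window inside one sentence is preferred.
    """
    occurrences = [[(p, s) for t, p, s in tokens if t == q]
                   for q in query_terms]
    if any(not occ for occ in occurrences):
        return None
    best = None
    for combo in product(*occurrences):
        positions = [p for p, _ in combo]
        span = max(positions) - min(positions) + 1
        crosses = len({s for _, s in combo}) > 1
        candidate = (span, crosses)
        if best is None or candidate < best:
            best = candidate
    return best


def proximity_score(text, query, boundary_penalty=1.0):
    """Score = (number of query terms) / (smallest window span).

    boundary_penalty (hypothetical knob) multiplies the score when the
    best window spans more than one sentence; 1.0 means no penalty.
    """
    query_terms = query.lower().split()
    window = best_window(tokenize(text), query_terms)
    if window is None:
        return 0.0
    span, crosses = window
    score = len(query_terms) / span
    if crosses:
        score *= boundary_penalty
    return score


if __name__ == "__main__":
    print(proximity_score(EXAMPLE, "capital gains tax"))                        # 1.0
    print(proximity_score(EXAMPLE, "capital gains tax", boundary_penalty=0.5))  # 0.5
```

Pure proximity gives the passage the maximum score, since the three terms are adjacent. With a penalty of 0.5 the same passage loses half its score, even though it is plainly about capital gains tax.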

