lucene-java-user mailing list archives

From "Chong, Herb" <HCho...@bloomberg.com>
Subject RE: inter-term correlation [was Re: Vector Space Model in Lucene?]
Date Fri, 14 Nov 2003 19:02:37 GMT
Since I am now working on financial news, here is an example:

capital gains tax

If I just run this query as three independent terms against a million-document newswire index,
I know I am going to get lots of hits. The phrase "capital gains tax" hits far fewer documents,
but is too restrictive. The fact that the three terms occur next to each other in the query means
that documents with the three terms far apart should not get nearly as much weight in the ranking
scheme. A sentence ending with the two terms "capital gains" followed by a sentence starting with
the term "tax" should not be a highly ranked match. That means you need sentence boundaries in
the index. The indexing and query analysis schemes have to understand the linguistic concept of
a phrase, and phrases do not cross sentence boundaries.
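
A rough sketch of the ranking behaviour described above, in plain Python rather than
Lucene's API. Everything here is an assumption made for illustration: the function names,
the naive sentence splitter (it will break on abbreviations such as "U.S."), and the
bonus formula itself (number of distinct query terms divided by the shortest span that
contains all of them inside a single sentence). It is meant as a proximity bonus on top of
ordinary term scoring, not as a complete ranking function.

```python
import re


def split_sentences(text):
    """Naively split text into sentences, each returned as a list of lowercase words."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.lower())
    return [re.findall(r"\w+", part) for part in parts if part.strip()]


def min_span(words, terms):
    """Length of the shortest window of `words` containing every term, or None."""
    needed = set(terms)
    # Positions of query terms in this sentence, in document order.
    hits = [(i, w) for i, w in enumerate(words) if w in needed]
    best = None
    counts = {}
    left = 0
    for pos, word in hits:
        counts[word] = counts.get(word, 0) + 1
        # Shrink the window from the left while it still covers all terms.
        while len(counts) == len(needed):
            left_pos, left_word = hits[left]
            span = pos - left_pos + 1
            if best is None or span < best:
                best = span
            counts[left_word] -= 1
            if counts[left_word] == 0:
                del counts[left_word]
            left += 1
    return best


def proximity_bonus(text, query):
    """Hypothetical bonus: 1.0 for an exact adjacent match, less as terms spread out,
    0.0 when no single sentence contains all of the query terms."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    best = 0.0
    for words in split_sentences(text):
        span = min_span(words, terms)
        if span is not None:
            best = max(best, len(terms) / span)
    return best


adjacent = "The capital gains tax was raised."
spread = "The tax on long-term capital gains was raised."
crossing = "Investors booked capital gains. Tax policy was unchanged."

for doc in (adjacent, spread, crossing):
    print(proximity_bonus(doc, "capital gains tax"), doc)
```

With this formula the adjacent match gets the full bonus, the same three terms spread
across one sentence get a smaller bonus, and the "capital gains. Tax ..." case gets none
because the terms never share a sentence.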

Herb....

-----Original Message-----
From: Erik Hatcher [mailto:erik@ehatchersolutions.com]
Sent: Friday, November 14, 2003 1:52 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

You mean if you have text like this: "Hello Herb.  Have a nice day!", 
you want to prevent phrase queries for "herb have"?  You could prevent 
sentence boundary crossing with clever use of the token position I 
suspect.  Would that accomplish what you're after?
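
To make the token position idea concrete, here is a minimal sketch in plain Python, not
Lucene's API. The names `tokenize_with_positions`, `phrase_matches` and `SENTENCE_GAP` are
hypothetical, and the sentence splitting is deliberately naive. The idea is that every
sentence boundary adds a large jump to the token position counter, so an exact phrase
match, which needs consecutive positions, can never span two sentences.

```python
import re

SENTENCE_GAP = 100  # assumed gap size; any value larger than the allowed slop would do


def tokenize_with_positions(text, gap=SENTENCE_GAP):
    """Return (term, position) pairs; positions jump by `gap` after each sentence."""
    tokens = []
    position = 0
    # Naive split on ., ! or ? followed by whitespace or end of text.
    for sentence in re.split(r"[.!?]+(?:\s+|$)", text):
        words = re.findall(r"\w+", sentence.lower())
        if not words:
            continue
        for word in words:
            tokens.append((word, position))
            position += 1
        position += gap  # leave a hole in the position sequence at the boundary
    return tokens


def phrase_matches(tokens, phrase):
    """True if the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return False
    positions = {}
    for term, pos in tokens:
        positions.setdefault(term, set()).add(pos)
    return any(
        all(start + offset in positions.get(term, ()) for offset, term in enumerate(terms))
        for start in positions.get(terms[0], ())
    )


tokens = tokenize_with_positions("Hello Herb.  Have a nice day!")
print(phrase_matches(tokens, "herb have"))   # crosses the sentence boundary
print(phrase_matches(tokens, "hello herb"))  # within the first sentence
print(phrase_matches(tokens, "nice day"))    # within the second sentence
```

Only the first query fails to match, because "herb" and "have" are no longer at adjacent
positions once the gap is inserted.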

Could you give a really dumbed down simple example of what you mean by 
inter-term correlation?

	Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
