lucene-java-user mailing list archives

From "Chong, Herb" <HCho...@bloomberg.com>
Subject RE: inter-term correlation [was Re: Vector Space Model in Lucene?]
Date Mon, 17 Nov 2003 14:03:36 GMT
now you're talking. this is one way of doing it. you need to work out a heuristic to increment
the counter enough that a misrecognized long sentence won't trigger this. however, one can
argue that a sentence that contains 1000 words can't possibly be about one topic.

Herb....

-----Original Message-----
From: Karsten Konrad [mailto:Karsten.Konrad@xtramind.com]
Sent: Saturday, November 15, 2003 7:16 AM
To: Lucene Users List
Subject: AW: inter-term correlation [was Re: Vector Space Model in
Lucene?]

Anyway, Herb is right, sentence boundaries do carry a meaning and the 
linguistic rule could be phrased as: "Constituents (Concepts) mentioned 
in one sentence together have a closer relation than those that are not."

I was wondering whether we could, while indexing, make use of this by 
increasing the position counter by a large number, let's say 1000, 
whenever we encounter a sentence separator (note that this is not trivial; 
not every '.' ends a sentence, etc.). Thus, searching for

"income tax"~100 "tax gain"~100 "income tax gain"~100 income tax gain

would find "income tax gain" as usual, but would boost all texts
where the phrases involved appear within sentence boundaries - I 
assume that a sentence with 100 words would be pretty unlikely,
but still within the 1000 word separation done by increasing the
position. No linguistics necessary, actually, but it is an application
of a linguistic rule!
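
To make the idea above concrete, here is a minimal sketch in plain Python. It does not use Lucene's API. The names `assign_positions`, `within_slop` and `SENTENCE_GAP` are hypothetical, the sentence splitter is deliberately naive (it has exactly the "not every '.' ends a sentence" problem mentioned above), and the proximity check is a simplified stand-in for a sloppy phrase query such as "tax gain"~100, not Lucene's actual slop computation.

```python
import re

# Assumed gap, taken from the "let's say 1000" in the message above.
SENTENCE_GAP = 1000


def assign_positions(text, gap=SENTENCE_GAP):
    """Return (term, position) pairs; the position jumps by `gap` at each sentence break."""
    # Naive splitter: '.', '!' or '?' followed by whitespace counts as a boundary.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    positions = []
    pos = 0
    for i, sentence in enumerate(sentences):
        if i > 0:
            pos += gap  # push the next sentence far away from the previous one
        for term in re.findall(r"\w+", sentence.lower()):
            positions.append((term, pos))
            pos += 1
    return positions


def within_slop(positions, term_a, term_b, slop):
    """True if some occurrences of the two terms lie at most `slop` positions apart."""
    pos_a = [p for t, p in positions if t == term_a]
    pos_b = [p for t, p in positions if t == term_b]
    return any(abs(a - b) <= slop for a in pos_a for b in pos_b)


same = assign_positions("The income tax gain was small.")
split = assign_positions("Income tax rose. The gain was small.")

print(within_slop(same, "tax", "gain", 100))   # terms share a sentence
print(within_slop(split, "tax", "gain", 100))  # terms are separated by the gap
```

With a slop of 100 and a gap of 1000, two terms can only match when they fall in the same detected sentence, provided no single sentence runs longer than the slop.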
