lucene-java-user mailing list archives

From mark harwood <>
Subject Re: SIPs and CAPs
Date Thu, 14 Jul 2005 11:17:21 GMT
I've done this by comparing term frequency in a subset
(in Amazon's case, a single book) and looking for a
significant "uplift" in a term's popularity relative to
its popularity in the general corpus. Practically speaking,
in the Amazon case you can treat each page in the example
book as a Lucene document, create a RAMDirectory and
then use its TermEnum to get the docFreqs for all
words and compare them with the corpus docFreqs.
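
A minimal sketch of the docFreq comparison described above, written in plain Java rather than against the Lucene API. Each string stands in for one page (one document). The class and method names are hypothetical, and the ratio used as the score is an assumption made for illustration; it is not necessarily the score the author's code uses.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; not part of Lucene.
final class UpliftSketch {

    /** For each term, counts how many documents contain it at least once. */
    static Map<String, Integer> docFreqs(List<String> docs) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String doc : docs) {
            Set<String> seen = new HashSet<>();
            for (String token : doc.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                // Count each term once per document, skipping empty tokens.
                if (!token.isEmpty() && seen.add(token)) {
                    freqs.merge(token, 1, Integer::sum);
                }
            }
        }
        return freqs;
    }

    /** ASSUMED score: (subsetDf / subsetSize) / (corpusDf / corpusSize). */
    static double uplift(int subsetDf, int subsetSize, int corpusDf, int corpusSize) {
        double subsetRate = (double) subsetDf / subsetSize;
        // Treat terms missing from the corpus as docFreq 1 to avoid dividing by zero.
        double corpusRate = (double) Math.max(corpusDf, 1) / corpusSize;
        return subsetRate / corpusRate;
    }

    /** Returns the n subset terms with the highest assumed uplift score. */
    static List<String> topTerms(List<String> subset, List<String> corpus, int n) {
        Map<String, Integer> subsetDf = docFreqs(subset);
        Map<String, Integer> corpusDf = docFreqs(corpus);
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Integer> e : subsetDf.entrySet()) {
            int corpusCount = corpusDf.getOrDefault(e.getKey(), 0);
            Double score = uplift(e.getValue(), subset.size(), corpusCount, corpus.size());
            scored.add(new AbstractMap.SimpleEntry<String, Double>(e.getKey(), score));
        }
        // Highest score first; ties broken alphabetically for a stable order.
        scored.sort((x, y) -> {
            int byScore = Double.compare(y.getValue(), x.getValue());
            return byScore != 0 ? byScore : x.getKey().compareTo(y.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, scored.size()); i++) {
            top.add(scored.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("flux capacitor the", "flux the car");
        List<String> corpus = Arrays.asList("the car", "the dog", "the flux capacitor", "a cat");
        System.out.println(topTerms(pages, corpus, 2));
    }
}
```

With Lucene itself, the two docFreq maps would come from the term enumerations of the subset index and the corpus index instead of from the tokenizing loop shown here.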

The "uplift" score for each term is

Take the top "n" terms as scored above, then
analyze the text of the subset looking for runs of
these terms.
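
The run-finding step could be sketched as follows, again in plain Java. The class name, the `findRuns` method, and the `minLength` parameter are hypothetical, and the tokenization is a simplification that ignores sentence boundaries; this is one possible reading of "runs", not the author's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; not part of Lucene.
final class TermRunSketch {

    /** Collects maximal sequences of adjacent tokens that are all top terms. */
    static List<String> findRuns(String text, Set<String> topTerms, int minLength) {
        List<String> runs = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (topTerms.contains(token)) {
                current.add(token);
            } else {
                // A non-top term ends the current run; keep it if long enough.
                if (current.size() >= minLength) {
                    runs.add(String.join(" ", current));
                }
                current.clear();
            }
        }
        // Flush a run that reaches the end of the text.
        if (current.size() >= minLength) {
            runs.add(String.join(" ", current));
        }
        return runs;
    }

    public static void main(String[] args) {
        Set<String> top = new HashSet<>(Arrays.asList("flux", "capacitor"));
        String page = "The flux capacitor needs a flux capacitor, not flux.";
        System.out.println(findRuns(page, top, 2));
    }
}
```

Counting how often each run appears across the subset would then give candidate phrases, with a minimum length of 2 filtering out isolated terms.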

I have some code for this that I have wanted to
package up as a contribution for some time.
