lucene-java-user mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Re: SIPs and CAPs
Date Thu, 14 Jul 2005 11:17:21 GMT
I've done this by comparing term frequency in a subset
(in Amazon's case, a single book) with that of the
general corpus and looking for a significant "uplift"
in term popularity. Practically speaking, in the Amazon
case you can treat each page in the example book as a
Lucene document, index the pages into a RAMDirectory,
open an IndexReader on it, and then use its TermEnum to
get the docFreqs for all words and compare them with
the corpus docFreqs.

The "uplift" score for each term is:
(subsetDocFreq / subsetNumDocs) - (corpusDocFreq / corpusNumDocs)
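
As a rough illustration of the scoring step, here is a minimal sketch in plain Java. It is not the code mentioned in this message, and it does not use the Lucene API: the class name UpliftScorer and its methods are hypothetical, and pages are assumed to be already tokenized into lists of terms. With Lucene, the docFreq values would come from the index rather than from the docFreqs helper shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only.
class UpliftScorer {

    // Counts, for each term, how many documents (pages) contain it at least once.
    static Map<String, Integer> docFreqs(List<List<String>> docs) {
        Map<String, Integer> freqs = new HashMap<>();
        for (List<String> doc : docs) {
            Set<String> unique = new HashSet<>(doc); // count each term once per document
            for (String term : unique) {
                freqs.merge(term, 1, Integer::sum);
            }
        }
        return freqs;
    }

    // (subsetDocFreq / subsetNumDocs) - (corpusDocFreq / corpusNumDocs)
    static double uplift(int subsetDocFreq, int subsetNumDocs,
                         int corpusDocFreq, int corpusNumDocs) {
        return (double) subsetDocFreq / subsetNumDocs
                - (double) corpusDocFreq / corpusNumDocs;
    }

    // Returns the n subset terms with the highest uplift, best first.
    // Terms missing from the corpus map are assumed to have a corpus docFreq of 0.
    static List<String> topTerms(Map<String, Integer> subsetFreqs, int subsetNumDocs,
                                 Map<String, Integer> corpusFreqs, int corpusNumDocs,
                                 int n) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : subsetFreqs.entrySet()) {
            int corpusDf = corpusFreqs.getOrDefault(e.getKey(), 0);
            scores.put(e.getKey(),
                    uplift(e.getValue(), subsetNumDocs, corpusDf, corpusNumDocs));
        }
        List<String> terms = new ArrayList<>(scores.keySet());
        terms.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a)); // descending
            return byScore != 0 ? byScore : a.compareTo(b);            // stable tie-break
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }
}
```

Note that a word appearing on every page of the book but also in nearly every corpus document (such as "the") scores close to zero, which is what lets the book-specific vocabulary rise to the top.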

Take the top "n" terms scored by the above then
analyze the text of the subset looking for runs of
these terms.
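
One possible reading of "runs of these terms" is sketched below: scan the subset's tokens in order and collect every maximal stretch of consecutive tokens that are all in the top-"n" set. This is an assumption about the approach, not the code mentioned in this message. The class name TermRunFinder and the minLength parameter are hypothetical, and the text is assumed to be tokenized already.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only.
class TermRunFinder {

    // Returns each maximal run of consecutive top terms, joined with spaces,
    // keeping only runs of at least minLength tokens.
    static List<String> findRuns(List<String> tokens, Set<String> topTerms, int minLength) {
        List<String> runs = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            if (topTerms.contains(token)) {
                current.add(token);           // extend the current run
            } else {
                if (current.size() >= minLength) {
                    runs.add(String.join(" ", current));
                }
                current.clear();              // a non-top term ends the run
            }
        }
        if (current.size() >= minLength) {    // flush a run that reaches the end of the text
            runs.add(String.join(" ", current));
        }
        return runs;
    }
}
```

Counting how often each returned run occurs across the subset would then give candidate phrases, with minLength set to 2 to skip single words.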

I have some code for this that I have wanted to
package up as a contribution for some time.


	
	
		