lucene-java-user mailing list archives

From Erik Hatcher <>
Subject Re: SIPs and CAPs
Date Thu, 14 Jul 2005 12:31:55 GMT

On Jul 14, 2005, at 7:17 AM, mark harwood wrote:
> I've done this by comparing term frequency in a subset
> (in Amazon's case a single book) and looking for a
> significant "uplift" in term popularity vs that of the
> general corpus popularity. Practically speaking, in
> the Amazon case you can treat each page in the example
> book as a Lucene document, create a RAMDirectory and
> then use its TermEnum to get the docFreqs for all
> words and compare them with the corpus docFreqs.
> The "uplift" score for each term is
> (subsetDocFreq/subsetNumDocs)-(corpusDocFreq/corpusNumDocs)
> Take the top "n" terms scored by the above then
> analyze the text of the subset looking for runs of
> these terms.
> I have some code for this that I have wanted to
> package up as a contribution for some time.
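
For readers following along, here is a minimal sketch of the "uplift" scoring quoted above. It is not Mark's code and it does not touch the Lucene API: it assumes the docFreqs have already been read out of the subset and corpus indexes into plain maps. The class name UpliftScorer and the methods uplift and topTerms are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: ranks terms by docFreq "uplift".
class UpliftScorer {

    // uplift = (subsetDocFreq / subsetNumDocs) - (corpusDocFreq / corpusNumDocs)
    static double uplift(int subsetDocFreq, int subsetNumDocs,
                         int corpusDocFreq, int corpusNumDocs) {
        return (double) subsetDocFreq / subsetNumDocs
             - (double) corpusDocFreq / corpusNumDocs;
    }

    // Returns up to n subset terms with the highest uplift, best first.
    // Terms missing from the corpus map are treated as corpusDocFreq = 0
    // (an assumption; the original message does not cover that case).
    static List<String> topTerms(Map<String, Integer> subsetDocFreqs, int subsetNumDocs,
                                 Map<String, Integer> corpusDocFreqs, int corpusNumDocs,
                                 int n) {
        // Score every subset term once up front.
        final Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : subsetDocFreqs.entrySet()) {
            int corpusDf = corpusDocFreqs.getOrDefault(e.getKey(), 0);
            scores.put(e.getKey(),
                       uplift(e.getValue(), subsetNumDocs, corpusDf, corpusNumDocs));
        }
        // Sort by descending score, breaking ties alphabetically.
        List<String> terms = new ArrayList<>(scores.keySet());
        terms.sort((a, b) -> {
            int c = Double.compare(scores.get(b), scores.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }
}
```

A word that appears on most pages of the book but rarely in the corpus scores close to 1, while a stop word that is common in both scores close to 0, which is why no separate stop list is needed for the ranking itself.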


Do you just do this with terms or do you also extract phrases?
Phrases would be more intensive to deal with since positional
information is needed as well as some rules to decide on
minimum/maximum length of phrases and such.  Perhaps the technique
you describe would be useful in locating spots to dig into for phrases?
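
To make the phrase question concrete, here is a rough sketch of the "runs of these terms" step with a minimum/maximum length rule. It is only one possible reading, under stated assumptions: the text has already been analyzed into an ordered list of tokens, list adjacency stands in for positional information, and runs longer than the maximum are dropped rather than split. The class TermRunFinder and its method findRuns are hypothetical names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: finds candidate phrases as runs of top terms.
class TermRunFinder {

    // Returns every maximal run of consecutive tokens that are all in topTerms,
    // keeping only runs whose length is between minLen and maxLen inclusive.
    static List<List<String>> findRuns(List<String> tokens, Set<String> topTerms,
                                       int minLen, int maxLen) {
        List<List<String>> runs = new ArrayList<>();
        int start = -1; // index where the current run began, or -1 if not in a run
        // Iterate one step past the end so a run touching the last token is closed.
        for (int i = 0; i <= tokens.size(); i++) {
            boolean hit = i < tokens.size() && topTerms.contains(tokens.get(i));
            if (hit) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                int len = i - start;
                if (len >= minLen && len <= maxLen) {
                    runs.add(new ArrayList<>(tokens.subList(start, i)));
                }
                start = -1;
            }
        }
        return runs;
    }
}
```

With minLen set to 2, isolated high-uplift terms are ignored and only multi-word candidates survive, which could then be counted across pages to decide which ones are worth keeping as phrases.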

As for CAPs, perhaps a specialized TokenFilter could be used to do
this during the indexing analysis step - I don't think it would be  

