lucene-java-user mailing list archives

From Shaya Potter <spot...@gmail.com>
Subject Re: easy way to figure out most common tokens?
Date Wed, 15 Aug 2012 18:42:47 GMT
On 08/15/2012 02:29 PM, Erick Erickson wrote:
> I don't see how you could without indexing everything first
> since you can't know what the most frequent terms are until
> you've processed all your documents....

exactly

> If you know these terms in advance, it seems like you could
> just call them stopwords and use the common stopword
> processing.
>
> If you have to examine your corpus in the first place,
> it seems like you could do something with term
> frequencies to extract the most common terms from
> your index then re-index all your data with those terms
> as stopwords..

It's a possibility, but that would require reindexing, which would take a
long time, hence my desire to try to edit the individual documents.
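
For reference, the "examine your corpus" step quoted above can be sketched
in plain Java. This is only an illustration under assumptions: it does not
use any Lucene API, the class and method names (CommonTerms, topTerms) are
hypothetical, tokenization is a naive split on non-word characters, and
"most common" is taken to mean highest document frequency.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: finds candidate stopwords
// by counting how many documents each term appears in.
class CommonTerms {

    // Returns up to n terms with the highest document frequency,
    // most frequent first; ties are broken alphabetically.
    static List<String> topTerms(List<String> docs, int n) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Naive tokenization (assumption): lowercase, split on non-word chars.
            Set<String> seen = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\W+")));
            seen.remove(""); // split can yield an empty leading token
            // Count each term once per document.
            for (String term : seen) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreq.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the dog ran", "the cat ran fast");
        System.out.println(topTerms(docs, 2));
    }
}
```

The returned list would then be the candidate stopword set. Whether those
terms can be dropped from already-indexed documents without a full reindex
is the open question in this thread; the sketch does not address that.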
