lucene-java-user mailing list archives

From "Uwe Schindler" <>
Subject RE: easy way to figure out most common tokens?
Date Wed, 15 Aug 2012 18:47:40 GMT
Once you have found the terms to remove (e.g. with HighFreqTerms), you can
use the abstract class FilterIndexReader (FilterAtomicReader in Lucene 4.0)
to filter the term dictionary during merging (just return a filtered
TermsEnum). Wrap an IndexReader with this FilterIndexReader so that it hides
the unwanted terms, then call IndexWriter.addIndexes(filteredReader) on a
new, empty index. This still takes time, but it may be faster than
reindexing.
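The wrapping described above might look like the following sketch (not
compile-checked; it assumes the Lucene 4.0 APIs — FilterAtomicReader with its
FilterFields/FilterTerms inner classes and FilteredTermsEnum — and a
hypothetical "stopTerms" set holding the high-frequency terms found earlier
with HighFreqTerms):

```java
import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.FilterAtomicReader;
import org.apache.lucene.index.FilteredTermsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

/**
 * Hides a given set of terms from the term dictionary. When this reader is
 * passed to IndexWriter.addIndexes(...), the hidden terms (and their
 * postings) are simply not copied into the target index.
 */
public class TermPruningReader extends FilterAtomicReader {
  private final Set<BytesRef> stopTerms; // terms to drop, e.g. from HighFreqTerms

  public TermPruningReader(AtomicReader in, Set<BytesRef> stopTerms) {
    super(in);
    this.stopTerms = stopTerms;
  }

  @Override
  public Fields fields() throws IOException {
    return new FilterFields(super.fields()) {
      @Override
      public Terms terms(String field) throws IOException {
        Terms terms = super.terms(field);
        if (terms == null) {
          return null;
        }
        return new FilterTerms(terms) {
          @Override
          public TermsEnum iterator(TermsEnum reuse) throws IOException {
            // Filter the enum: terms in the stop set are skipped,
            // everything else passes through unchanged.
            return new FilteredTermsEnum(super.iterator(reuse)) {
              @Override
              protected AcceptStatus accept(BytesRef term) {
                return stopTerms.contains(term)
                    ? AcceptStatus.NO
                    : AcceptStatus.YES;
              }
            };
          }
        };
      }
    };
  }
}
```

Usage would then be roughly: open an IndexWriter on a new, empty directory
and call writer.addIndexes(new TermPruningReader(reader, stopTerms)). Note
that stored fields and term vectors are not touched by this filter, only the
inverted term dictionary.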

Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen

> -----Original Message-----
> From: Shaya Potter []
> Sent: Wednesday, August 15, 2012 8:43 PM
> To:
> Cc: Erick Erickson
> Subject: Re: easy way to figure out most common tokens?
> On 08/15/2012 02:29 PM, Erick Erickson wrote:
> > I don't see how you could without indexing everything first, since you
> > can't know what the most frequent terms are until you've processed all
> > your documents....
> exactly
> > If you know these terms in advance, it seems like you could just call
> > them stopwords and use the common stopword processing.
> >
> > If you have to examine your corpus in the first place, it seems like
> > you could do something with term frequencies to extract the most
> > common terms from your index then re-index all your data with those
> > terms as stopwords..
> It's a possibility, but that would require reindexing, which would take
> time, hence my desire to try to edit the individual documents.

