On Thu, Dec 15, 2011 at 6:33 PM, Mike O'Leary <tmoleary@uw.edu> wrote:
> We have a large set of documents that we would like to index with a customized stopword
> list. We have run tests by indexing a random set of about 10% of the documents, and we'd like
> to generate a list of the terms in that smaller set and their IDF values as a way to create
> a starter set of stopwords for the larger document set by selecting the terms that have the
> lowest IDF values. First of all, is this the best way to create a stopword list? Second, is
> there a straightforward way to generate a list of terms and their IDF values from a Lucene
> index?
> Thanks,
> Mike
hey mike,
I can certainly help you with generating the list of your top N terms.
Whether that is the best or right way to build a stopword list I am
not sure, but maybe somebody else will step up.
to get the top N terms out of your index you can simply iterate the
terms in a field and keep the N terms with the highest docFreq() on a
heap (the terms with the highest docFreq are the ones with the lowest
IDF). something like this:
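  static class TermAndDF {
    String term;
    int df;

    TermAndDF(String term, int df) {
      this.term = term;
      this.df = df;
    }
  }

  int queueSize = N;
  // org.apache.lucene.util.PriorityQueue ordered by df (lessThan compares df),
  // so top() is always the entry with the smallest docFreq kept so far
  PriorityQueue<TermAndDF> queue = ...
  final TermEnum termEnum = reader.terms(new Term(field));
  try {
    do {
      final Term term = termEnum.term();
      // stop at the end of the index or when we leave this field
      if (term == null || !term.field().equals(field)) break;
      int docFreq = termEnum.docFreq();
      if (queue.size() < queueSize) {
        queue.add(new TermAndDF(term.text(), docFreq));
      } else if (queue.top().df < docFreq) {
        // overwrite the weakest entry in place, then restore heap order
        TermAndDF tnFrq = queue.top();
        tnFrq.term = term.text();
        tnFrq.df = docFreq;
        queue.updateTop();
      }
    } while (termEnum.next());
  } finally {
    termEnum.close();
  }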
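
if you want actual IDF values next to the counts, here is a minimal
self-contained sketch of the same heap idea that runs without an index.
The class TopTermsByDocFreq and its methods are made up for illustration
and are not Lucene APIs; it uses java.util.PriorityQueue and assumes you
have already collected term -> docFreq into a map. The idf() formula is,
as far as I know, the one DefaultSimilarity uses
(1 + ln(numDocs / (docFreq + 1))), so treat it as an assumption if you
have a custom Similarity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Lucene: picks the N most frequent terms
// from a term -> docFreq map and can report an IDF value for each.
final class TopTermsByDocFreq {

  static final class TermAndDF {
    final String term;
    final int df;

    TermAndDF(String term, int df) {
      this.term = term;
      this.df = df;
    }
  }

  // Assumed to match DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)).
  static double idf(int docFreq, int numDocs) {
    return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
  }

  // Returns the n terms with the highest docFreq, most frequent first.
  static List<TermAndDF> topN(Map<String, Integer> docFreqs, int n) {
    // Min-heap on df: the head is the weakest candidate currently kept.
    PriorityQueue<TermAndDF> queue =
        new PriorityQueue<>((a, b) -> Integer.compare(a.df, b.df));
    for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
      if (queue.size() < n) {
        queue.add(new TermAndDF(e.getKey(), e.getValue()));
      } else if (!queue.isEmpty() && queue.peek().df < e.getValue()) {
        // Evict the weakest entry and keep the more frequent term instead.
        queue.poll();
        queue.add(new TermAndDF(e.getKey(), e.getValue()));
      }
    }
    List<TermAndDF> result = new ArrayList<>(queue);
    result.sort((a, b) -> Integer.compare(b.df, a.df));
    return result;
  }
}
```

call topN(docFreqs, N) and then idf(t.df, reader.numDocs()) for each
entry; the first entries in the list have the lowest IDF and are your
stopword candidates.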
another way of doing it is to use index pruning and drop terms with
docFreq above a threshold after you have indexed your doc set.
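
the threshold part of that idea can be sketched on its own. This is not
the pruning tool itself, only the selection step: StopwordsByThreshold
and maxDocRatio are names I made up, and the input is assumed to be a
term -> docFreq map you collected from the index.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not a Lucene API: treats every term that occurs in
// more than maxDocRatio of the documents as a stopword candidate.
final class StopwordsByThreshold {

  static Set<String> candidates(Map<String, Integer> docFreqs,
                                int numDocs,
                                double maxDocRatio) {
    Set<String> stopwords = new TreeSet<>();
    for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
      // Keep the term only if its docFreq exceeds the chosen share of docs.
      if (e.getValue() > maxDocRatio * numDocs) {
        stopwords.add(e.getKey());
      }
    }
    return stopwords;
  }
}
```

a relative threshold like this carries over from the 10% sample to the
full document set more easily than a fixed top N does, since it does not
depend on the absolute number of documents.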
simon

