lucene-java-user mailing list archives

From Simon Willnauer <simon.willna...@googlemail.com>
Subject Re: Obtaining IDF values for the terms in a document set
Date Thu, 15 Dec 2011 19:43:43 GMT
On Thu, Dec 15, 2011 at 6:33 PM, Mike O'Leary <tmoleary@uw.edu> wrote:
> We have a large set of documents that we would like to index with a customized stopword
> list. We have run tests by indexing a random set of about 10% of the documents, and we'd like
> to generate a list of the terms in that smaller set and their IDF values as a way to create
> a starter set of stopwords for the larger document set by selecting the terms that have the
> lowest IDF values. First of all, is this the best way to create a stopword list? Second, is
> there a straightforward way to generate a list of terms and their IDF values from a Lucene
> index?
> Thanks,
> Mike

Hey Mike,

I can certainly help you with generating the list of your top N terms.
Whether that is the best or right way to generate the stopword list I am
not sure, but maybe somebody else will step up.

IDF falls as docFreq rises, so the terms with the lowest IDF are the ones
with the highest docFreq. To get the top N terms out of your index you can
simply iterate over the terms in a field and keep the top N terms, ranked
by docFreq(), on a heap. Something like this:

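     static class TermAndDF {
       String term;
       int df;
       TermAndDF(String term, int df) {
         this.term = term;
         this.df = df;
       }
     }
     int queueSize = N;
     // an org.apache.lucene.util.PriorityQueue whose lessThan() compares df,
     // so that top() is always the entry with the smallest df
     PriorityQueue<TermAndDF> queue = ...

     final TermEnum termEnum = reader.terms(new Term(field));
     try {
       do {
         final Term term = termEnum.term();
         // stop once the enum runs out or moves on to another field
         if (term == null || !term.field().equals(field)) break;
         int docFreq = termEnum.docFreq();
         if (queue.size() < queueSize) {
           queue.add(new TermAndDF(term.text(), docFreq));
         } else if (queue.top().df < docFreq) {
           // overwrite the weakest entry in place, then restore heap order
           TermAndDF tnFrq = queue.top();
           tnFrq.term = term.text();
           tnFrq.df = docFreq;
           queue.updateTop();
         }
       } while (termEnum.next());
     } finally {
       termEnum.close();
     }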
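
If you want to try the heap logic and the IDF values without an index at hand, here is a self-contained sketch of the same idea. It is only an illustration under a few assumptions: a plain Map of term to docFreq stands in for the TermEnum, java.util.PriorityQueue stands in for the Lucene queue, the class and method names (TopTermsByDocFreq, topN, idf) are made up for this example, and it needs Java 8 or later. The idf() method uses the formula from Lucene's DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1)); a custom Similarity may compute it differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopTermsByDocFreq {

    /** A term and the number of documents that contain it. */
    static final class TermAndDF {
        final String term;
        final int df;

        TermAndDF(String term, int df) {
            this.term = term;
            this.df = df;
        }
    }

    /** Returns the n terms with the highest docFreq, highest first. */
    static List<TermAndDF> topN(Map<String, Integer> docFreqs, int n) {
        // Min-heap on df: the head is the weakest of the current candidates.
        PriorityQueue<TermAndDF> heap = new PriorityQueue<>(
                Math.max(1, n), Comparator.comparingInt((TermAndDF t) -> t.df));
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (heap.size() < n) {
                heap.add(new TermAndDF(e.getKey(), e.getValue()));
            } else if (!heap.isEmpty() && heap.peek().df < e.getValue()) {
                // Evict the weakest candidate and keep the stronger term.
                heap.poll();
                heap.add(new TermAndDF(e.getKey(), e.getValue()));
            }
        }
        List<TermAndDF> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingInt((TermAndDF t) -> t.df).reversed());
        return result;
    }

    /** IDF as in Lucene's DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // Hypothetical docFreq values for a 100-document sample.
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("the", 95);
        docFreqs.put("lucene", 12);
        docFreqs.put("of", 90);
        docFreqs.put("heap", 3);
        int numDocs = 100;

        for (TermAndDF t : topN(docFreqs, 2)) {
            System.out.println(t.term + "\tdf=" + t.df + "\tidf=" + idf(t.df, numDocs));
        }
    }
}
```

With a real index you would feed term.text() and termEnum.docFreq() into the heap instead of the map entries, and take numDocs from reader.numDocs().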

Another way of doing it is to use index pruning and drop terms with a
docFreq above a threshold after you have indexed your doc set.
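
The selection rule behind that threshold can be sketched as follows. This is not the pruning tool itself, only the "docFreq above a threshold" test applied to a term to docFreq map; the class name StopwordCandidates, the method aboveThreshold and the maxDocRatio parameter are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StopwordCandidates {

    /**
     * Returns the terms that occur in more than maxDocRatio of all documents,
     * in the iteration order of the given map.
     */
    static List<String> aboveThreshold(Map<String, Integer> docFreqs,
                                       int numDocs, double maxDocRatio) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (e.getValue() > maxDocRatio * numDocs) {
                candidates.add(e.getKey());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Hypothetical docFreq values for a 100-document sample.
        Map<String, Integer> docFreqs = new TreeMap<>();
        docFreqs.put("the", 95);
        docFreqs.put("of", 90);
        docFreqs.put("lucene", 12);
        // Terms present in more than half of the documents.
        System.out.println(aboveThreshold(docFreqs, 100, 0.5));
    }
}
```

A ratio rather than an absolute count keeps the same cutoff usable when you move from the 10% sample to the full document set.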

simon
