lucene-java-user mailing list archives

From "Mike O'Leary" <tmole...@uw.edu>
Subject RE: Obtaining IDF values for the terms in a document set
Date Thu, 15 Dec 2011 20:54:52 GMT
Hi Simon,
I guess in a sense we are interested in obtaining a list of the top N terms, but they would
be the top terms in the sense that they have the lowest IDF values. These would be the terms
that appear in all or almost all documents in the document set. This is not a count of the
number of term occurrences in documents; it is a count of documents that contain at least
one occurrence of a given term. Lucene must be storing IDF values for the terms of a document
set somewhere in order to compute TF/IDF values when searching. I am wondering if there is
an easy way to iterate through all of the terms that occur in the document set and obtain
their IDF values.
Thanks,
Mike

-----Original Message-----
From: Simon Willnauer [mailto:simon.willnauer@googlemail.com] 
Sent: Thursday, December 15, 2011 11:44 AM
To: java-user@lucene.apache.org
Subject: Re: Obtaining IDF values for the terms in a document set

On Thu, Dec 15, 2011 at 6:33 PM, Mike O'Leary <tmoleary@uw.edu> wrote:
> We have a large set of documents that we would like to index with a customized stopword
> list. We have run tests by indexing a random set of about 10% of the documents, and we'd like
> to generate a list of the terms in that smaller set and their IDF values as a way to create
> a starter set of stopwords for the larger document set by selecting the terms that have the
> lowest IDF values. First of all, is this the best way to create a stopword list? Second, is
> there a straightforward way to generate a list of terms and their IDF values from a Lucene
> index?
> Thanks,
> Mike

hey mike,

I can certainly help you with generating the list of your top N terms. Whether that is the
best or right way to generate a stopword list I am not sure, but maybe somebody else will
step up.

To get the top N terms out of your index you can simply iterate over the terms in a field and
keep the N terms with the highest docFreq() on a heap. Something like this:

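     static class TermAndDF {
       String term;
       int df;
       TermAndDF(String term, int df) {
         this.term = term;
         this.df = df;
       }
     }
     int queueSize = N;
     // a Lucene org.apache.lucene.util.PriorityQueue whose lessThan() compares df,
     // so that top() is the entry with the smallest docFreq kept so far
     PriorityQueue<TermAndDF> queue = ...

     final TermEnum termEnum = reader.terms(new Term(field));
      try {
        do {
          final Term term = termEnum.term();
          // stop at the end of the enum or once we have moved past this field
          if (term == null || !term.field().equals(field)) break;
          int docFreq = termEnum.docFreq();
          if (queue.size() < queueSize) {
             queue.add(new TermAndDF(term.text(), docFreq));
          } else if (queue.top().df < docFreq) {
             // overwrite the smallest entry in place, then restore heap order
             TermAndDF tnFrq = queue.top();
             tnFrq.term = term.text();
             tnFrq.df = docFreq;
             queue.updateTop();
          }
        } while (termEnum.next());
      } finally {
        termEnum.close();
      }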
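
For trying out the heap logic without a Lucene index at hand, here is a minimal sketch of the
same selection over a plain term-to-docFreq map. It uses java.util.PriorityQueue rather than
Lucene's own queue, and the names TopTermsByDocFreq and topN are made up for illustration;
none of this is Lucene API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Lucene: keeps the n terms with the highest docFreq.
final class TopTermsByDocFreq {

  static final class TermAndDF {
    final String term;
    final int df;

    TermAndDF(String term, int df) {
      this.term = term;
      this.df = df;
    }
  }

  // Orders entries by ascending docFreq, so the heap head is the smallest one kept.
  private static final Comparator<TermAndDF> BY_DF =
      Comparator.comparingInt((TermAndDF t) -> t.df);

  static List<TermAndDF> topN(Map<String, Integer> docFreqs, int n) {
    List<TermAndDF> result = new ArrayList<>();
    if (n <= 0) {
      return result;
    }
    PriorityQueue<TermAndDF> heap = new PriorityQueue<>(n, BY_DF);
    for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
      int df = e.getValue();
      if (heap.size() < n) {
        heap.add(new TermAndDF(e.getKey(), df));
      } else if (heap.peek().df < df) {
        // The new term beats the weakest entry: evict it and insert the new one.
        heap.poll();
        heap.add(new TermAndDF(e.getKey(), df));
      }
    }
    result.addAll(heap);
    // Report the highest docFreq first.
    result.sort(BY_DF.reversed());
    return result;
  }
}
```

The heap never holds more than n entries, so memory stays bounded no matter how many distinct
terms the field contains.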

another way of doing it is to use index pruning and drop terms with docFreq above a threshold
after you have indexed your doc set.
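
On the IDF side of the question: in Lucene 3.x, DefaultSimilarity derives IDF from docFreq and
the number of documents as 1 + ln(numDocs / (docFreq + 1)), so the lowest-IDF terms are the
highest-docFreq terms. The sketch below applies that formula and a docFreq threshold to a
plain term-to-docFreq map. It is not index pruning itself, only the selection step, and the
names StopwordCandidates, idf, aboveRatio and maxRatio are made up for illustration.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene: picks stopword candidates by document frequency.
final class StopwordCandidates {

  // Same formula as DefaultSimilarity.idf(docFreq, numDocs) in Lucene 3.x.
  static double idf(int docFreq, int numDocs) {
    return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
  }

  // Returns the terms that occur in more than maxRatio of all documents, sorted by term.
  static SortedSet<String> aboveRatio(Map<String, Integer> docFreqs, int numDocs,
                                      double maxRatio) {
    SortedSet<String> candidates = new TreeSet<>();
    for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
      double ratio = e.getValue() / (double) numDocs;
      if (ratio > maxRatio) {
        candidates.add(e.getKey());
      }
    }
    return candidates;
  }
}
```

A ratio threshold is easier to carry over from the 10% sample to the full document set than a
fixed N, since it does not depend on the size of the vocabulary.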

simon
