lucene-java-user mailing list archives

From: "Burton-West, Tom"
Subject: RE: Obtaining IDF values for the terms in a document set
Date: Thu, 15 Dec 2011 20:58:46 GMT

Hi Mike,

If you just need the IDF you can run in contrib against either your sample
index or your full index to get the N terms with the highest DF values (i.e., the lowest IDF). If you
have a large index, giving the tool lots of memory seems to help.
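
To make the DF/IDF relationship concrete, here is a minimal sketch in plain Java that does not use Lucene or the contrib tool. The class name DfIdfSketch and its methods are hypothetical, the documents are assumed to be already tokenized into lists of strings, and the IDF variant 1 + ln(N / (df + 1)) is an assumption; substitute whatever formula your Similarity uses.

```java
import java.util.*;

// Hypothetical helper, not part of Lucene: computes DF and IDF in memory.
final class DfIdfSketch {

    // Document frequency: the number of documents that contain the term at least once.
    static Map<String, Integer> docFreqs(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set ensures each term is counted once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Assumed IDF variant: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // The n terms with the highest DF (lowest IDF); ties are broken alphabetically.
    static List<String> topByDocFreq(Map<String, Integer> df, int n) {
        List<String> terms = new ArrayList<>(df.keySet());
        terms.sort((a, b) -> {
            int byDf = Integer.compare(df.get(b), df.get(a));
            return byDf != 0 ? byDf : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "cat"),
                Arrays.asList("the", "dog"),
                Arrays.asList("the", "cat", "the"));
        Map<String, Integer> df = docFreqs(docs);
        // Print stopword candidates, most frequent first, with DF and IDF.
        for (String term : topByDocFreq(df, 2)) {
            System.out.printf("%s df=%d idf=%.3f%n", term, df.get(term), idf(df.get(term), docs.size()));
        }
    }
}
```

Because IDF falls as DF rises, sorting by highest DF gives the same order as sorting by lowest IDF, so a stopword candidate list can be built from DF alone.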

Depending on your use case, you may instead want to run it with the "-t" flag which will get
the terms with the highest total occurrences (total tf), which is a good measure of the size
of the positions list for those terms.  The size of the positions list only matters if you
allow phrase or proximity queries.
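
The difference between DF and total tf can be illustrated the same way. This is a hedged sketch in plain Java, not what the "-t" flag does internally: the class TotalTfSketch and its methods are hypothetical names, and documents are again assumed to be pre-tokenized lists of strings. It counts every occurrence of a term, which is what drives the size of the positions list.

```java
import java.util.*;

// Hypothetical helper, not part of Lucene: counts total occurrences per term.
final class TotalTfSketch {

    // Total term frequency: every occurrence counts, including repeats within one document.
    static Map<String, Long> totalTermFreqs(List<List<String>> docs) {
        Map<String, Long> ttf = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                ttf.merge(term, 1L, Long::sum);
            }
        }
        return ttf;
    }

    // The n terms with the most total occurrences; ties are broken alphabetically.
    static List<String> topByTotalTf(Map<String, Long> ttf, int n) {
        List<String> terms = new ArrayList<>(ttf.keySet());
        terms.sort((a, b) -> {
            int byTf = Long.compare(ttf.get(b), ttf.get(a));
            return byTf != 0 ? byTf : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    public static void main(String[] args) {
        // "c" appears in only one document (DF = 1) but four times in total,
        // while "a" appears in all three documents but only three times in total.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("a", "b"),
                Arrays.asList("a", "c", "c", "c", "c"));
        Map<String, Long> ttf = totalTermFreqs(docs);
        for (String term : topByTotalTf(ttf, 3)) {
            System.out.println(term + " totalTf=" + ttf.get(term));
        }
    }
}
```

In the sample data a term with a low DF still ranks first by total occurrences, which is why the two rankings can suggest different stopword candidates when phrase or proximity queries are in play.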


Regarding the positions list and slow phrase queries see:

You can also look at the standard stop word sets at  (look under the entries for each stemmer)

Tom Burton-West

-----Original Message-----
From: Mike O'Leary
Sent: Thursday, December 15, 2011 12:34 PM
Subject: Obtaining IDF values for the terms in a document set

We have a large set of documents that we would like to index with a customized stopword list.
We have run tests by indexing a random set of about 10% of the documents, and we'd like to
generate a list of the terms in that smaller set and their IDF values as a way to create a
starter set of stopwords for the larger document set by selecting the terms that have the
lowest IDF values. First of all, is this the best way to create a stopword list? Second, is
there a straightforward way to generate a list of terms and their IDF values from a Lucene
