lucene-java-user mailing list archives

From "Burton-West, Tom" <tburt...@umich.edu>
Subject RE: Obtaining IDF values for the terms in a document set
Date Thu, 15 Dec 2011 20:58:46 GMT
Hi Mike,

If you just need the IDF, you can run HighFreqTerms.java (in contrib) against either your
sample index or your full index to get the N terms with the highest DF values (i.e. the
lowest IDF). If you have a large index, giving it lots of memory seems to help.
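
To go from the DF values to actual IDF numbers, here is a minimal sketch in plain Java. It
assumes the classic formula used by DefaultSimilarity in Lucene 3.x, idf = 1 + ln(numDocs /
(docFreq + 1)); check that against the Similarity you actually use. The class name, the
method names and the DF values are made up for illustration and are not part of Lucene or
of HighFreqTerms.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IdfSketch {

    // Assumed formula (classic DefaultSimilarity): 1 + ln(numDocs / (docFreq + 1)).
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Returns up to n terms ordered by ascending IDF, i.e. the most common terms first.
    public static List<String> lowestIdfTerms(Map<String, Integer> docFreqs, int numDocs, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreqs.entrySet());
        entries.sort(Comparator.comparingDouble(
                (Map.Entry<String, Integer> e) -> idf(e.getValue(), numDocs)));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical term -> DF pairs, e.g. copied from a high-frequency term listing.
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("lucene", 120);
        docFreqs.put("the", 990);
        docFreqs.put("of", 950);
        docFreqs.put("hathitrust", 3);
        int numDocs = 1000; // hypothetical number of documents in the sample index

        // Print the two best stopword candidates with their IDF values.
        for (String term : lowestIdfTerms(docFreqs, numDocs, 2)) {
            System.out.printf("%s\t%.3f%n", term, idf(docFreqs.get(term), numDocs));
        }
    }
}
```

Since IDF decreases as DF increases, sorting by ascending IDF gives the same order as
sorting by descending DF, so for picking stopword candidates the DF list alone is enough.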

Depending on your use case, you may instead want to run it with the "-t" flag, which gets
the terms with the highest total number of occurrences (total tf). That is a good measure
of the size of the positions list for those terms. The size of the positions list only
matters if you allow phrase or proximity queries.
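
To make the difference between DF and total tf concrete, here is a toy sketch in plain Java
that counts both over a few already-tokenized documents. It does not read a Lucene index
and is not how HighFreqTerms works internally; the class name, method name and sample
documents are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class TermStatsSketch {

    // For each term returns {document frequency, total term frequency}.
    public static Map<String, int[]> termStats(List<List<String>> docs) {
        Map<String, int[]> stats = new TreeMap<>();
        for (List<String> doc : docs) {
            Set<String> seenInDoc = new HashSet<>();
            for (String token : doc) {
                int[] s = stats.computeIfAbsent(token, k -> new int[2]);
                s[1]++;                      // every occurrence counts toward total tf
                if (seenInDoc.add(token)) {
                    s[0]++;                  // only the first occurrence per document counts toward DF
                }
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        // Hypothetical tokenized documents.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "cat", "and", "the", "hat"),
                Arrays.asList("the", "the", "the", "end"),
                Arrays.asList("cat", "nap"));
        for (Map.Entry<String, int[]> e : termStats(docs).entrySet()) {
            System.out.println(e.getKey() + "\tdf=" + e.getValue()[0] + "\ttotal tf=" + e.getValue()[1]);
        }
    }
}
```

A term that repeats many times within a few documents can have a modest DF but a large
total tf, and it is the total tf that tracks how long its positions list will be.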

See:
http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java?view=markup

Regarding the positions list and slow phrase queries see:
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

You can also look at the standard stop word sets at
http://snowball.tartarus.org/  (look under the entries for each stemmer)
or http://search.cpan.org/~creamyg/Lingua-StopWords-0.09/
or http://members.unine.ch/jacques.savoy/clef/index.html

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

-----Original Message-----
From: Mike O'Leary [mailto:tmoleary@uw.edu] 
Sent: Thursday, December 15, 2011 12:34 PM
To: java-user@lucene.apache.org
Subject: Obtaining IDF values for the terms in a document set

We have a large set of documents that we would like to index with a customized stopword list.
We have run tests by indexing a random set of about 10% of the documents, and we'd like to
generate a list of the terms in that smaller set and their IDF values as a way to create a
starter set of stopwords for the larger document set by selecting the terms that have the
lowest IDF values. First of all, is this the best way to create a stopword list? Second, is
there a straightforward way to generate a list of terms and their IDF values from a Lucene
index?
Thanks,
Mike

