lucene-java-user mailing list archives

From Grant Ingersoll
Subject Re: Stemmed terms/common terms
Date Thu, 16 Aug 2007 16:06:10 GMT

On Aug 16, 2007, at 10:17 AM, Alf Eaton wrote:

> A couple of questions about term frequencies and stemming:
> - What's the best way to get the most common unstemmed form of a  
> Porter-stemmed word from the index? For example given the stem  
> 'walk', find that 'walking' is the most common full word in the index.

Are both in the index?  I would think this is going to take some
application-specific logic, since Lucene doesn't inherently track
the relation between a stem and its original forms.  You might be
able to string something together using some of the regular
expression/wildcard queries, but it is going to take some work on
your part.

Another approach might be to put some mechanisms in place during  
analysis that track this information.
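
As a rough illustration of tracking this during analysis, the sketch below counts every surface form seen for each stem while tokenizing. It is plain Python, not Lucene or PyLucene code: `toy_stem` is a hypothetical stand-in for a real Porter stemmer, and `SurfaceFormTracker` is an invented name. In a real setup the same bookkeeping would presumably live in a custom token filter placed just before the stemming step.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a Porter stemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class SurfaceFormTracker:
    """Records which unstemmed forms produced each stem, with counts."""

    def __init__(self, stem=toy_stem):
        self._stem = stem
        self._forms = defaultdict(Counter)  # stem -> Counter of surface forms

    def analyze(self, text):
        """Tokenize on whitespace, record each surface form, return the stems."""
        stems = []
        for token in text.lower().split():
            stem = self._stem(token)
            self._forms[stem][token] += 1
            stems.append(stem)
        return stems

    def most_common_form(self, stem):
        """Return the most frequent surface form for a stem, or None if unseen."""
        forms = self._forms.get(stem)
        if not forms:
            return None
        return forms.most_common(1)[0][0]


tracker = SurfaceFormTracker()
tracker.analyze("walking walked walking walks")
tracker.analyze("she was walking home")
print(tracker.most_common_form("walk"))
```

The counts here live in memory only. For a large index they would need to be persisted alongside it, for example in a side table keyed by stem.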

> - Is there a way to get a list of all the terms in the index (or  
> maybe just the top n) ordered by descending frequency of usage? I  
> imagine it's related to docFreq, but can't see how to get a list of  
> terms in all documents.

Have a look at Luke if you just want the info as part of a UI.  Also,
I _believe_ Solr has added a LukeRequestHandler (see http://). I'm not
sure if it does everything you are looking for, but it might be a
place to start.  You might ask your question on the Solr mailing list.
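
If you would rather do it by hand, the selection step is small once you have (term, docFreq) pairs. As far as I know, Lucene of this vintage exposes them through IndexReader.terms(), which returns a TermEnum with term() and docFreq(), and PyLucene mirrors that API. The sketch below deliberately leaves that part out and assumes the pairs are already available as a plain Python iterable. `top_terms` and the sample numbers are made up for illustration.

```python
import heapq


def top_terms(term_freqs, n=10):
    """Return the n (term, doc_freq) pairs with the highest doc_freq.

    term_freqs is any iterable of (term, doc_freq) pairs, e.g. collected by
    walking the index's term enumeration (assumed, not shown here).
    heapq.nlargest keeps only n items in memory, which helps when the
    term dictionary is large.
    """
    return heapq.nlargest(n, term_freqs, key=lambda pair: pair[1])


# Made-up sample data standing in for a real term enumeration.
sample = [("walk", 42), ("the", 310), ("lucene", 17), ("index", 58)]
for term, freq in top_terms(sample, n=3):
    print(term, freq)
```

Note that docFreq counts documents containing the term, not total occurrences, so it only approximates "frequency of usage".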

> I'm using PyLucene and Solr, so if there are easy solutions in  
> either of those that would be ideal.
> Thanks,
> alf.

Grant Ingersoll
