lucene-java-user mailing list archives

From Alf Eaton
Subject Re: Stemmed terms/common terms
Date Thu, 16 Aug 2007 16:13:48 GMT

On 16 Aug 2007, at 17:06, Grant Ingersoll wrote:

> On Aug 16, 2007, at 10:17 AM, Alf Eaton wrote:
>> A couple of questions about term frequencies and stemming:
>> - What's the best way to get the most common unstemmed form of a  
>> Porter-stemmed word from the index? For example given the stem  
>> 'walk', find that 'walking' is the most common full word in the  
>> index.
> Are both in the index?  I would think this is going to take some  
> application specific logic, since Lucene doesn't inherently track  
> these relations.  You might be able to string something together  
> using some of the regular expression/wildcard queries, but it is  
> going to take some work on your part.

Hmm, no - only the stemmed tokens are indexed; the full field text is
stored. I guess that means searching for the stem and then, using the
same logic as a highlighter, re-analyzing the stored text of each
matching document to find and count the original words.
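
A rough sketch of that counting step, in plain Java rather than against the Lucene API. It assumes the stored field text of the matching documents has already been fetched into a list. The names SurfaceFormCounter, toyStem and mostCommonForm are made up for illustration, and toyStem is only a crude stand-in for the Porter stemmer; in practice the same analyzer used at index time would have to be applied.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class SurfaceFormCounter {

    // Hypothetical stand-in for a Porter stemmer: strips a few common suffixes.
    static String toyStem(String word) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Returns the most frequent unstemmed word whose stem equals targetStem,
    // or null if no word in the stored text maps to that stem.
    static String mostCommonForm(List<String> storedFields, String targetStem,
                                 Function<String, String> stemmer) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : storedFields) {
            // Naive tokenization: lowercase, split on anything that is not a-z.
            for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (!word.isEmpty() && stemmer.apply(word).equals(targetStem)) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

The obvious cost is that every matching document has to be loaded and re-tokenized, so this would be slow for a common stem on a large index.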

> Another approach might be to put some mechanisms in place during  
> analysis that track this information.

How would you recommend doing this - by indexing the original word and
its stem at the same position (a positionIncrement of 0 for the second
token), perhaps?
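
To make the idea concrete, here is a minimal model of what I have in mind, again in plain Java and not real Lucene filter code. Tok, withStems and positions are hypothetical names; the only assumption about Lucene is that a token with a position increment of 0 lands on the same position as the token before it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class StemInjectingSketch {

    // Minimal model of a token: term text plus its position increment.
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Emits each original word with increment 1, followed by its stem with
    // increment 0 when the stem differs, so both share one position.
    static List<Tok> withStems(List<String> words, Function<String, String> stemmer) {
        List<Tok> out = new ArrayList<>();
        for (String word : words) {
            out.add(new Tok(word, 1));
            String stem = stemmer.apply(word);
            if (!stem.equals(word)) {
                out.add(new Tok(stem, 0));
            }
        }
        return out;
    }

    // Resolves increments to absolute positions, starting at 0 for the first token.
    static List<Integer> positions(List<Tok> tokens) {
        List<Integer> result = new ArrayList<>();
        int pos = -1;
        for (Tok t : tokens) {
            pos += t.positionIncrement;
            result.add(pos);
        }
        return result;
    }
}
```

For the words "she", "walking", "home" this would give the tokens she, walking, walk, home, with "walking" and "walk" on the same position. With the original forms indexed as terms, their frequencies could presumably be read from the term dictionary instead of being recounted from stored text, although something would still have to map each original term back to its stem.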

