lucene-java-user mailing list archives

From Alf Eaton <li...@hubmed.org>
Subject Re: Stemmed terms/common terms
Date Thu, 16 Aug 2007 16:13:48 GMT

On 16 Aug 2007, at 17:06, Grant Ingersoll wrote:

>
> On Aug 16, 2007, at 10:17 AM, Alf Eaton wrote:
>
>> A couple of questions about term frequencies and stemming:
>>
>> - What's the best way to get the most common unstemmed form of a  
>> Porter-stemmed word from the index? For example given the stem  
>> 'walk', find that 'walking' is the most common full word in the  
>> index.
>
> Are both in the index?  I would think this is going to take some  
> application specific logic, since Lucene doesn't inherently track  
> these relations.  You might be able to string something together  
> using some of the regular expression/wildcard queries, but it is  
> going to take some work on your part.

Hmm, no - only the stemmed token is indexed; the full field text is stored.
I guess that means running a search for the stem and then using the
same logic as a highlighter (re-analyzing the stored text) to find and
extract the actual terms from each matching document.
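
A rough sketch of that idea, language-neutral and not using the Lucene API: re-tokenize the stored text of each matching document, stem each token, and count the surface forms whose stem equals the one searched for. Everything here is an assumption for illustration: `toy_stem` is a crude stand-in for the Porter stemmer, and `docs` stands in for the stored field values returned by a search on the stem.

```python
import re
from collections import Counter


def toy_stem(word):
    # Hypothetical stand-in for a Porter stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def surface_forms(stem, stored_texts, stemmer=toy_stem):
    # Count every original word whose stem equals `stem`.
    counts = Counter()
    for text in stored_texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if stemmer(token) == stem:
                counts[token] += 1
    return counts


# Stored field text of the documents that matched the stem (made-up data).
docs = [
    "Walking is good; she walked home.",
    "He walks daily and enjoys walking.",
    "Walking the dog.",
]

forms = surface_forms("walk", docs)
most_common = forms.most_common(1)[0][0]
print(most_common, dict(forms))
```

The cost is one re-analysis pass over the stored text of every hit, so for a very common stem it may be worth sampling only the top N documents.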

> Another approach might be to put some mechanisms in place during  
> analysis that track this information.

How would you recommend doing this - using positionIncrement to store  
the stem and the original word at the same position, perhaps?
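
Roughly what I have in mind, as a sketch only (plain Python rather than a real Lucene TokenFilter; `toy_stem`, `tokens_with_stems` and `positions` are hypothetical names): emit the original word with a position increment of 1 and then its stem with an increment of 0, so that both end up at the same position.

```python
import re


def toy_stem(word):
    # Hypothetical stand-in for a Porter stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens_with_stems(text, stemmer=toy_stem):
    # Return (term, position_increment) pairs: original word first,
    # then its stem stacked on the same position (increment 0).
    out = []
    for word in re.findall(r"[a-z]+", text.lower()):
        out.append((word, 1))
        stem = stemmer(word)
        if stem != word:
            out.append((stem, 0))
    return out


def positions(stream):
    # Resolve increments to absolute positions; the first token lands at 0.
    pos = -1
    resolved = []
    for term, increment in stream:
        pos += increment
        resolved.append((pos, term))
    return resolved


stream = tokens_with_stems("She walked home")
print(positions(stream))
```

With both forms in the same field, the term frequencies of the unstemmed words would be available from the index, but finding which terms share a stem would still mean stemming each candidate term, so this is only part of an answer.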

alf.
