lucene-dev mailing list archives

From "Maurits van Wijland" <m.vanwijl...@quicknet.nl>
Subject Re: Solution for "unstemming" terms
Date Fri, 19 Oct 2001 06:11:24 GMT
Dmitry,

Can you send us the code? This is very useful!
I would like to experiment with this...

Maurits.
----- Original Message ----- 
From: "Dmitry Serebrennikov" <dmitrys@earthlink.net>
To: <lucene-dev@jakarta.apache.org>
Sent: Thursday, October 18, 2001 11:03 PM
Subject: Solution for "unstemming" terms


> I've found a pretty good solution for retrieving the unstemmed versions of 
> index terms, in case anyone is interested. It uses only the features 
> already in the 1.2-rc1 release.
> 
> The trick is to create an additional field on each document (say "dict" 
> for dictionary) and set it to contain a list of space-separated strings 
> like this:
> 
>     cat:cats likeli:likely
> 
> And so on. So each term is composed of the stem, ':' and the unstemmed 
> token. I had to create a custom Tokenizer that would split this string 
> on spaces alone and not split the words at the ':' position. But there 
> may be a different character that would work fine for one of the 
> standard tokenizers.
> 
> When you need to retrieve all unstemmed forms for a particular stem, you 
> simply open up a TermEnum for a term <dict:stem:> like this:
>     TermEnum te = reader.terms(new Term("dict", stem + ':'));
> 
> Then you just read the first one, or all of the ones that start with your 
> stem. This works very fast because TermEnums are fast. You even get the 
> unstemmed forms in a sorted order for free!
> 
> - Dmitry
> 
> 
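
For anyone wanting to experiment before the actual code is posted, here is a minimal sketch of the idea described above. It is not Dmitry's implementation and it does not use Lucene at all: a java.util.TreeSet stands in for the sorted term dictionary of the "dict" field, and splitting on single spaces stands in for the custom Tokenizer. The class name UnstemSketch, its method names, and the stemmer passed in as a Function are all hypothetical, and the code assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical stand-in for the "dict" field technique; no Lucene calls are used.
class UnstemSketch {
    // Sorted set of "stem:token" entries, playing the role of the
    // sorted term dictionary that a TermEnum walks through.
    private final TreeSet<String> dictTerms = new TreeSet<>();

    // Build the value of the "dict" field: space-separated "stem:token" entries.
    // The stemmer is supplied by the caller (an assumption of this sketch).
    static String buildDictField(List<String> tokens, Function<String, String> stemmer) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(stemmer.apply(token)).append(':').append(token);
        }
        return sb.toString();
    }

    // "Index" the field by splitting on spaces only, so that each
    // "stem:token" entry is kept whole as a single term.
    void index(String dictField) {
        for (String entry : dictField.split(" ")) {
            if (!entry.isEmpty()) {
                dictTerms.add(entry);
            }
        }
    }

    // Seek to the first term >= "stem:" and read forward while terms
    // still start with that prefix, as in the TermEnum approach.
    List<String> unstemmedForms(String stem) {
        String prefix = stem + ':';
        List<String> forms = new ArrayList<>();
        for (String term : dictTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the last entry for this stem
            }
            forms.add(term.substring(prefix.length()));
        }
        return forms;
    }
}
```

Because the ':' is part of the prefix, a lookup for the stem "cat" only sees entries such as cat:cats and never entries for a longer stem that merely begins with "cat". In real Lucene code the same loop would presumably run over the TermEnum returned by reader.terms(...), stopping at the first term whose field or text no longer matches.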

