lucene-dev mailing list archives

From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Solution for "unstemming" terms
Date Thu, 18 Oct 2001 21:03:15 GMT
I've found a pretty good solution for retrieving the unstemmed versions of 
index terms, in case anyone is interested. It uses only features that are 
already in the 1.2-rc1 release.

The trick is to create an additional field on each document (say "dict" 
for dictionary) and set it to contain a list of space-separated strings 
like this:

    cat:cats likeli:likely

And so on. So each term is composed of the stem, ':' and the unstemmed 
token. I had to create a custom Tokenizer that would split this string 
on spaces alone and not split the words at the ':' position. But there 
may be a different character that would work fine for one of the 
standard tokenizers.
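
To make the field layout concrete, here is a minimal sketch in plain Java (no Lucene classes) of how the "dict" value could be assembled and then split on whitespace only. The class name DictFieldSketch, both method names, and the toy stemmer are made up for illustration; a real setup would use whatever stemmer the analyzer applies and a custom Tokenizer rather than String.split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

class DictFieldSketch {

    /** Builds the space-separated "stem:token" value for the "dict" field. */
    static String buildDictValue(List<String> tokens, UnaryOperator<String> stemmer) {
        // LinkedHashSet drops repeated entries while keeping first-seen order.
        Set<String> entries = new LinkedHashSet<>();
        for (String token : tokens) {
            entries.add(stemmer.apply(token) + ":" + token);
        }
        return String.join(" ", entries);
    }

    /** Splits the field value on whitespace only, so ':' stays inside each term. */
    static List<String> splitOnWhitespace(String value) {
        List<String> terms = new ArrayList<>();
        for (String part : value.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Toy stemmer for illustration only (an assumption): strips a trailing "s".
        UnaryOperator<String> toyStemmer =
            t -> t.endsWith("s") ? t.substring(0, t.length() - 1) : t;

        String value = buildDictValue(Arrays.asList("cats", "cat", "dogs", "cats"), toyStemmer);
        System.out.println(value);
        System.out.println(splitOnWhitespace(value));
    }
}
```

Deduplicating the entries is optional; it only keeps the field value short when a document repeats the same word many times.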

When you need to retrieve all unstemmed forms for a particular stem, you 
simply open up a TermEnum for a term <dict:stem:> like this:
    TermEnum te = reader.terms(new Term("dict", stem + ':'));

Then you just read the first term, or all of the terms that start with your 
stem followed by ':'. This works very fast because TermEnums are fast. You 
even get the unstemmed forms in sorted order for free!
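
The lookup loop can be sketched without Lucene by letting a sorted TreeSet stand in for the terms of the "dict" field. This is only a model of the idea, not the Lucene API: the class name UnstemLookupSketch and the method unstemmedForms are hypothetical. Against a real index, the same loop would presumably walk the TermEnum with term() and next(), and would also stop as soon as the term's field is no longer "dict".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class UnstemLookupSketch {

    /** Returns the unstemmed forms recorded for a stem, in sorted order. */
    static List<String> unstemmedForms(TreeSet<String> dictTerms, String stem) {
        String prefix = stem + ":";
        List<String> forms = new ArrayList<>();
        // tailSet(prefix) starts at the first term >= prefix, much like
        // positioning a term enumeration at a given term.
        for (String term : dictTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so no later term can match
            }
            forms.add(term.substring(prefix.length()));
        }
        return forms;
    }

    public static void main(String[] args) {
        // Sample terms (made up) as they might appear in the "dict" field.
        TreeSet<String> dictTerms = new TreeSet<>(Arrays.asList(
            "cat:cats", "cat:cat", "catalog:catalogs", "likeli:likely", "dog:dogs"));
        System.out.println(unstemmedForms(dictTerms, "cat"));
    }
}
```

Including the ':' in the prefix is what keeps the stem "cat" from also matching entries for a longer stem such as "catalog".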

- Dmitry
