lucene-dev mailing list archives

From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Re: Token retrieval question
Date Fri, 12 Oct 2001 16:47:27 GMT


Anders Nielsen wrote:

>Can't you just keep 2 fields, one with the stemmed version of the text used
>for indexing purposes (index but not stored) and a second field with the
>original text (un-indexed but stored). Then when you know you got a match on
>the nth term in the stemmed version, you can use the same Analyzer but
>without the stemming on the stored text field, and take the nth term from
>that?
>
Yes, that is an option in some applications. Unfortunately, what I need 
to do involves collating terms from many documents (those selected by 
some query). The implementation I've been using stores the information 
in the document itself, then retrieves the documents, re-parses the 
information, and collates the terms. The problem is that retrieving 
documents is comparatively slow, especially if they contain large 
amounts of data. As a result, this solution is not workable beyond 
roughly 1500 documents for real-time queries, so I'm looking for a 
better option.
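
A minimal sketch of the two-field lookup suggested above, for reference. 
The tokenizer and the suffix-stripping stemmer are hypothetical 
stand-ins, not Lucene's Analyzer API; the point is only that the nth 
stemmed term and the nth original term share a position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NthTermLookup {

    // Hypothetical tokenizer: lower-cases and splits on anything that is not a-z.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Toy stemmer (an assumption, not a real algorithm): strips one trailing "s".
    static String stem(String term) {
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Finds the first position n whose stemmed form matches the stemmed query
    // term and returns the original word at that same position, or null.
    static String originalForMatch(String storedText, String queryTerm) {
        List<String> original = tokenize(storedText);
        String target = stem(queryTerm.toLowerCase(Locale.ROOT));
        for (int n = 0; n < original.size(); n++) {
            if (stem(original.get(n)).equals(target)) {
                return original.get(n);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(originalForMatch("Black pens and blue pencils", "pen"));
    }
}
```

The one-to-one position mapping holds here only because the toy stemmer 
never drops or merges terms.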

What I may be able to do is add term vector storage for documents and 
then have two fields: one indexed and tvstored with stemmed terms, and 
another not indexed but tvstored with the original words. This might 
work out because (hopefully) retrieving term vectors would be faster 
than retrieving documents.
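
As a rough illustration of collating from term vectors instead of stored 
documents, here is a minimal in-memory sketch. The list of maps is a 
hypothetical stand-in for per-document term vector storage (it is not a 
Lucene API), and the hit list is assumed to come from some query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class TermVectorCollation {

    // Builds one term vector (original word -> frequency) per document.
    // In a real index this would be written once, at indexing time.
    static List<Map<String, Integer>> buildVectors(String... texts) {
        List<Map<String, Integer>> vectors = new ArrayList<>();
        for (String text : texts) {
            Map<String, Integer> tv = new TreeMap<>();
            for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (!t.isEmpty()) {
                    tv.merge(t, 1, Integer::sum);
                }
            }
            vectors.add(tv);
        }
        return vectors;
    }

    // Collates terms across the documents selected by a query, reading only
    // the compact term vectors and never re-parsing the document text.
    static Map<String, Integer> collate(List<Map<String, Integer>> vectors, int[] hits) {
        Map<String, Integer> totals = new TreeMap<>();
        for (int doc : hits) {
            for (Map.Entry<String, Integer> e : vectors.get(doc).entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> vectors =
            buildVectors("black pens", "blue pens and black ink", "red pencils");
        int[] hits = {0, 1}; // document numbers assumed to be returned by a query
        System.out.println(collate(vectors, hits));
    }
}
```

Whether this is actually faster than document retrieval depends on how 
the term vectors are laid out on disk, which the sketch does not model.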

>
>The only trouble I can see with that is if the stemmer either skips terms or
>makes two terms into one.
>
I've thought about this, and the conclusion I came to is that we might 
want to separate term re-writing from stemming and treat them as 
distinct phases of the analyzer's process. This would provide a nice 
framework for handling languages that use compound words. An example 
would be German (and I'm not a German speaker myself), where someone 
who wants to say "black pen" says it as one word. However, when 
searching for a black pen, they might search for "pen", regardless of 
the color of its ink. So I'm thinking that the term re-writing phase 
would output the original term plus any other terms that can be derived 
from it (using a dictionary lookup of some sort).
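
A minimal sketch of the two phases, assuming a tiny hard-coded dictionary 
and a toy stemmer (both hypothetical, and not part of Lucene). It uses 
the German compound "Bleistift" (pencil, from "Blei" + "Stift") purely 
as an illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RewriteThenStem {

    // Hypothetical dictionary of known words used to split compounds.
    static final Set<String> DICTIONARY =
        new HashSet<>(Arrays.asList("blei", "stift"));

    // Phase 1, re-writing: emit the original term, then the two parts of the
    // first split where both halves are dictionary words (if any).
    static List<String> rewrite(String term) {
        List<String> out = new ArrayList<>();
        out.add(term);
        for (int i = 1; i < term.length(); i++) {
            String head = term.substring(0, i);
            String tail = term.substring(i);
            if (DICTIONARY.contains(head) && DICTIONARY.contains(tail)) {
                out.add(head);
                out.add(tail);
                break;
            }
        }
        return out;
    }

    // Phase 2, stemming: a toy rule that strips one trailing "s".
    static String stem(String term) {
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Runs the two phases in order: re-write first, then stem every output.
    static List<String> analyze(String term) {
        List<String> out = new ArrayList<>();
        for (String t : rewrite(term)) {
            out.add(stem(t));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("bleistift"));
        System.out.println(analyze("pens"));
    }
}
```

Because re-writing only adds terms and always emits the original first, 
a position-based lookup could still find the original word as long as 
the derived terms are recorded at the same position.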

This stuff is longer term for me though, because our app's first 
priority is English, where these things occur but not as often.

>
>
>regards,
>Anders Nielsen
>
btw, I used to work with someone named Andrew Nelson :)


