lucene-java-user mailing list archives

From Enis Soztutar
Subject multiple tokens at the same position
Date Fri, 25 May 2007 14:43:54 GMT

In Nutch we have a use case in which we need to store each token in its
original form, its stemmed form, and its canonical form (obtained
through some asciification). From my understanding of Lucene, it makes
sense to write a TokenStream that generates several tokens for each
"word" but places all of the tokens for that "word" at the same
position. This way we would be able to search over this field using any
form (stemmed, canonical, or original) of the "word".

I have two questions here. First, is there any way to avoid matching
the stemmed or canonical forms in a phrase query? Second, adding
multiple forms of each "word" seems to alter the statistics used for
scoring, especially tf and idf, because the frequency of the root form
is incremented for every word that shares that root. Is there any way
to avoid this?
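
To make the idea concrete, here is a minimal sketch in plain Java that does
not use the Lucene API at all. It only models a token as a term plus a
position increment, where an increment of 0 means "same position as the
previous token" (that is the convention Lucene uses for position
increments). Everything else is an assumption for illustration: the names
StackedTokens, Tok, asciiFold and toyStem are hypothetical, the stemmer is a
toy, and accent stripping stands in for whatever canonical form is really
needed.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StackedTokens {

    /** A term plus its position increment relative to the previous token. */
    public static final class Tok {
        public final String term;
        public final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Hypothetical canonical form: lowercase, then strip combining accents. */
    static String asciiFold(String word) {
        String decomposed = Normalizer.normalize(
                word.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    /** Toy stemmer (an assumption, not a real algorithm): drops "ing" or a final "s". */
    static String toyStem(String word) {
        if (word.length() > 4 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /**
     * Emits original, canonical and stemmed forms for each word. The first
     * form advances the position by 1; the others use increment 0 so they
     * share that position. Identical forms are emitted only once per word.
     */
    public static List<Tok> expand(List<String> words) {
        List<Tok> out = new ArrayList<>();
        for (String word : words) {
            Set<String> forms = new LinkedHashSet<>();
            forms.add(word);
            String canonical = asciiFold(word);
            forms.add(canonical);
            forms.add(toyStem(canonical));
            boolean first = true;
            for (String form : forms) {
                out.add(new Tok(form, first ? 1 : 0));
                first = false;
            }
        }
        return out;
    }

    /** Turns the increments into absolute positions, starting from -1. */
    public static List<Integer> positions(List<Tok> tokens) {
        List<Integer> result = new ArrayList<>();
        int position = -1;
        for (Tok t : tokens) {
            position += t.positionIncrement;
            result.add(position);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Tok> tokens = expand(List.of("Jumping", "caf\u00e9s"));
        List<Integer> pos = positions(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(pos.get(i) + "\t" + tokens.get(i).term);
        }
    }
}
```

In a real Lucene TokenStream the same increments would be set on the tokens
themselves. The sketch also shows where the scoring concern comes from: each
extra form is indexed as a term occurrence of its own, so a stem shared by
several words is counted once for each of them.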
