lucene-java-user mailing list archives

From Steven Rowe <sar...@syr.edu>
Subject Re: multiple tokens at the same position
Date Fri, 25 May 2007 14:50:46 GMT
Hi Enis,

Enis Soztutar wrote:
> In Nutch we have a use case in which we need to store tokens with their
> original text plus their stemmed form plus their canonical form (through
> some asciifization). From my understanding of Lucene, it makes sense to
> write a TokenStream which generates several tokens for each "word", but
> place all the tokens for the "word" at the same position with
> Token#setPositionIncrement(0).
> This way we would be able to search over this field using any
> form (stemmed, canonical, original) of the "word". Actually I have two
> questions here. First, is there any way to avoid matching stemmed
> or canonical forms in a phrase query? Moreover, it seems that adding
> multiple forms of the "word"s alters statistical calculations for
> scoring, especially for tf and idf, because the frequency of the root
> form of the word is incremented at each word with that root form. Is
> there any way that we could avoid it?

Answering both questions: Couldn't you just use a different field for
each form?
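
To illustrate why separate fields address both questions, here is a
minimal sketch. It is a toy in-memory model, not Lucene API: the class
MultiFieldSketch, the field names "original", "stemmed" and "canonical",
and the naive stem() helper are all assumptions made for illustration.
In Lucene itself the rough equivalent would be adding the same text to
three fields of a Document, each analyzed differently (for example via
PerFieldAnalyzerWrapper).

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy model (hypothetical, not Lucene): one postings map per field. */
public class MultiFieldSketch {

    // field name -> term -> positions at which the term occurs
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    /** Hypothetical stand-in for a real stemmer: strips a trailing "ing" or "ed". */
    public static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 4 && word.endsWith("ed")) {
            return word.substring(0, word.length() - 2);
        }
        return word;
    }

    /** "Asciifization": decompose accented characters, then drop the combining marks. */
    public static String asciify(String word) {
        return Normalizer.normalize(word, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    /** Indexes each word once per field, so no field sees the other forms. */
    public void addDocument(String text) {
        String[] words = text.trim().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            String lower = words[pos].toLowerCase(Locale.ROOT);
            add("original", words[pos], pos);
            add("stemmed", stem(lower), pos);
            add("canonical", asciify(lower), pos);
        }
    }

    private void add(String field, String term, int pos) {
        postings.computeIfAbsent(field, f -> new HashMap<>())
                .computeIfAbsent(term, t -> new ArrayList<>())
                .add(pos);
    }

    private List<Integer> positions(String field, String term) {
        return postings.getOrDefault(field, Collections.emptyMap())
                       .getOrDefault(term, Collections.emptyList());
    }

    /** Term frequency within a single field of the single indexed document. */
    public int termFreq(String field, String term) {
        return positions(field, term).size();
    }

    /** True if the terms occur at consecutive positions in the given field. */
    public boolean phraseMatch(String field, String... terms) {
        for (int start : positions(field, terms[0])) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                ok = positions(field, terms[i]).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        MultiFieldSketch index = new MultiFieldSketch();
        index.addDocument("Searching caf\u00e9 searched search");
        System.out.println("tf(original, search) = " + index.termFreq("original", "search"));
        System.out.println("tf(stemmed, search)  = " + index.termFreq("stemmed", "search"));
        System.out.println("phrase on original:  "
                + index.phraseMatch("original", "search", "search"));
        System.out.println("phrase on stemmed:   "
                + index.phraseMatch("stemmed", "search", "search"));
    }
}
```

In this model a phrase query run against "original" can only match the
literal tokens, and the frequency of a root form is counted only in
"stemmed", so the statistics of the original terms are not inflated by
the other forms.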

-- 
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
