lucene-java-user mailing list archives

From "Enis Soztutar" <enis.soz.nu...@gmail.com>
Subject Re: multiple tokens at the same position
Date Fri, 25 May 2007 15:03:21 GMT
Yes, indeed we could, but it brings other problems: for example, it increases
the index size and requires extending every query to search multiple fields.
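
To make the trade-off concrete, here is a rough sketch of the single-field scheme from the quoted message below: several forms of each word stacked at one position through a position increment of 0. It is plain Java that only models the idea, not Lucene code. The Tok class, the toy stem() and canonical() helpers, and the sample words are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Plain-Java model of stacked tokens; a sketch, not Lucene code. */
final class StackedTokens {

    /** Hypothetical stand-in for a token: term text plus position increment. */
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Toy canonical form (assumption): lower-case and fold two letters to ASCII. */
    static String canonical(String word) {
        return word.toLowerCase(Locale.ROOT)
                   .replace('\u00f6', 'o')
                   .replace('\u00fc', 'u');
    }

    /** Toy stemmer (assumption): canonical form with a trailing "s" stripped. */
    static String stem(String word) {
        String c = canonical(word);
        return c.length() > 3 && c.endsWith("s") ? c.substring(0, c.length() - 1) : c;
    }

    /** Emits every distinct form of each word; only the first form advances the position. */
    static List<Tok> expand(List<String> words) {
        List<Tok> out = new ArrayList<>();
        for (String word : words) {
            Set<String> forms = new LinkedHashSet<>();
            forms.add(word);            // original text
            forms.add(stem(word));      // stemmed form
            forms.add(canonical(word)); // canonical form
            int increment = 1;
            for (String form : forms) {
                out.add(new Tok(form, increment));
                increment = 0;          // remaining forms share the same position
            }
        }
        return out;
    }

    /** Absolute positions: running sum of increments, so the first token lands at 0. */
    static List<Integer> positions(List<Tok> tokens) {
        List<Integer> result = new ArrayList<>();
        int position = -1;
        for (Tok t : tokens) {
            position += t.positionIncrement;
            result.add(position);
        }
        return result;
    }

    /** Term frequency per term text, counting every stacked form. */
    static Map<String, Integer> termFrequencies(List<Tok> tokens) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (Tok t : tokens) {
            tf.merge(t.term, 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        List<Tok> tokens = expand(List.of("Runs", "run"));
        List<Integer> pos = positions(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(pos.get(i) + "\t" + tokens.get(i).term);
        }
        System.out.println(termFrequencies(tokens));
    }
}
```

With the input "Runs run", the term "run" ends up with a frequency of 2 even though the literal word occurs once, which is the tf inflation asked about below. It also sits at both positions, so a phrase lookup by position cannot tell the stemmed form from the original.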

On 5/25/07, Steven Rowe <sarowe@syr.edu> wrote:
>
> Hi Enis,
>
> Enis Soztutar wrote:
> > In Nutch we have a use case in which we need to store tokens with their
> > original text, their stemmed form, and their canonical form (through
> > some asciifization). From my understanding of Lucene, it makes sense to
> > write a TokenStream which generates several tokens for each "word", but
> > places all the tokens for the "word" at the same position with
> > Token#setPositionIncrement(0).
> > This way we would be able to search over this field using any
> > form (stemmed, canonical, original) of the "word". Actually I have two
> > questions here. First, is there any way to avoid matching stemmed
> > or canonical forms in a phrase query? Second, it seems that adding
> > multiple forms of the "word"s alters the statistical calculations for
> > scoring, especially for tf and idf, because the frequency of the root
> > form of the word is incremented at each word with that root form. Is
> > there any way that we could avoid that?
>
> Answering both questions: Couldn't you just use a different field for
> each form?
>
> --
> Steve Rowe
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
