lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: multiple tokens at the same position
Date Fri, 25 May 2007 17:07:32 GMT
I can only speak to the "avoid matching stemmed
or canonical forms" part...

Yes, but you've got to do some fancy dancing when you index,
something like adding a special signifier to, say, the original token.
I'll ignore the canonical part of your question for the sake of
brevity.


Consider indexing "running".
You'd index both "run" (the stem) and "running$" (the original, marked).

Now, whenever you care about the original token, you append
the '$' to the term and search for that.
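
To make that concrete, here's a rough sketch in plain Java, with no
Lucene dependency. Everything in it is made up for illustration: the
Token class, the analyze method, and especially the toy stem table,
which only stands in for a real stemmer. It assumes both forms go at
the same position (increment 0 for the second one), as in the
original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MarkedTokens {
    /** Marker appended to the original (unstemmed) form. */
    static final String EXACT_MARKER = "$";

    /** Hypothetical stand-in for a real stemmer; it only knows these words. */
    static final Map<String, String> TOY_STEMS =
            Map.of("running", "run", "runs", "run");

    /** A term plus its position increment (0 = same position as the previous token). */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    static String stem(String word) {
        return TOY_STEMS.getOrDefault(word, word);
    }

    /** Emits the stem, then the marked original stacked on the same position. */
    static List<Token> analyze(String text) {
        List<Token> out = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // leading whitespace yields an empty first piece
            }
            out.add(new Token(stem(word), 1));          // advances the position
            out.add(new Token(word + EXACT_MARKER, 0)); // same position as the stem
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("running")); // [run(+1), running$(+0)]
    }
}
```

In a real analyzer the same idea would live in a token filter, and
you'd want to pick a marker that your tokenizer and query parser
leave alone.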

This has one other advantage. Say the original word is just "run";
with the scheme above you'd index "run" and "run$". If you don't
mark the original with something like the $, you can't tell whether
a hit on "run" came from a document whose original word was "run" or
from one whose original word was "running" (stemmed to "run"). This
may be important for "exact match".
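
Here's a small sketch of that distinction, again in plain Java rather
than against the Lucene API. The in-memory postings map, the
buildIndex and search methods, and the one-entry stem table are all
assumptions made up for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ExactMatchDemo {
    /** Hypothetical stand-in for a real stemmer. */
    static final Map<String, String> TOY_STEMS = Map.of("running", "run");

    static String stem(String word) {
        return TOY_STEMS.getOrDefault(word, word);
    }

    /** Maps each term (the stem and the '$'-marked original) to the ids of the docs containing it. */
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String word : docs.get(docId).toLowerCase().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(stem(word), k -> new TreeSet<>()).add(docId);
                postings.computeIfAbsent(word + "$", k -> new TreeSet<>()).add(docId);
            }
        }
        return postings;
    }

    /** exact = true searches the marked original; otherwise the stem. */
    static Set<Integer> search(Map<String, Set<Integer>> postings, String word, boolean exact) {
        String lower = word.toLowerCase();
        String term = exact ? lower + "$" : stem(lower);
        return postings.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        // doc 0 contains "run", doc 1 contains "running"
        Map<String, Set<Integer>> index = buildIndex(List.of("run", "running"));
        System.out.println(search(index, "run", false));    // [0, 1]
        System.out.println(search(index, "run", true));     // [0]
        System.out.println(search(index, "running", true)); // [1]
    }
}
```

Without the marked terms, only the first kind of query is possible,
and both documents look the same to a search for "run".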

Best
Erick

On 5/25/07, Enis Soztutar <enis.soz.nutch@gmail.com> wrote:
>
> Hi,
>
> In Nutch we have a use case in which we need to store tokens with their
> original text plus their stemmed form plus their canonical form (through
> some asciifization). From my understanding of Lucene, it makes sense to
> write a TokenStream which generates several tokens for each "word", but
> places all the tokens for the "word" at the same position with
> Token#setPositionIncrement(0).
> This way we would be able to search over this field using any
> form (stemmed, canonical, original) of the "word". Actually, I have two
> questions here. First, is there any way to avoid matching stemmed
> or canonical forms in a phrase query? Second, it seems that adding
> multiple forms of each "word" alters the statistical calculations for
> scoring, especially tf and idf, because the frequency of the root
> form of the word is incremented at each word with that root form. Is
> there any way that we could avoid it?
>
