lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Enis Soztutar" <>
Subject Re: multiple tokens at the same position
Date Fri, 25 May 2007 19:13:38 GMT
On 5/25/07, Chris Hostetter <> wrote:
> : Yes, indeed we could but it brings other problems, for example
> increasing
> : the index size, and extending the query to search for multiple fields,
> etc.
> 1) if you index both teh raw and stemmed forms your index is going to grow
> to roughly the same size regardless of wether the stem and the arw are in
> teh same field or differnet fields.

Well using a single field i think we do not need to store stemmed form of
the word if it is indeed the same with the original, so that the index will
not grow two times the size. However then one can say that using more than
one field, it is also possible to not index the root forms twice (but i am
not sure how to query them).

2) For a particular user action, if you only want to search for one form
> (either raw or stemmed) then your query doesn't have to search over
> multiple fields -- it just has to search on which ever field you care
> about for the particular user action.  if you want to search for *both*
> the stemmed and raw forms n a single query, then the compelxity of hte
> query is the same regardless of wether the two clauses are on the same
> field or differnet fields.

If we also stem the query from the user, then infact we need to only search
for the stemmed form. But it is then necessary to determine the language of
the query(if possible) in order to stem it.

: > > or canonical forms to a phrase query. Moreover it seems that adding
> : > > multiple forms of the "word"s alters statistical calculations for
> : > > scoring, especially for tf and idf, because the frequency of the
> root
> : > > form of the word is incremented at each word with that root form. Is
> it's a matter of opinion wether this is "right" or not ... if you are
> storing both the raw and stemmed forms then in theory your tf/idf numbers
> now both represent "twice" what they normally would and it balances out..
> "dog" is not only a raw word, but also it's own stem, so a tf(docA,dog)=2
> for one real instance of "dog" is just as correct as a doc that contains
> "dogs" and gets tf(docB,dog)=1 and tf(docB,dogs)=1.

You are right about in that calculation, but i was wondering about the tf
and idf in cases where we do not store the root forms twice. In that case i
think tf of the root forms increases, however the derived forms
decreases(since total number of terms nearly doubles).

If you disagree with this line of thinking, asimple way to fix the problem
> is to use a TokenFilter that removes any tokens at the same position which
> contain the same text...

thanks for the pointer

> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
I think Erick's way of indexing solves both problems 1 and 2. However it is
effectively not different than storing the forms in seperate fields. How
would the query performance differ the case that we index all the forms in
one field(w/o storing root forms twice), and query only this field, from the
case which we index the forms in seperate fields and run the query in all of

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message