lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <>
Subject Re: multiple tokens at the same position
Date Fri, 25 May 2007 18:25:51 GMT

: Yes, indeed we could but it brings other problems, for example increasing
: the index size, and extending the query to search for multiple fields, etc.

1) if you index both teh raw and stemmed forms your index is going to grow
to roughly the same size regardless of wether the stem and the arw are in
teh same field or differnet fields.

2) For a particular user action, if you only want to search for one form
(either raw or stemmed) then your query doesn't have to search over
multiple fields -- it just has to search on which ever field you care
about for the particular user action.  if you want to search for *both*
the stemmed and raw forms n a single query, then the compelxity of hte
query is the same regardless of wether the two clauses are on the same
field or differnet fields.

: > > or canonical forms to a phrase query. Moreover it seems that adding
: > > multiple forms of the "word"s alters statistical calculations for
: > > scoring, especially for tf and idf, because the frequency of the root
: > > form of the word is incremented at each word with that root form. Is

it's a matter of opinion wether this is "right" or not ... if you are
storing both the raw and stemmed forms then in theory your tf/idf numbers
now both represent "twice" what they normally would and it balances out..
"dog" is not only a raw word, but also it's own stem, so a tf(docA,dog)=2
for one real instance of "dog" is just as correct as a doc that contains
"dogs" and gets tf(docB,dog)=1 and tf(docB,dogs)=1.

If you disagree with this line of thinking, asimple way to fix the problem
is to use a TokenFilter that removes any tokens at the same position which
contain the same text...


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message