lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Starts With x and Ends With x Queries
Date Sun, 06 Feb 2005 20:12:42 GMT

: book Managing Gigabytes, making "*string*" queries drastically more
: efficient for searching (though also impacting index size).  Take the
: term "cat".  It would be indexed with all rotated variations with an
: end of word marker added:
    ...
: The query for "*at*" would be preprocessed and rotated such that the
: wildcards are collapsed at the end to search for "at*" as a
: PrefixQuery.  A wildcard in the middle of a string like "c*t" would
: become a prefix query for "t$c*".
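
A minimal sketch of the rotation scheme quoted above, in plain Python rather
than against the Lucene API. The function names (rotations, to_prefix,
matches) are made up for illustration, and the query rewriting only handles
patterns with a single "*" or the "*x*" form.

```python
def rotations(word, marker="$"):
    """Return every rotation of word + end-of-word marker (the indexed terms)."""
    s = word + marker
    return [s[i:] + s[:i] for i in range(len(s))]


def to_prefix(pattern, marker="$"):
    """Rotate a wildcard pattern so the wildcard ends up at the end.

    Returns the prefix to search for (without the trailing "*").
    """
    if len(pattern) >= 2 and pattern.startswith("*") and pattern.endswith("*"):
        inner = pattern[1:-1]
        if "*" in inner:
            raise ValueError("only one inner literal is supported in this sketch")
        return inner                      # "*at*" -> "at"
    if pattern.count("*") != 1:
        raise ValueError("expected exactly one '*' in this sketch")
    head, tail = pattern.split("*")
    return tail + marker + head           # "c*t" -> "t$c"


def matches(word, pattern):
    """Simulate the prefix query against the rotated terms of a single word."""
    prefix = to_prefix(pattern)
    return any(term.startswith(prefix) for term in rotations(word))


print(rotations("cat"))        # the terms that would be indexed for "cat"
print(to_prefix("c*t"))        # the prefix searched for "c*t"
print(matches("cat", "*at*"))
```

In a real index the rotated terms would be stored once per distinct word, and
the prefix returned by to_prefix would be handed to a PrefixQuery.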

That's a pretty slick trick.

Considering how many Terms the index would wind up containing in order to
denormalize the data in that way, I wonder if it would be more practical
to index each of the characters as a separate term, with the word repeated
after the "end of word" character, making wildcard searches into "phrase"
searches (after doing preprocessing and rotating as you described).

Ie, index "cat" as:   c a t $ c a t
  search for "*at*" as a phrase search for "a t"
  search for "*at"  as a phrase search for "a t $"
  search for "c*t"  as a phrase search for "t $ c"
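
A rough sketch of the character-per-term idea above, again in plain Python
and not using Lucene. The names char_terms, to_phrase and phrase_match are
hypothetical, and phrase_match only stands in for what a PhraseQuery would do
(find the terms at consecutive positions).

```python
def char_terms(word, marker="$"):
    """Index form of a word: its characters, the marker, then the characters again."""
    return list(word) + [marker] + list(word)


def to_phrase(pattern, marker="$"):
    """Turn a wildcard pattern into the list of single-character phrase terms."""
    if len(pattern) >= 2 and pattern.startswith("*") and pattern.endswith("*"):
        inner = pattern[1:-1]
        if "*" in inner:
            raise ValueError("only one inner literal is supported in this sketch")
        return list(inner)                          # "*at*" -> a t
    if pattern.count("*") != 1:
        raise ValueError("expected exactly one '*' in this sketch")
    head, tail = pattern.split("*")
    return list(tail) + [marker] + list(head)       # "c*t" -> t $ c


def phrase_match(terms, phrase):
    """True if the phrase terms occur at consecutive positions in terms."""
    n = len(phrase)
    return any(terms[i:i + n] == phrase for i in range(len(terms) - n + 1))


doc = char_terms("cat")
print(" ".join(doc))                                # c a t $ c a t
for pattern in ("*at*", "*at", "c*t"):
    print(pattern, phrase_match(doc, to_phrase(pattern)))
```

One caveat with this sketch: because the word is repeated, a pattern whose
literal parts together are longer than the word can still match (for example
"ca*at" matches "cat" here), so some extra length check would be needed.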

...I'm fairly certain that would keep the index size much smaller (the
number of terms would be much smaller, while the average term frequency
wouldn't really increase), but I'm not sure if it would actually be any
faster.  It depends on the algorithm/performance of PhraseQuery -- which is
something I haven't really looked into.  It could very well be
significantly slower.


-Hoss



