lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joshua O'Madadhain <jmad...@ics.uci.edu>
Subject RE: contains
Date Fri, 12 Jul 2002 16:45:29 GMT
On Fri, 12 Jul 2002, Lothar Simon wrote:

[in response to Peter Carlson pointing out that searching for *xyz* is a
difficult problem]
> Of course you are right. And I am surely more the last then the first
> one to try to come up with THE solution for this. But still... Could
> the following work?
> 
> If space (ok, a lot) is available you could store "beutiful",
> "eutiful", "utiful", "tiful", "iful", "ful", "ul", "l" PLUS its
> inversions ("lufitueb", "ufitueb", "fitueb", "itueb", "tueb", "ueb",
> "eb", "b") in the index. Space needed would be something like (average
> no of chars per word) as much as in a "normal" index.

Actually it would be twice that, because you're storing backward and
forward versions.  I'd hazard a guess that this factor alone would mean
something like a 10- or 12-fold increase in index size (the average length
of a word is less than 5 or 6 letters, but by throwing out stop words you
throw out a lot of the words that drag the average down).

Another problem with this is that in order to be able to get from "ful" to
"beautiful", you have to store, in the index entry for "ful", (pointers
to) every single complete word in your document set that contains "ful" as
a substring.  Just _creating_ such an index would be extremely
time-consuming even with clever data structures, and consider how much
extra storage for pointers would be necessary for entries like "e" or "n".

Finally, you're not including all substrings: your scheme doesn't allow me
to search for "*uti*" and find "beautiful".  If you did, the number of
entries would then be multiplied by a factor of the _square_ of the
average number of characters per word.  (You might be able to avoid this
by doing prefix and suffix searches--which are difficult but less so--on
the strings you specify, though.)

There might be some clever way to get around these problems, but I suspect
that developing one would be a dissertation topic.  :)

Regards,

Joshua O'Madadhain
 
 jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.




--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message