lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lothar Simon" <lothar.si...@eidon.de>
Subject RE: contains
Date Tue, 16 Jul 2002 16:27:50 GMT
Just to correct a few points:
- The factor would be 2 * (average no of chars per word)/2 = (average no of
chars per word).
- One would probably create a set of 2 * (maximum number of chars per word)
as Fields for a document. If this could work was actually my question...
- Most important: my proposal is exactly (and almost only) designed to solve
the substring ("*uti*") problem !!! One field in the first group of fields
in my example contains "utiful" and would be found by "uti*", a field in the
other group of fields contains "itueb" and would be found by "itu*". Voila!

I still think my idea would work (given you spend the space for the index).

Lothar


-----Original Message-----
From: Joshua O'Madadhain [mailto:jmadden@ics.uci.edu]
Sent: Friday, July 12, 2002 6:45 PM
To: Lucene Users List
Subject: RE: contains


On Fri, 12 Jul 2002, Lothar Simon wrote:

[in response to Peter Carlson pointing out that searching for *xyz* is a
difficult problem]
> Of course you are right. And I am surely more the last then the first
> one to try to come up with THE solution for this. But still... Could
> the following work?
>
> If space (ok, a lot) is available you could store "beutiful",
> "eutiful", "utiful", "tiful", "iful", "ful", "ul", "l" PLUS its
> inversions ("lufitueb", "ufitueb", "fitueb", "itueb", "tueb", "ueb",
> "eb", "b") in the index. Space needed would be something like (average
> no of chars per word) as much as in a "normal" index.

Actually it would be twice that, because you're storing backward and
forward versions.  I'd hazard a guess that this factor alone would mean
something like a 10- or 12-fold increase in index size (the average length
of a word is less than 5 or 6 letters, but by throwing out stop words you
throw out a lot of the words that drag the average down).

Another problem with this is that in order to be able to get from "ful" to
"beautiful", you have to store, in the index entry for "ful", (pointers
to) every single complete word in your document set that contains "ful" as
a substring.  Just _creating_ such an index would be extremely
time-consuming even with clever data structures, and consider how much
extra storage for pointers would be necessary for entries like "e" or "n".

Finally, you're not including all substrings: your scheme doesn't allow me
to search for "*uti*" and find "beautiful".  If you did, the number of
entries would then be multiplied by a factor of the _square_ of the
average number of characters per word.  (You might be able to avoid this
by doing prefix and suffix searches--which are difficult but less so--on
the strings you specify, though.)

There might be some clever way to get around these problems, but I suspect
that developing one would be a dissertation topic.  :)

Regards,

Joshua O'Madadhain

 jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.




--
To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>


--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message