lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ian Lea <...@digimem.net>
Subject RE: contains
Date Tue, 16 Jul 2002 20:43:06 GMT
I also think it might work.  Just for fun I tried a variation of it on
a copy of the unix file /usr/dict/words.

Indexing program reads each word and splits it up into substrings and
stores the original and the substrings e.g. "beautiful" stored as
Field.UnIndexed, substrings "beautiful eautiful autiful utiful tiful
iful ful ul l" stored as Field.UnStored. One Document per word.

Search program uses StandardAnalyzer and QueryParser.  Pass it "ful*"
and it returns 259 hits, including beautiful and beautifully.  Pass it
"eaut" and get 13 hits including beauteous and beauty.  ifu* gives 20
hits including beautiful, bifurcate and centrifuge.

There are 45,407 words in the file and the index, unoptimized, takes
up 3.6Mb of disk space.  Indexing the 45,407 words by themselves, one
Document per word with the word as Field.Text takes up 1.7Mb disk
space, unoptimized.

Didn't add inverse substrings since don't see why they are needed.
My sample program seems to work without them.  I expect I've missed
something or perhaps it works because the sample data is so simple.



--
Ian.
ian@digimem.net

> lothar.simon@eidon.de (Lothar Simon) wrote 
>
> Just to correct a few points:
> - The factor would be 2 * (average no of chars per word)/2 = (average no of
> chars per word).
> - One would probably create a set of 2 * (maximum number of chars per word)
> as Fields for a document. If this could work was actually my question...
> - Most important: my proposal is exactly (and almost only) designed to solve
> the substring ("*uti*") problem !!! One field in the first group of fields
> in my example contains "utiful" and would be found by "uti*", a field in the
> other group of fields contains "itueb" and would be found by "itu*". Voila!
> 
> I still think my idea would work (given you spend the space for the index).
> 
> Lothar
> 
> 
> -----Original Message-----
> From: Joshua O'Madadhain [mailto:jmadden@ics.uci.edu]
> Sent: Friday, July 12, 2002 6:45 PM
> To: Lucene Users List
> Subject: RE: contains
> 
> 
> On Fri, 12 Jul 2002, Lothar Simon wrote:
> 
> [in response to Peter Carlson pointing out that searching for *xyz* is a
> difficult problem]
> > Of course you are right. And I am surely more the last then the first
> > one to try to come up with THE solution for this. But still... Could
> > the following work?
> >
> > If space (ok, a lot) is available you could store "beutiful",
> > "eutiful", "utiful", "tiful", "iful", "ful", "ul", "l" PLUS its
> > inversions ("lufitueb", "ufitueb", "fitueb", "itueb", "tueb", "ueb",
> > "eb", "b") in the index. Space needed would be something like (average
> > no of chars per word) as much as in a "normal" index.
> 
> Actually it would be twice that, because you're storing backward and
> forward versions.  I'd hazard a guess that this factor alone would mean
> something like a 10- or 12-fold increase in index size (the average length
> of a word is less than 5 or 6 letters, but by throwing out stop words you
> throw out a lot of the words that drag the average down).
> 
> Another problem with this is that in order to be able to get from "ful" to
> "beautiful", you have to store, in the index entry for "ful", (pointers
> to) every single complete word in your document set that contains "ful" as
> a substring.  Just _creating_ such an index would be extremely
> time-consuming even with clever data structures, and consider how much
> extra storage for pointers would be necessary for entries like "e" or "n".
> 
> Finally, you're not including all substrings: your scheme doesn't allow me
> to search for "*uti*" and find "beautiful".  If you did, the number of
> entries would then be multiplied by a factor of the _square_ of the
> average number of characters per word.  (You might be able to avoid this
> by doing prefix and suffix searches--which are difficult but less so--on
> the strings you specify, though.)
> 
> There might be some clever way to get around these problems, but I suspect
> that developing one would be a dissertation topic.  :)
> 
> Regards,
> 
> Joshua O'Madadhain
> 
>  jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
>   Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
>  It's that moment of dawning comprehension that I live for--Bill Watterson
> My opinions are too rational and insightful to be those of any organization.

----------------------------------------------------------------------
Searchable personal storage and archiving from http://www.digimem.net/


Mime
View raw message