lucene-dev mailing list archives

From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Re: How to search/index in a generic way? - Urgent!
Date Sun, 14 Oct 2001 22:52:05 GMT
Here's a short answer:

- Yes, the analyzer used during indexing must be the same as the one used
during query preparation (or at least compatible with it, so that the
query contains terms that exactly equal those that were eventually
stored in the index for your documents).

- There are a couple of queries (PrefixQuery, WildcardQuery) that can
search for terms partially matching those specified in the query. You
could search for all documents containing terms that begin with a
specified string, for example.

- There is a "FuzzyQuery" which matches based on the "edit distance".
The edit distance between two strings is defined as the minimum number
of primitive edit operations (replace char, add char, delete char) that
must be performed on one string to turn it into the other. This may be
useful for what you are trying to do.

- You could, perhaps, split all of the words in the source text (file
names in your case) into a stream of individual letters and index each
letter. Then you could split your query the same way and use a
PhraseQuery to search for sequences of letters in the source text that
match the sequence of letters in your query. You can also control the
tolerance of the match by changing the "slop" on the PhraseQuery. If
you go this route, which I have never tried, you might find that
index sizes get quite large and searches take longer, but maybe this
will be just what you need and it may be fast enough anyway.
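
To make the prefix and edit-distance ideas above a bit more concrete,
here is a rough sketch in plain Java. It does not use Lucene at all and
is not how PrefixQuery or FuzzyQuery are actually implemented. The class
name TermMatchingSketch, both method names, and the sample file-name
terms are made up for illustration. It only shows what "terms that
begin with a string" and "edit distance" mean when applied to a sorted
set of index terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustration only: hypothetical names, no Lucene classes involved.
public class TermMatchingSketch {

    // Edit distance (Levenshtein): the minimum number of single-character
    // replacements, additions, and deletions needed to turn a into b.
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty string into the first j chars of b takes j additions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i chars of a into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int replaceCost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int add = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int replace = prev[j - 1] + replaceCost;
                curr[j] = Math.min(Math.min(add, delete), replace);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Collect every term that begins with the given prefix. In a sorted
    // set those terms are contiguous, starting at the first term >= prefix.
    public static List<String> termsWithPrefix(SortedSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the block of terms sharing the prefix
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up terms, as if produced by analyzing some file names.
        SortedSet<String> terms = new TreeSet<>(
                Arrays.asList("readme", "report", "reports", "resume", "summary"));

        System.out.println(termsWithPrefix(terms, "rep"));
        System.out.println(editDistance("report", "resort"));
        System.out.println(editDistance("kitten", "sitting"));
    }
}
```

A fuzzy match would then amount to accepting any index term whose edit
distance from the query term is small enough. How small is a tuning
choice, and the threshold Lucene itself applies is not something this
sketch tries to reproduce.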

If you have not already, check the FAQ for more information (linked from
http://www.lucene.com). Besides the FAQ, the next best resource is the
archives of this list and the lucene-user list. (Most of the lists'
history is still on SourceForge, I believe.)

Good luck!
Dmitry.


