Here's a short answer:
- Yes, the analyzer used during indexing must be the same as the one used
during query preparation (or at least compatible with it, so that the
query contains terms exactly equal to those that were eventually
stored in the index for your documents).
- There are a couple of queries (PrefixQuery, WildcardQuery) that can
search for terms partially matching those specified in the query. You
could search for all documents containing terms that begin with a
specified string, for example.
- There is a "FuzzyQuery" which matches based on the "edit distance".
The edit distance between two strings is defined as the number of
primitive edit operations that must be performed on one string to turn
it into the other (replace char, add char, delete char). This may be
useful for what you are trying to do.
- You could, perhaps, split all of the words in the source text (file
names in your case) into a stream of individual letters and index each
letter. Then you could split your query the same way and use a
PhraseQuery to search for sequences of letters in the source text that
match the sequence of letters in your query. You can also control the
tolerance of the match by changing the "slop" on the PhraseQuery. If
you go this route, which I have never tried myself, you might find that
index sizes get quite large and searches take longer, but maybe this
will be just what you need and it may be fast enough anyway.
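To make the "edit distance" from the FuzzyQuery point concrete, here is a
minimal sketch in plain Java. It is the textbook dynamic-programming
computation over the three operations listed above (replace, add, delete),
not Lucene's own code, and the class and method names (EditDistance,
levenshtein) are made up for illustration.

```java
public class EditDistance {

    // Hypothetical helper, not part of Lucene: counts the minimum number of
    // single-character replacements, insertions and deletions needed to
    // turn string a into string b.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous row
        int[] curr = new int[b.length() + 1]; // distances for the current row

        // Turning the empty prefix of a into b[0..j) takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning a[0..i) into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int replace = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), replace);
            }
            // The current row becomes the previous row for the next pass.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Two swapped letters count as two replacements under this definition.
        System.out.println(levenshtein("report.txt", "reprot.txt"));
        System.out.println(levenshtein("kitten", "sitting"));
    }
}
```

A small distance relative to the length of the term means the two strings
are "close", which is the kind of similarity a fuzzy match on file names
would rely on.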
If you have not already, check the FAQ for more information (linked from
http://www.lucene.com). Besides the FAQ, the next best resource is the
archives of this list and the lucene-user list. (Most of the lists'
history is still on SourceForge, I believe.)
Good luck!
Dmitry.