Message-ID: <3BCA1715.3050401@earthlink.net>
Date: Sun, 14 Oct 2001 16:52:05 -0600
From: Dmitry Serebrennikov
To: lucene-dev@jakarta.apache.org
Subject: Re: How to search/index in a generic way? - Urgent!
References: <1003095622.71288.ezmlm@jakarta.apache.org>

Here's a short answer:

- Yes, the analyzer used during indexing must be the same as the one
  used during query preparation (or at least the two must be compatible,
  so that the query contains terms that exactly equal those that were
  eventually stored in the index for your documents).
- There are a couple of queries (PrefixQuery, WildcardQuery) that can
  search for terms partially matching those specified in the query. You
  could search for all documents containing terms that begin with a
  specified string, for example.
- There is a "FuzzyQuery" which matches based on "edit distance". The
  edit distance between two strings is defined as the smallest number of
  primitive edit operations (replace char, add char, delete char) that
  must be performed on one string to turn it into the other. This may be
  useful for what you are trying to do.
- You could, perhaps, split all of the words in the source text (file
  names in your case) into a stream of individual letters and index each
  letter. Then you could split your query the same way and use a
  PhraseQuery to search for sequences of letters in the source text that
  match the sequence of letters in your query. You can also control the
  tolerance of the match by changing the "slop" on the PhraseQuery. If
  you go this route, which I have never tried, you might find that index
  sizes get quite large and searches take longer, but maybe this will be
  just what you need and it may be fast enough anyway.

If you have not already, check the FAQ for more information (linked from
http://www.lucene.com). Besides the FAQ, the next best resource is the
archive of this list and of the lucene-user list. (Most of the lists'
history is still on SourceForge, I believe.)

Good luck!
Dmitry.
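
To make the "edit distance" mentioned under FuzzyQuery concrete, here is a minimal sketch of the textbook (Levenshtein) computation in Python. It is not taken from Lucene's FuzzyQuery code and only shows what the number means; the function name edit_distance is made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Smallest number of single-character replacements, insertions and
    deletions needed to turn string a into string b.

    Illustrative helper only (hypothetical name, not a Lucene API).
    """
    # prev[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning the first i characters of a into "" takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # add cb
                prev[j - 1] + cost,  # replace ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]
```

Under this definition a swapped pair of letters counts as two edits, so edit_distance("report.txt", "reprot.txt") is 2.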