Message-ID: <dd2db8d0041104095239a16a9e@mail.gmail.com>
Date: Thu, 4 Nov 2004 10:52:21 -0700
From: Justin Swanhart <greenlion@gmail.com>
Reply-To: Justin Swanhart <greenlion@gmail.com>
To: Lucene Users List <lucene-user@jakarta.apache.org>
Subject: prefix wildcard matching options (*blah)

I'm thinking about making a separate field in my index for prefix
wildcard searches (a wildcard at the start of the term).
I would successively chop characters off the front of each term to
create "subtokens" for the prefix matches.

For the term: republican
terms created: republican epublican publican ublican blican
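
A minimal sketch of this expansion in plain Java, assuming a minimum
subtoken length (6 here, which reproduces the list above). The class
name, method name, and cutoff are hypothetical, and nothing here uses
Lucene's API:

```java
import java.util.ArrayList;
import java.util.List;

class SuffixTokens {
    // Returns the term itself followed by each shorter suffix that is
    // still at least minLength characters long.
    static List<String> expand(String term, int minLength) {
        List<String> out = new ArrayList<>();
        out.add(term); // always keep the full term
        for (int start = 1; term.length() - start >= minLength; start++) {
            out.add(term.substring(start));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [republican, epublican, publican, ublican, blican]
        System.out.println(expand("republican", 6));
    }
}
```

A smaller cutoff lets shorter search fragments match, at the cost of
more terms in the index.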

My query parser would then check whether a term has a wildcard as
its first character.  If so, instead of searching the normal field,
it would remove the wildcard from the start of the term and search
the prefix field instead.

A search for "*pub*" would be converted to "pub*" in the prefix field.
A search for "*blican" would be converted to "blican" in the prefix field.
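
The rewrite rule could be sketched like this in plain Java. The field
names "contents" and "contents_suffix" are made up for illustration,
and the result is just a "field:term" string rather than a real
Lucene Query object:

```java
class LeadingWildcardRewriter {
    // Hypothetical field names: the normal field and the subtoken field.
    static final String NORMAL_FIELD = "contents";
    static final String SUFFIX_FIELD = "contents_suffix";

    // Strips leading '*' characters and redirects the term to the
    // subtoken field; terms without a leading '*' are left alone.
    static String rewrite(String term) {
        int i = 0;
        while (i < term.length() && term.charAt(i) == '*') {
            i++;
        }
        if (i == 0 || i == term.length()) {
            // No leading wildcard, or nothing but wildcards: unchanged.
            return NORMAL_FIELD + ":" + term;
        }
        return SUFFIX_FIELD + ":" + term.substring(i);
    }

    public static void main(String[] args) {
        System.out.println(rewrite("*pub*"));   // contents_suffix:pub*
        System.out.println(rewrite("*blican")); // contents_suffix:blican
        System.out.println(rewrite("pub*"));    // contents:pub*
    }
}
```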

Does this sound like a sensible way to make prefix wildcard queries fast?

Can I index the prefix field with a separate analyzer that makes the
prefix tokens, or should I just do the index-time expansion manually?
I wouldn't need to search with this analyzer, just index with it,
because searching doesn't have to expand all those terms.

If using a separate analyzer for the prefix field makes more sense, how
do I make a tokenizer that returns multiple tokens for one word?
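
One common pattern for emitting several tokens per input word is to
buffer the extra tokens in a queue and hand them out on later calls.
The sketch below shows that pattern over a plain java.util.Iterator
rather than Lucene's token stream classes, so the class name and the
minLength cutoff are assumptions, not Lucene API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class SuffixExpandingIterator implements Iterator<String> {
    private final Iterator<String> input;
    private final int minLength;
    // Tokens produced from the current word but not yet returned.
    private final Deque<String> pending = new ArrayDeque<>();

    SuffixExpandingIterator(Iterator<String> input, int minLength) {
        this.input = input;
        this.minLength = minLength;
    }

    @Override
    public boolean hasNext() {
        return !pending.isEmpty() || input.hasNext();
    }

    @Override
    public String next() {
        if (pending.isEmpty()) {
            if (!input.hasNext()) {
                throw new NoSuchElementException();
            }
            // Pull one word and queue it together with its suffixes.
            String word = input.next();
            pending.add(word);
            for (int start = 1; word.length() - start >= minLength; start++) {
                pending.add(word.substring(start));
            }
        }
        return pending.remove();
    }

    // Convenience helper: collects everything the iterator emits.
    static List<String> drain(Iterator<String> words, int minLength) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new SuffixExpandingIterator(words, minLength);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [grand, republican, epublican, publican, ublican, blican]
        System.out.println(drain(Arrays.asList("grand", "republican").iterator(), 6));
    }
}
```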
