Date: Sun, 6 Feb 2005 12:12:42 -0800 (PST)
From: Chris Hostetter
To: Lucene Users List
Subject: Re: Starts With x and Ends With x Queries

: book Managing Gigabytes, making "*string*" queries drastically more
: efficient for searching (though also impacting index size). Take the
: term "cat". It would be indexed with all rotated variations with an
: end of word marker added:
...
: The query for "*at*" would be preprocessed and rotated such that the
: wildcards are collapsed at the end to search for "at*" as a
: PrefixQuery. A wildcard in the middle of a string like "c*t" would
: become a prefix query for "t$c*".

That's a pretty slick trick. Considering how many Terms the index would
wind up containing in order to denormalize the data in that way, I wonder
if it would be more practical to index each of the characters as a
separate term, with the word repeated after the "end of word" character,
making wildcard searches into "phrase" searches (after doing the
preprocessing and rotating as you described).

I.e., index "cat" as: c a t $ c a t

  search for "*at*" as a phrase search for "a t"
  search for "*at"  as a phrase search for "a t $"
  search for "c*t"  as a phrase search for "t $ c"

...I'm fairly certain that would keep the index size much smaller (the
number of terms would be much smaller, while the average term frequency
wouldn't really increase), but I'm not sure if it would actually be any
faster. It depends on the algorithm/performance of PhraseQuery -- which
is something I haven't really looked into. It could very well be
significantly slower.

-Hoss
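
A minimal sketch of the character-phrase idea above, in plain Python rather than Lucene. It only illustrates how a word could be turned into character tokens and how a wildcard pattern could be rewritten into a phrase; it says nothing about how PhraseQuery would actually perform. The function names (index_tokens, phrase_for, contains_phrase, matches) are hypothetical, and the trailing-wildcard case ("at*" as "$ a t") is an assumption that extends the three cases listed in the message.

```python
END = "$"  # end-of-word marker, as used in the message


def index_tokens(word):
    """Character tokens for one word: the word, the marker, then the word again."""
    chars = list(word)
    return chars + [END] + chars


def phrase_for(pattern):
    """Rewrite a wildcard pattern into the list of tokens to search as a phrase.

    Supported shapes: "*x*", "*x", "x*" (assumed extension) and "x*y".
    """
    if "*" not in pattern:
        raise ValueError("pattern has no wildcard")
    leading = pattern.startswith("*")
    trailing = pattern.endswith("*")
    core = pattern.strip("*")
    if "*" in core:
        # Wildcard in the middle: rotate so the suffix comes first.
        head, _, tail = core.partition("*")
        if leading or trailing or "*" in tail:
            raise ValueError("unsupported pattern shape: " + pattern)
        return list(tail) + [END] + list(head)
    if leading and trailing:
        return list(core)            # "*at*" -> a t
    if leading:
        return list(core) + [END]    # "*at"  -> a t $
    return [END] + list(core)        # "at*"  -> $ a t (assumption)


def contains_phrase(tokens, phrase):
    """Naive phrase match: does phrase occur as a contiguous run in tokens?"""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def matches(word, pattern):
    """True if the word's token stream contains the rewritten phrase."""
    return contains_phrase(index_tokens(word), phrase_for(pattern))
```

For example, matches("cat", "c*t") checks whether the phrase "t $ c" occurs in "c a t $ c a t". A real index holding several words per field would also need some way to keep a phrase from spanning two adjacent words, which this sketch does not address.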