From: Erik Hatcher
Subject: Re: Starts With x and Ends With x Queries
Date: Sun, 6 Feb 2005 09:00:26 -0500
To: "Lucene Users List"

On Feb 4, 2005, at 9:37 PM, Chris Hostetter wrote:
> If you want to start doing suffix queries (ie: all names ending with
> "s", or all names ending with "Smith") one approach would be to use
> WildcardQuery, which as Erik mentioned, will allow you to use a query
> Term that starts with a "*". ie...
>
>    Query q3 = new WildcardQuery(new Term("name","*s"));
>    Query q4 = new WildcardQuery(new Term("name","*Smith"));
>
> (NOTE: Erik says you can do this, but the docs for WildcardQuery say
> you can't. I'll assume the docs are wrong and Erik is correct.)

I assume you mean this comment in WildcardQuery's javadocs:

    "In order to prevent extremely slow WildcardQueries, a Wildcard
    term must not start with one of the wildcards * or ?."

I don't read that as saying you cannot use an initial wildcard
character, but rather that if you use a leading wildcard character
you risk performance issues. I'm going to change "must" to "should".
And yes, WildcardQuery itself supports a leading wildcard character
exactly as you have shown.

> Which leads me to my point: if you denormalize your data so that you
> store both the Term you want, and the *reverse* of the term you want,
> then a Suffix query is just a Prefix query on a reversed field -- by
> sacrificing space, you can get all the speed efficiencies of a
> PrefixQuery when doing a SuffixQuery...
>
>    D1> name:"Adam Smith" rname:"htimS madA" age:13 state:CA ...
>    D2> name:"Joe Bob" rname:"boB oeJ" age:42 state:WA ...
>    D3> name:"John Adams" rname:"smadA nhoJ" age:35 state:NV ...
>    D4> name:"Sue Smith" rname:"htimS euS" age:33 state:CA ...
>
>    Query q1 = new PrefixQuery(new Term("name","J"));
>    Query q2 = new PrefixQuery(new Term("name","Sue"));
>    Query q3 = new PrefixQuery(new Term("rname","s"));
>    Query q4 = new PrefixQuery(new Term("rname","htimS"));
>
> (If anyone sees a flaw in my theory, please chime in)

This trick has been mentioned on this list before, and is a good one.
I'll go one step further and mention another technique I found in the
book Managing Gigabytes, which makes "*string*" queries drastically
more efficient to search (at the cost of a larger index).

Take the term "cat". It would be indexed with all rotated variations,
with an end-of-word marker added:

    cat$
    at$c
    t$ca
    $cat

The query for "*at*" would be preprocessed and rotated such that the
wildcards are collapsed at the end, to search for "at*" as a
PrefixQuery. A wildcard in the middle of a string like "c*t" would
become a prefix query for "t$c*".

Has anyone tried this technique with Lucene?

    Erik
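
For anyone who wants to experiment, here is a minimal sketch of the reversed-term and rotated-term ideas in plain Java, with no Lucene dependency. The class and method names (RotatedTerms, reverse, rotations, toPrefix, matches) are made up for illustration, and the sketch assumes "$" never occurs inside a real term and that a pattern has either a single "*" or one at each end. In a real index each rotation would presumably be stored as its own term and the computed prefix handed to a PrefixQuery; here a simple startsWith check stands in for that lookup.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper only; none of these names come from Lucene.
final class RotatedTerms {
    // End-of-word marker; assumed never to appear inside a real term.
    static final char END = '$';

    /** Reverses a term, e.g. to fill an "rname"-style field. */
    static String reverse(String term) {
        return new StringBuilder(term).reverse().toString();
    }

    /** All rotations of term + END, e.g. "cat" -> cat$, at$c, t$ca, $cat. */
    static List<String> rotations(String term) {
        String marked = term + END;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < marked.length(); i++) {
            out.add(marked.substring(i) + marked.substring(0, i));
        }
        return out;
    }

    /** Rewrites a wildcard pattern into the prefix to look up among rotations. */
    static String toPrefix(String pattern) {
        int first = pattern.indexOf('*');
        int last = pattern.lastIndexOf('*');
        if (first < 0) {
            return pattern + END; // no wildcard: exact match on "term$"
        }
        if (first == last) {
            // Single wildcard A*B: rotate "A*B$" so the '*' ends up last -> "B$A".
            return pattern.substring(first + 1) + END + pattern.substring(0, first);
        }
        boolean infix = first == 0
                && last == pattern.length() - 1
                && pattern.indexOf('*', 1) == last;
        if (infix) {
            return pattern.substring(1, last); // "*X*": both wildcards collapse to "X*"
        }
        throw new IllegalArgumentException("unsupported pattern: " + pattern);
    }

    /** True if some rotation of the term starts with the pattern's prefix. */
    static boolean matches(String term, String pattern) {
        String prefix = toPrefix(pattern);
        for (String rotation : rotations(term)) {
            if (rotation.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With these helpers, toPrefix("c*t") gives "t$c", and matches("cat", "c*t") succeeds because the rotation "t$ca" starts with that prefix. The trade-off is the one noted above: a term of n characters contributes n + 1 index entries.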