To: java-dev@lucene.apache.org
From: Erik Hatcher
Subject: regex-based query contribution
Date: Wed, 12 Oct 2005 19:44:42 -0400

I've developed normal and span-based Query implementations that use regex to match index terms rather than the simplified WildcardQuery. This allows for queries like "abc[0-9]xyz" that would match abc1xyz, but not abc12xyz, for example.

I've seen a lot of interest lately in being able to do a phrase query with a nested wildcard term inside, such as "the q.*k brown f.x". I turn a query like that into a SpanNearQuery of SpanTermQuery("the"), SpanPatternQuery("q.*k"), SpanTermQuery("brown"), and SpanPatternQuery("f.x") with a slop of 0.

The code is fairly minimal thanks to the wonderful infrastructure already provided. I'm ready to contribute it to Lucene. The question is, where? Should this be part of the core? Or should it reside in a contrib area? If in contrib, shall it be a new area called "regex" perhaps, or "regex-query"? I'm inclined to put it in the core, so if I don't hear otherwise I'll start with it there.

The main drawback of this query, just like with WildcardQuery and FuzzyQuery, is the possible performance issue. However, just like WildcardQuery, this really depends on how clever the indexing side of things is and on matching that cleverness with an appropriate regex. My actual use of these queries involves overlapped rotated term indexing, and also rotating the query term so that it has the best possible prefix for term enumeration. Naive use of this query with ".*foo" will of course have the same impact as WildcardQuery with *foo, and will perhaps be slightly slower because of the regex matching involved.

Overall, I think it is a good addition and will allow users to be more expressive than the lower-level MultiPhraseQuery (aka PhrasePrefixQuery).

Thoughts?

    Erik
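
For illustration only, here is a minimal sketch of the two matching behaviours described in the message, written against plain java.util.regex and in-memory lists instead of the Lucene classes. The class name RegexTermSketch and the methods matchingTerms and phrasePositions are hypothetical and are not part of the contribution; the sketch assumes a term matches only when the regex covers the whole term, and it treats the literal words of the phrase as (trivial) regexes as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; this is NOT the Lucene query code from the message.
final class RegexTermSketch {

    // Return the terms that the regex matches in full (not merely as a substring).
    public static List<String> matchingTerms(String regex, List<String> terms) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    // Slop-0 phrase matching: return every start position at which each
    // consecutive token fully matches the regex in the same slot.
    public static List<Integer> phrasePositions(String[] regexes, String[] tokens) {
        Pattern[] patterns = new Pattern[regexes.length];
        for (int i = 0; i < regexes.length; i++) {
            patterns[i] = Pattern.compile(regexes[i]);
        }
        List<Integer> starts = new ArrayList<>();
        for (int start = 0; start + patterns.length <= tokens.length; start++) {
            boolean allMatch = true;
            for (int i = 0; i < patterns.length && allMatch; i++) {
                allMatch = patterns[i].matcher(tokens[start + i]).matches();
            }
            if (allMatch) {
                starts.add(start);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("abc1xyz", "abc12xyz", "abcxyz");
        System.out.println(matchingTerms("abc[0-9]xyz", terms));

        String[] tokens = "the quick brown fox jumps".split(" ");
        String[] phrase = {"the", "q.*k", "brown", "f.x"};
        System.out.println(phrasePositions(phrase, tokens));
    }
}
```

With the sample data in main, only abc1xyz passes the term filter and the phrase matches at token position 0. A real implementation would enumerate terms from the index rather than scan a list, which is where the prefix and rotation tricks mentioned in the message matter for performance.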