From Paul Elschot <>
Subject Re: regex-based query contribution
Date Thu, 13 Oct 2005 07:15:38 GMT
On Thursday 13 October 2005 01:44, Erik Hatcher wrote:
> I've developed normal and span-based Query implementations that use  
> regex to match index terms rather than the simplified WildcardQuery.   
> This allows for queries like "abc[0-9]xyz" that would match abc1xyz,  
> but not abc12xyz for example.
> I've seen a lot of interest lately in being able to do a phrase query  
> with a nested wildcard term inside, such as "the q.*k brown f.x".  I  
> turn a query like that into a SpanNearQuery of SpanTermQuery("the"),  
> SpanPatternQuery("q.*k"), SpanTermQuery("brown"), and SpanPatternQuery 
> ("f.x") with a slop of 0.
> The code is fairly minimal thanks to the wonderful infrastructure  
> already provided.  I'm ready to contribute it to Lucene.  The  
> question is, where?  Should this be part of the core?  Or should it  
> reside in a contrib area?  If in contrib, shall it be a new area  
> called "regex" perhaps, or "regex-query"?
> I'm inclined to put it in the core, so if I don't hear otherwise I'll  
> start with it there.
> The main negative to this query, just like with WildcardQuery and  
> FuzzyQuery, is the possible performance issue.  However, just like  
> WildcardQuery, this really depends on how clever the indexing side of  
> things is and matching that cleverness with an appropriate regex.  My
> actual use of these queries involves doing overlapped rotated term  
> indexing and also rotating the query term to have the best possible  
> prefix for term enumeration.  Naive use of this query using ".*foo"  
> of course will have the same impact as WildcardQuery using *foo - and  
> perhaps slightly slower with regex matching involved.
> Overall, I think it is a good addition and will allow users to be  
> more expressive than the lower-level MultiPhraseQuery (aka  
> PhrasePrefixQuery).
> Thoughts?

In the surround language, this was done by splitting the query term
into a fixed prefix and a remainder starting with a truncation character.
For this remainder a regular expression is built and used.
The prefix is used to limit the number of terms fed to the regular expression
matcher. The code is in here:

So, with an addition to the javadocs that the length of the prefix is
important for performance, I think a regular expression based query term
would be very useful, especially when combined with an analyzer that does
appropriate term rotation.
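
To illustrate the prefix-plus-remainder split described above, here is a
minimal, self-contained sketch. It is not the surround code and does not use
any Lucene API: the class PrefixRegexSketch and its methods are hypothetical,
a sorted TreeSet stands in for the index's term enumeration, and '*' and '?'
are assumed to be the truncation characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical sketch: not the surround implementation, no Lucene classes used.
public class PrefixRegexSketch {

    /** Length of the literal prefix before the first truncation character. */
    static int prefixLength(String term) {
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '*' || c == '?') {   // assumed truncation characters
                return i;
            }
        }
        return term.length();
    }

    /** Build a regex for the remainder: '*' becomes ".*", '?' becomes ".". */
    static Pattern remainderPattern(String remainder) {
        StringBuilder sb = new StringBuilder();
        for (char c : remainder.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));  // literal character
            }
        }
        return Pattern.compile(sb.toString());
    }

    /** Only terms sharing the prefix are ever handed to the regex matcher. */
    static List<String> match(NavigableSet<String> sortedTerms, String term) {
        int n = prefixLength(term);
        String prefix = term.substring(0, n);
        Pattern pattern = remainderPattern(term.substring(n));
        List<String> hits = new ArrayList<>();
        // Terms with a common prefix are contiguous in sorted order,
        // so seek to the prefix and stop at the first term without it.
        for (String t : sortedTerms.tailSet(prefix, true)) {
            if (!t.startsWith(prefix)) {
                break;
            }
            if (pattern.matcher(t.substring(n)).matches()) {
                hits.add(t);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>();
        terms.add("abc1xyz");
        terms.add("abc12xyz");
        terms.add("abd1xyz");
        terms.add("zzz");
        System.out.println(match(terms, "abc?xyz"));
    }
}
```

With an empty prefix (a term such as "*foo") the loop visits every term, which
is why the length of the prefix matters so much for performance.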

Paul Elschot

