From Erik Hatcher
Subject regex-based query contribution
Date Wed, 12 Oct 2005 23:44:42 GMT
I've developed normal and span-based Query implementations that match  
index terms with regular expressions rather than the simpler  
WildcardQuery syntax.  
This allows for queries like "abc[0-9]xyz" that would match abc1xyz,  
but not abc12xyz for example.
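
As a rough illustration of the matching rule only (not the contributed Query code), the sketch below filters a list of terms with java.util.regex. The class and method names are hypothetical, and it assumes the pattern must match the whole term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: keeps the terms a regex matches.
final class RegexTermFilter {
    // Assumption: the pattern has to match the entire term, not a substring.
    static List<String> matchingTerms(String regex, List<String> terms) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

With "abc[0-9]xyz" and the terms abc1xyz, abc12xyz and abcxyz, only abc1xyz is kept, because the character class consumes exactly one digit.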

I've seen a lot of interest lately in being able to do a phrase query  
with a nested wildcard term inside, such as "the q.*k brown f.x".  I  
turn a query like that into a SpanNearQuery of SpanTermQuery("the"),  
SpanPatternQuery("q.*k"), SpanTermQuery("brown"), and SpanPatternQuery 
("f.x") with a slop of 0.
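
The effect of a slop of 0 can be sketched without any Lucene classes: each clause has to match the token at the next consecutive position. The code below is a plain-Java simulation under that assumption. The names are hypothetical, tokens are split on whitespace, and every clause (including the literal words) is treated as a regex for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical simulation of an in-order, slop-0 phrase of per-term patterns.
final class PhrasePatternSketch {
    // True if clause i fully matches the token at position start + i, for every i.
    static boolean matchesAt(String[] tokens, int start, List<Pattern> clauses) {
        if (start < 0 || start + clauses.size() > tokens.length) {
            return false;
        }
        for (int i = 0; i < clauses.size(); i++) {
            if (!clauses.get(i).matcher(tokens[start + i]).matches()) {
                return false;
            }
        }
        return true;
    }

    // Returns the first token position where the phrase matches, or -1 if none.
    static int firstMatch(String text, String phrase) {
        String[] tokens = text.trim().split("\\s+");
        List<Pattern> clauses = new ArrayList<>();
        for (String part : phrase.trim().split("\\s+")) {
            clauses.add(Pattern.compile(part));
        }
        for (int start = 0; start + clauses.size() <= tokens.length; start++) {
            if (matchesAt(tokens, start, clauses)) {
                return start;
            }
        }
        return -1;
    }
}
```

For the text "see the quick brown fox jump", the phrase "the q.*k brown f.x" matches starting at token position 1. It does not match "the quick red fox", because "red" sits where "brown" is required and a slop of 0 allows no gaps.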

The code is fairly minimal thanks to the wonderful infrastructure  
already provided.  I'm ready to contribute it to Lucene.  The  
question is, where?  Should this be part of the core?  Or should it  
reside in a contrib area?  If in contrib, shall it be a new area  
called "regex" perhaps, or "regex-query"?

I'm inclined to put it in the core, so if I don't hear otherwise I'll  
start with it there.

The main negative to this query, just like with WildcardQuery and  
FuzzyQuery, is the possible performance issue.  However, just like  
WildcardQuery, this really depends on how clever the indexing side of  
things is and on matching that cleverness with an appropriate regex.  My  
actual use of these queries involves overlapped rotated term  
indexing and also rotating the query term to have the best possible  
prefix for term enumeration.  Naive use of this query using ".*foo"  
of course will have the same impact as WildcardQuery using *foo - and  
perhaps slightly slower with regex matching involved.
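
The exact rotation scheme is not spelled out here, so the following is only a sketch of the general idea, using the classic permuterm-style rotation as a stand-in. Every rotation of a term plus an end marker is indexed, and a single-wildcard query such as *foo is rotated so that its literal part becomes a prefix. The class name, the '$' marker, and the restriction to exactly one '*' are all assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical permuterm-style sketch; not the scheme used in the contributed code.
final class RotationSketch {
    static final char END = '$'; // assumed end-of-term marker, absent from real terms

    // All rotations of term + END; these would be the keys written to the index.
    static List<String> rotations(String term) {
        String marked = term + END;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < marked.length(); i++) {
            out.add(marked.substring(i) + marked.substring(0, i));
        }
        return out;
    }

    // Rotates a pattern with exactly one '*' so the wildcard lands at the end;
    // the returned string is the literal prefix to enumerate from.
    static String rotateQuery(String wildcard) {
        int star = wildcard.indexOf('*');
        if (star < 0 || wildcard.indexOf('*', star + 1) >= 0) {
            throw new IllegalArgumentException("expected exactly one '*'");
        }
        return wildcard.substring(star + 1) + END + wildcard.substring(0, star);
    }

    // True if some rotation of the term starts with the rotated query prefix.
    static boolean matches(String term, String wildcard) {
        String prefix = rotateQuery(wildcard);
        for (String rotation : rotations(term)) {
            if (rotation.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Here *foo rotates to the prefix foo$, which matches the rotation foo$bar of the term barfoo. A leading wildcard therefore becomes a prefix lookup instead of a scan over every term, at the cost of a larger index.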

Overall, I think it is a good addition and will allow users to be  
more expressive than the lower-level MultiPhraseQuery (aka  



