lucene-dev mailing list archives

From "Mikko Noromaa" <>
Subject RE: regex-based query contribution
Date Thu, 13 Oct 2005 11:36:31 GMT

> It would be possible to do a PatternQuery("*") that would
> enumerate every term.

Does this work differently than the current logic where wildcard queries are
constructed as BooleanQueries with many terms OR'ed together? I think this
would be a good change.

I have always thought that it is quite cumbersome to expand wildcards to
many boolean clauses. I think that keeping the wildcard (or regex in this
case) in the query object would be much better. On the other hand, it might
not make any difference in performance, since Lucene would still have to go
through all the terms. But at least it would avoid the
BooleanQuery$TooManyClauses exception even with thousands of different
terms. Right?
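
To illustrate the difference I have in mind, here is a rough sketch in plain Java. It is not Lucene code: the class ExpansionSketch, its methods, and the sample terms are all made up for this example, and MAX_CLAUSES is set to 1024 on the assumption that this matches the default clause limit. The first method collects every matching term into a clause list and fails once the limit is hit; the second keeps the pattern and tests each term as it is enumerated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical illustration only: none of these names are Lucene classes.
class ExpansionSketch {

    // Assumed stand-in for the clause limit behind TooManyClauses.
    static final int MAX_CLAUSES = 1024;

    // Approach 1: expand the pattern into an explicit list of OR'ed terms.
    // Fails as soon as more than MAX_CLAUSES terms match.
    static List<String> expandToClauses(SortedSet<String> terms, Pattern p) {
        List<String> clauses = new ArrayList<>();
        for (String t : terms) {
            if (p.matcher(t).matches()) {
                if (clauses.size() == MAX_CLAUSES) {
                    throw new IllegalStateException("too many clauses");
                }
                clauses.add(t);
            }
        }
        return clauses;
    }

    // Approach 2: keep the pattern and test each term while enumerating.
    // Every term is still visited, but no clause list is ever built.
    static int countMatches(SortedSet<String> terms, Pattern p) {
        int n = 0;
        for (String t : terms) {
            if (p.matcher(t).matches()) {
                n++;
            }
        }
        return n;
    }

    // Synthetic term dictionary: one base word with many distinct endings.
    static SortedSet<String> sampleTerms(int variants) {
        SortedSet<String> terms = new TreeSet<>();
        for (int i = 0; i < variants; i++) {
            terms.add("talo" + i);
        }
        terms.add("auto");
        return terms;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = sampleTerms(5000);
        Pattern p = Pattern.compile("talo.*");
        System.out.println("matches while enumerating: " + countMatches(terms, p));
        try {
            expandToClauses(terms, p);
        } catch (IllegalStateException e) {
            System.out.println("expansion failed: " + e.getMessage());
        }
    }
}
```

Both approaches walk the same terms, so the sketch says nothing about speed. It only shows that the second one has no clause count to overflow.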

I know I can increase the limit of the boolean queries, but there is still a
limit. In my application, I index Finnish text which has lots of different
suffixes for the same word. With compound words included, I could easily
imagine that the same base word may have hundreds or thousands of terms in
the index.


Mikko Noromaa - tel. +358 40 7348034
Noromaa Solutions

> -----Original Message-----
> From: Erik Hatcher [] 
> Sent: Thursday, October 13, 2005 1:54 PM
> To:
> Subject: Re: regex-based query contribution
> On Oct 13, 2005, at 3:15 AM, Paul Elschot wrote:
> >> The main negative to this query, just like with WildcardQuery and
> >> FuzzyQuery, is the possible performance issue.  However, just like
> >> WildcardQuery, this really depends on how clever the indexing side of
> >> things is and matching that cleverness with an appropriate regex.  My
> >> actual use of these queries involves doing overlapped rotated term
> >> indexing and also rotating the query term to have the best possible
> >> prefix for term enumeration.  Naive use of this query using ".*foo"
> >> of course will have the same impact as WildcardQuery using *foo - and
> >> perhaps slightly slower with regex matching involved.
> >>
> >> Overall, I think it is a good addition and will allow users to be
> >> more expressive than the lower-level MultiPhraseQuery (aka
> >> PhrasePrefixQuery).
> >>
> >> Thoughts?
> >>
> >
> > In the surround language, this was done by splitting the query term
> > into a fixed prefix and a remainder starting with a truncation character.
> > For this remainder a regular expression is built and used.
> > The prefix is used to limit the number of terms fed to the regular
> > expression matcher. The code is in here:
> >
> > surround/src/java/org/apache/lucene/queryParser/surround/query/
>
> Likewise with my PatternQuery - it limits the term enumeration just  
> as WildcardQuery does, to the fixed prefix.
>
> > So, with an addition to the javadocs that the length of the prefix is
> > important for performance, I think a regular expression based query term
> > would be very useful, especially when combined with an analyzer that does
> > appropriate term rotation.
>
> Right - I just mentioned the caveat to have the bases covered.  It  
> would be possible to do a PatternQuery("*") that would enumerate  
> every term.  At this point - anyone using such a query would have to  
> do it by the API, just as they would the SpanQuery family - so it  
> would be for power users that hopefully would understand how these  
> queries work.
> And with term rotation, as you say, things get much much better!
>      Erik
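
For readers unfamiliar with the term rotation mentioned above, here is a minimal sketch of one common scheme, assuming each term is indexed under every rotation of itself plus an end marker. This is only an assumption about the general idea, not the actual PatternQuery or surround code; the class RotationSketch, the '$' marker, and all method names are hypothetical. A "*bar" query then becomes a scan for the prefix "bar$" over a sorted dictionary, so only terms sharing that prefix are visited.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch; not the PatternQuery or surround implementation.
class RotationSketch {

    // Assumed end-of-term marker; it must not occur inside any term.
    static final char END = '$';

    // All rotations of term + END, e.g. "ab" gives "ab$", "b$a", "$ab".
    static List<String> rotations(String term) {
        String s = term + END;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(s.substring(i) + s.substring(0, i));
        }
        return out;
    }

    // Recover the original term from any one of its rotations.
    static String unrotate(String rotated) {
        int i = rotated.indexOf(END);
        return rotated.substring(i + 1) + rotated.substring(0, i);
    }

    // Build a sorted dictionary holding every rotation of every term.
    static SortedSet<String> buildRotatedDict(String... terms) {
        SortedSet<String> dict = new TreeSet<>();
        for (String t : terms) {
            dict.addAll(rotations(t));
        }
        return dict;
    }

    // Visit only the dictionary entries that start with the given prefix.
    static List<String> prefixScan(SortedSet<String> dict, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String entry : dict.tailSet(prefix)) {
            if (!entry.startsWith(prefix)) {
                break; // sorted order: past the prefix range, nothing more can match
            }
            hits.add(entry);
        }
        return hits;
    }

    // Answer a "*suffix" query by rotating it into the prefix "suffix$".
    static List<String> endsWith(SortedSet<String> rotatedDict, String suffix) {
        List<String> terms = new ArrayList<>();
        for (String r : prefixScan(rotatedDict, suffix + END)) {
            terms.add(unrotate(r));
        }
        return terms;
    }

    public static void main(String[] args) {
        SortedSet<String> dict = buildRotatedDict("foobar", "bar", "barn", "rebar");
        System.out.println(endsWith(dict, "bar"));
    }
}
```

The cost is a larger term dictionary, since a term of length n is stored n + 1 times. In exchange, a leading-wildcard query no longer has to enumerate every term.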