lucene-dev mailing list archives

From Erik Hatcher
Subject Re: regex-based query contribution
Date Thu, 13 Oct 2005 12:36:03 GMT
On Oct 13, 2005, at 7:36 AM, Mikko Noromaa wrote:
> Hi,
>> It would be possible to do a PatternQuery("*") that would
>> enumerate every term.
> Does this work differently than the current logic where wildcard queries are
> constructed as BooleanQueries with many terms OR'ed together? I think this
> would be a good change.

No - it works identically to WildcardQuery, with the only difference  
being how it matches.  The added bonus though is that there is a  
SpanPatternQuery to go along with this, allowing for "foo bar*"  
phrase queries.
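
To illustrate "identical except for how it matches", here is a minimal
sketch in plain Java. It is not Lucene code: TermExpansionSketch, expand,
regex, prefix and the sample terms are all hypothetical names made up for
this example. The expansion loop is shared, and only the matcher changes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical stand-in for term expansion; this is not Lucene's API.
class TermExpansionSketch {

    // Walk every term in the sorted term dictionary and keep the ones the
    // matcher accepts. Each kept term would become one OR'ed clause.
    static List<String> expand(SortedSet<String> terms, Predicate<String> matcher) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms) {
            if (matcher.test(term)) {
                clauses.add(term);
            }
        }
        return clauses;
    }

    // Regex flavour: the whole term must match the pattern.
    static Predicate<String> regex(String pattern) {
        Pattern compiled = Pattern.compile(pattern);
        return term -> compiled.matcher(term).matches();
    }

    // Suffix-wildcard flavour ("bar*"): a plain prefix test.
    static Predicate<String> prefix(String prefix) {
        return term -> term.startsWith(prefix);
    }

    // Small made-up term dictionary for the example.
    static SortedSet<String> sampleTerms() {
        return new TreeSet<>(Arrays.asList("bar", "barn", "bars", "baz", "foo"));
    }
}
```

With these sample terms, expand(sampleTerms(), regex("bar.*")) and
expand(sampleTerms(), prefix("bar")) select the same three terms, and
regex(".*") selects every term.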

> I have always thought that it is quite cumbersome to expand wildcards to
> many boolean clauses. I think that keeping the wildcard (or regex in this
> case) in the query object would be much better. On the other hand, it might
> not make any difference in performance, since Lucene would still have to go
> through all the terms. But at least it would avoid the
> BooleanQuery$TooManyClauses exception even with thousands of different
> terms. Right?

At this point that exception is still possible, so the maximum number of
clauses has to be increased to avoid it.

> I know I can increase the limit of the boolean queries, but there is still a
> limit. In my application, I index Finnish text which has lots of different
> suffixes for the same word. With compound words included, I could easily
> imagine that the same base word may have hundreds or thousands of terms in
> the index.

Hundreds is still under the built-in limit of 1024 clauses for
BooleanQuery.  Thousands is doable by increasing the limit and
having sufficient RAM.
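
As a rough simulation of the clause limit (plain Java, not Lucene code;
ClauseLimitSketch, TooManyClausesException, expand and syntheticForms are
hypothetical names, and the "talo0000"-style terms are made-up stand-ins
for inflected forms of one base word):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical simulation of a clause limit; names are illustrative only.
class ClauseLimitSketch {

    // The default limit mentioned in this thread.
    static final int DEFAULT_MAX_CLAUSES = 1024;

    // Stand-in for the exception raised when expansion overflows the limit.
    static class TooManyClausesException extends RuntimeException {
        TooManyClausesException(int max) {
            super("more than " + max + " clauses");
        }
    }

    // Collect matching terms as clauses, failing once the limit is exceeded.
    static List<String> expand(List<String> terms, Predicate<String> matcher, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms) {
            if (!matcher.test(term)) {
                continue;
            }
            if (clauses.size() >= maxClauses) {
                throw new TooManyClausesException(maxClauses);
            }
            clauses.add(term);
        }
        return clauses;
    }

    // Build n synthetic forms of one base word, e.g. "talo0000", "talo0001".
    static List<String> syntheticForms(String base, int n) {
        List<String> forms = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            forms.add(base + String.format("%04d", i));
        }
        return forms;
    }
}
```

In this sketch, a few hundred matching forms fit under DEFAULT_MAX_CLAUSES,
while 3000 forms overflow it and only expand once a larger maxClauses is
passed in. The memory cost of holding every clause grows with that number.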

For suffix-wildcards, there really is no difference between my
PatternQuery and WildcardQuery.  WildcardQuery may even be faster if
its matching is quicker than regex, though tests would be needed to
confirm that; I'd guess the performance difference isn't all that
much.
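
A small sketch of the two matching styles side by side, in plain Java.
None of this is Lucene code: WildcardVsRegexSketch, wildcardMatches and
toRegex are hypothetical names, and the hand-written matcher is only an
assumption about what a simple wildcard matcher might look like.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of wildcard and regex matching; not Lucene's API.
class WildcardVsRegexSketch {

    // Minimal wildcard matcher: '*' matches any run of characters
    // (including none), '?' matches exactly one character.
    static boolean wildcardMatches(String pattern, String text) {
        int p = 0, t = 0, star = -1, mark = 0;
        while (t < text.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                star = p++;          // remember the star and where it started
                mark = t;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == text.charAt(t))) {
                p++;
                t++;
            } else if (star >= 0) {
                p = star + 1;        // backtrack: let the star absorb one more char
                t = ++mark;
            } else {
                return false;
            }
        }
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;                     // trailing stars match the empty string
        }
        return p == pattern.length();
    }

    // Translate the same wildcard syntax into an equivalent regex.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }
}
```

For a suffix-wildcard such as "talo*", wildcardMatches("talo*", term) and
toRegex("talo*").matcher(term).matches() accept the same terms, so any
difference between the two would be in speed rather than in results.
Timing them on a real index is the only way to know how large it is.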

