lucene-java-user mailing list archives

From Michael Sokolov <msoko...@safaribooksonline.com>
Subject Re: Rewrite for RegexpQuery
Date Mon, 11 Mar 2013 21:22:12 GMT
On 03/11/2013 01:22 PM, Michael McCandless wrote:
> On Mon, Mar 11, 2013 at 9:32 AM, Carsten Schnober
> <schnober@ids-mannheim.de>  wrote:
>    
>> On 11.03.2013 13:38, Michael McCandless wrote:
>>      
>>> On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler<uwe@thetaphi.de>  wrote:
>>>
>>>        
>>>> Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE, then this should
>>>> work (after rewrite your query is a BooleanQuery, which supports extractTerms()).
>>>>          
>>> ... as long as you don't exceed the max number of terms allowed by BQ
>>> (1024 by default, but you can raise it).
>>>        
>> True, I've noticed this in the meantime. Are there any recommendations
>> for setting this limit as high as possible while keeping performance
>> reasonable? Of course, this is highly subjective, but what's the
>> magnitude here? Will a limit of 1,024,000 typically increase the query
>> time by a factor of 1,000 too?
>> Carsten
>>      
> I think 1024 may already be too high ;)
>
> But really it depends on your situation: test different limits and see.
>
> How much slower a larger query is depends on the specifics of the terms ...
>    
This doesn't really address the OP's question about selecting terms, but 
I thought it might be interesting...

We've measured how query performance scales as you add terms, since we
tend to generate large lists of query terms when restricting access to
content by user entitlements.  I went back and read a theoretical result
on the scaling here (sorry, lost the link - I think it was in an early
paper by Doug Cutting): it seems there is a log component and a linear
component.  We saw mostly the linear behavior in our tests.  In
practice, considering how much time search gets relative to the other
components of a complete system, I think 1024 is a reasonable limit.
We've basically told our customers that if they want to entitle lists of
more than 1024 items, they should instead group them and sell the
groups.  But of course there is flexibility to go to, say, 2K if we have
to.  Anyway, just confirming that the default seems sensible, but yes,
queries will slow down with more terms.
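
To make the grouping idea concrete, here is a minimal sketch in plain
Java (no Lucene dependency). Everything in it is an assumption made for
illustration: the EntitlementTerms class, the collapse method, the
"item:" / "group:" term prefixes, and the item-to-group map are all
hypothetical names, not our production code and not part of any Lucene
API. It only shows the shape of the policy: use per-item terms while the
list fits under the limit, otherwise fall back to group terms.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Lucene. */
final class EntitlementTerms {

    /** Same value as BooleanQuery's default max clause count. */
    static final int DEFAULT_MAX_TERMS = 1024;

    /**
     * Returns the filter terms for a user's entitlements.
     * If the distinct item terms fit within maxTerms they are used as-is;
     * otherwise each item is replaced by its group (when it has one).
     */
    static List<String> collapse(List<String> itemIds,
                                 Map<String, String> itemToGroup,
                                 int maxTerms) {
        // LinkedHashSet drops duplicates but keeps first-seen order.
        Set<String> terms = new LinkedHashSet<>();
        for (String id : itemIds) {
            terms.add("item:" + id);
        }
        if (terms.size() <= maxTerms) {
            return new ArrayList<>(terms);
        }

        // Too many item terms: swap each item for its group term.
        terms.clear();
        for (String id : itemIds) {
            String group = itemToGroup.get(id);
            terms.add(group != null ? "group:" + group : "item:" + id);
        }
        if (terms.size() > maxTerms) {
            throw new IllegalArgumentException(
                "still " + terms.size() + " terms after grouping; limit is " + maxTerms);
        }
        return new ArrayList<>(terms);
    }
}
```

With a limit of 2, items a, b, c where a and b belong to group g1 and c
to g2 collapse to two group terms; with a limit of 3 or more the three
item terms are returned unchanged.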

-Mike Sokolov


