lucene-dev mailing list archives

From Jan Høydahl / Cominvent <jan....@cominvent.com>
Subject Re: [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets
Date Thu, 11 Nov 2010 20:22:37 GMT
The problem with a large "start" is probably worse when sharding is involved. Does anyone
know how the shard component goes about fetching start=1000000&rows=10 from, say, 10 shards?
Does it have to merge sorted lists of 1,000,010 (start+rows) doc ids from each shard, which
would be the worst case?
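
For illustration, here is a minimal sketch of that worst case -- hypothetical classes, not
Solr's actual distributed-search code. The coordinator k-way-merges the score-sorted
per-shard lists and throws away the first start merged hits:

    import java.util.*;

    public class DeepPagingMerge {

        // Hypothetical shard hit: score plus (shard, doc) identifiers.
        static final class Hit {
            final float score; final int shard; final int doc;
            Hit(float score, int shard, int doc) {
                this.score = score; this.shard = shard; this.doc = doc;
            }
        }

        // Merge k score-descending shard lists; return hits [start, start+rows).
        static List<Hit> merge(List<List<Hit>> shards, int start, int rows) {
            // Heap entries are {shardIndex, offsetInShard}, best score on top.
            PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) ->
                    Float.compare(shards.get(b[0]).get(b[1]).score,
                                  shards.get(a[0]).get(a[1]).score));
            for (int s = 0; s < shards.size(); s++)
                if (!shards.get(s).isEmpty()) pq.add(new int[] {s, 0});

            List<Hit> page = new ArrayList<>(rows);
            // This pops start+rows times, so with start=1000000 and 10 shards
            // each shard may have had to ship up to 1,000,010 hits to the merger.
            for (int rank = 0; rank < start + rows && !pq.isEmpty(); rank++) {
                int[] top = pq.poll();
                if (rank >= start) page.add(shards.get(top[0]).get(top[1]));
                if (top[1] + 1 < shards.get(top[0]).size())
                    pq.add(new int[] {top[0], top[1] + 1});
            }
            return page;
        }
    }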

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 10 Nov 2010, at 20:22, Hoss Man (JIRA) wrote:

> 
> [ https://issues.apache.org/jira/browse/SOLR-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930723#action_12930723 ]
> 
> Hoss Man commented on SOLR-2218:
> --------------------------------
> 
> The performance gets slower as the start increases because, in order to give you rows
> N...M sorted by score, Solr must collect the top M documents (in sorted order). Lance's
> point is that if you use "sort=_docid_+asc", this collection of top-ranking documents in
> sorted order doesn't have to happen.
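
A minimal sketch of that collection step, assuming a bounded min-heap (hypothetical class,
not Lucene's actual TopScoreDocCollector code):

    import java.util.PriorityQueue;

    // To return rows N...M sorted by score, keep a min-heap of the top
    // M = start + rows docs seen so far; the heap root is the worst kept doc.
    final class TopMCollector {
        private final int capacity;                 // start + rows
        private final PriorityQueue<float[]> heap;  // entries: {score, docId}

        TopMCollector(int start, int rows) {
            this.capacity = start + rows;
            this.heap = new PriorityQueue<>((a, b) -> Float.compare(a[0], b[0]));
        }

        void collect(int docId, float score) {
            if (heap.size() < capacity) {
                heap.add(new float[] {score, docId});
            } else if (score > heap.peek()[0]) {    // beats the worst kept doc
                heap.poll();
                heap.add(new float[] {score, docId});
            }
            // With sort=_docid_+asc none of this is needed: matches arrive in
            // index order, so the searcher can skip the first start hits and
            // stop after rows more, with no score comparisons at all.
        }
    }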
> 
> If you have to use sorting, keep in mind that the decrease in performance as the "start"
> param increases without bound is primarily driven by the number of documents that have to
> be collected/compared on the sort field -- something that wouldn't change if you had a
> named cursor (you would just be paying that cost up front instead of per request).
> 
> You should be able to get equivalent functionality by reducing the number of collected
> documents -- instead of increasing the start param, add a filter on the sort field
> indicating that you only want documents with a field value higher (or lower, if using
> "desc" sort) than the last document encountered so far. (If you are sorting on score this
> becomes trickier, but it should be possible using the "frange" parser with the "query"
> function.)
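
Concretely -- the field name and values here are invented for illustration -- walking pages
by filtering on the sort field instead of growing start could look like:

    q=foo&sort=price+asc&rows=100&start=0
    q=foo&sort=price+asc&rows=100&start=0&fq=price:{42.17 TO *}

where 42.17 is the price of the last document on the previous page, so start never grows.
For a score sort, the same idea should work with something along the lines of
fq={!frange u=0.8532 incu=false}query($q), with 0.8532 being the lowest score returned on
the previous page.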
> 
>> Performance of start= and rows= parameters are exponentially slow with large data sets
>> --------------------------------------------------------------------------------------
>> 
>>                Key: SOLR-2218
>>                URL: https://issues.apache.org/jira/browse/SOLR-2218
>>            Project: Solr
>>         Issue Type: Improvement
>>         Components: Build
>>   Affects Versions: 1.4.1
>>           Reporter: Bill Bell
>> 
>> With large data sets (> 10M rows), setting start=<large number> and rows=<large number>
>> is slow, and gets slower the farther you get from start=0 with a complex query. Random
>> sorting also makes this slower.
>> We would like to somehow make this faster for looping through large data sets. It would
>> be nice if we could pass a pointer to the result set to loop over, or support a very
>> large rows=<number>.
>> Something like:
>> rows=1000
>> start=0
>> spointer=string_my_query_1
>> Then, within some interval (like 5 mins), I can reference this loop:
>> Something like:
>> rows=1000
>> start=1000
>> spointer=string_my_query_1
>> What do you think? Since the data set is so large, the cache is not helping.
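
For what it's worth, client usage of the proposed parameter might look like the loop below
(spointer does not exist in Solr today; totalHits and fetchPage are invented helpers):

    // Hypothetical client loop for the proposed spointer parameter.
    class SpointerLoopSketch {
        static void pageThrough(long totalHits) {
            String pointer = "string_my_query_1";   // names the cached result set
            for (long start = 0; start < totalHits; start += 1000) {
                fetchPage("/select?q=foo&rows=1000&start=" + start
                        + "&spointer=" + pointer);  // server resumes its cursor
            }
        }
        static void fetchPage(String url) { /* issue HTTP GET, process docs */ }
    }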
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 
> 

