lucene-dev mailing list archives

From "Hoss Man (JIRA)" <>
Subject [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets
Date Wed, 10 Nov 2010 19:22:14 GMT


Hoss Man commented on SOLR-2218:

The performance gets slower as start increases because, in order to give you rows N...M
sorted by score, Solr must collect the top M documents (in sorted order). Lance's point
is that if you use "sort=_docid_+asc", this collection of top-ranking documents in sorted
order doesn't have to happen.
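
For illustration, a paging loop using that "_docid_" sort might look like the sketch
below. This is not from the issue itself: the Solr URL, the collection handler, and the
use of Python are all assumptions made for the example.

    import json
    import urllib.request

    SOLR = "http://localhost:8983/solr/select"  # assumed local Solr URL
    ROWS = 1000

    start = 0
    while True:
        # Sorting by the internal Lucene docid returns hits in index order,
        # so Solr doesn't have to maintain a priority queue of the top
        # start+rows documents for each request.
        url = f"{SOLR}?q=*:*&sort=_docid_+asc&start={start}&rows={ROWS}&wt=json"
        with urllib.request.urlopen(url) as resp:
            docs = json.load(resp)["response"]["docs"]
        if not docs:
            break
        # ... process this page of documents ...
        start += ROWS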

If you have to use sorting, keep in mind that the decrease in performance as the "start" param
increases without bound is primarily driven by the number of documents that have to be
collected/compared on the sort field -- something that wouldn't change if you had a named
cursor (you would just be paying that cost up front instead of per request).

You should be able to get equivalent functionality by reducing the number of collected
documents -- instead of increasing the start param, add a filter on the sort field indicating
that you only want documents with a field value higher (or lower, if using "desc" sort) than
the last document encountered so far.  (If you are sorting on score this becomes trickier, but
it should be possible using the "frange" parser with the "query" function.)
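
Concretely, that filter-based approach might look like the sketch below. It is an
illustration only: the Solr URL is an assumption, "id" is a stand-in for whatever
indexed sort field you use, and it presumes the field's values are unique (with
duplicates, documents sharing a boundary value could be skipped).

    import json
    import urllib.parse
    import urllib.request

    SOLR = "http://localhost:8983/solr/select"  # assumed local Solr URL
    SORT_FIELD = "id"                           # assumed unique, indexed field
    ROWS = 1000

    last_val = None
    while True:
        params = {
            "q": "*:*",
            "sort": f"{SORT_FIELD} asc",
            "start": 0,   # always 0: the filter query advances the page instead
            "rows": ROWS,
            "wt": "json",
        }
        if last_val is not None:
            # Exclusive lower-bound range: only collect documents after the
            # last one seen, so no more than ROWS docs are compared per request.
            params["fq"] = "%s:{%s TO *}" % (SORT_FIELD, last_val)
        url = SOLR + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            docs = json.load(resp)["response"]["docs"]
        if not docs:
            break
        last_val = docs[-1][SORT_FIELD]
        # ... process this page of documents ...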

> Performance of start= and rows= parameters are exponentially slow with large data sets
> --------------------------------------------------------------------------------------
>                 Key: SOLR-2218
>                 URL:
>             Project: Solr
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 1.4.1
>            Reporter: Bill Bell
> With large data sets (> 10M rows), setting start=<large number> and rows=<large number>
> is slow, and gets slower the farther you get from start=0 with a complex query. Random
> sorting also makes this slower.
> Would like to somehow make this performance faster for looping through large data sets.
> It would be nice if we could pass a pointer to the result set to loop over, or support
> very large values for start= and rows=.
> Something like:
> rows=1000
> start=0
> spointer=string_my_query_1
> Then, within an interval (like 5 mins), I can reference this loop:
> Something like:
> rows=1000
> start=1000
> spointer=string_my_query_1
> What do you think? Since the data set is so large, the cache is not helping.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
