lucene-solr-user mailing list archives

From: Shawn Heisey <apa...@elyograg.org>
Subject: Re: Why are cursor mark queries recommended over regular start, rows combination?
Date: Tue, 13 Mar 2018 01:59:41 GMT

On 3/12/2018 6:18 PM, S G wrote:
> We have use-cases where some queries will return about 100k to 500k records.
> As per https://lucene.apache.org/solr/guide/7_2/pagination-of-results.html,
> it seems that using start=x, rows=y is a bad combination performance wise.
>
> 1) However, it is not clear to me why the alternative: "cursor-query" is
> cheaper or recommended. It would have to run the same kind of workload as
> the normal start=x, rows=y combination, no?

No.  Through the use of cleverly designed filters, cursorMark is able to
dramatically reduce the amount of information that Solr has to sift
through when paging deeply into results.  Because of the way it works,
cursorMark does not offer any way to jump directly to page 25000 -- you
have to get the previous 24999 pages first.  But the retrieval time of
every one of those pages is going to be about the same as page 1.
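
To make the paging loop concrete, here is a minimal client-side sketch
in Python.  The cursorMark request parameter, the "*" starting value,
and the nextCursorMark response field are part of Solr's cursor API.
Everything else is an assumption for illustration: iterate_with_cursor
and make_fake_fetch are hypothetical names, the fetch callable stands
in for whatever HTTP client you use, and the in-memory stand-in uses
the last id as its cursor, whereas real Solr cursor marks are opaque
strings.

```python
from typing import Callable, Dict, Iterator, List


def iterate_with_cursor(fetch: Callable[[Dict[str, str]], dict],
                        query: str = "*:*",
                        sort: str = "id asc",
                        rows: int = 100) -> Iterator[dict]:
    """Yield every matching document by following cursorMark values.

    The sort must include the uniqueKey field (assumed to be "id" here)
    so that the ordering is total and the cursor is unambiguous.
    """
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        result = fetch({"q": query, "sort": sort,
                        "rows": str(rows), "cursorMark": cursor})
        yield from result["response"]["docs"]
        next_cursor = result["nextCursorMark"]
        if next_cursor == cursor:
            break  # an unchanged cursor means there is nothing left
        cursor = next_cursor


def make_fake_fetch(ids: List[str]) -> Callable[[Dict[str, str]], dict]:
    """Hypothetical in-memory stand-in for a Solr request (not Solr code)."""
    ordered = sorted(ids)

    def fetch(params: Dict[str, str]) -> dict:
        cursor = params["cursorMark"]
        rows = int(params["rows"])
        # Toy cursor: the last id already returned.  Real marks are opaque.
        remaining = ordered if cursor == "*" else [i for i in ordered if i > cursor]
        page = remaining[:rows]
        return {"response": {"docs": [{"id": i} for i in page]},
                "nextCursorMark": page[-1] if page else cursor}

    return fetch
```

Note that the loop can only move forward one page at a time, which is
the "no jumping to page 25000" limitation described above.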

If you use start/rows, the retrieval time of every subsequent page is
going to increase, and by the time the page numbers start getting big,
the response time for every page is going to be VERY large.
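
The difference can be shown with a toy model in Python.  This is not
Solr's code, only a sketch of the idea: with start/rows the top
start+rows entries have to be ranked before the first "start" of them
can be discarded, while a cursor acts like a filter on the sort key,
so only "rows" entries need to be ranked.  The function names and the
use of plain integers as sort keys are assumptions for illustration.

```python
import heapq
from typing import List, Optional, Sequence, Tuple


def page_with_start_rows(keys: Sequence[int], start: int,
                         rows: int) -> Tuple[List[int], int]:
    """Return (page, number of entries that had to be ranked)."""
    # The first `start` entries are ranked only to be thrown away.
    top = heapq.nsmallest(start + rows, keys)
    return top[start:], len(top)


def page_with_cursor(keys: Sequence[int], last_seen: Optional[int],
                     rows: int) -> Tuple[List[int], int]:
    """Return (page, number of entries that had to be ranked)."""
    # Entries at or before the cursor position are filtered out up front.
    candidates = keys if last_seen is None else (k for k in keys if k > last_seen)
    top = heapq.nsmallest(rows, candidates)
    return top, len(top)
```

With 1000 distinct keys, asking for 10 rows at start=900 ranks 910
entries, while asking for the 10 rows after key 899 ranks only 10.
Both calls return the same page.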

Hoss, who created cursorMark, explains it all pretty well in this article:

https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

> 2) Also, it is not clear if the cursor-query runs on a single shard or
> uses the same scatter gather as regular queries to read from all the shards?

The cursorMark feature works on sharded indexes.  In fact, that's where
it offers the best performance improvement over start/rows.
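
A rough back-of-the-envelope calculation in Python shows why.  It
assumes the usual scatter-gather model in which every shard sends its
own top candidates to the coordinating node for merging.  The numbers
are upper bounds (a shard with fewer matches sends fewer), and the
function names are hypothetical.

```python
def merged_with_start_rows(num_shards: int, start: int, rows: int) -> int:
    """Upper bound on candidates the coordinator merges for one page."""
    # No shard knows which of its documents fall before `start` in the
    # global ordering, so each one must offer its top start + rows.
    return num_shards * (start + rows)


def merged_with_cursor(num_shards: int, rows: int) -> int:
    """Upper bound on candidates the coordinator merges for one page."""
    # The cursor already excludes everything before the current position,
    # so each shard only needs to offer its next `rows` documents.
    return num_shards * rows
```

For 10 shards and rows=100, a page at start=500000 means merging up to
5,001,000 candidates, while a cursor page at the same depth means
merging up to 1,000.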

> 3) Lastly, it is not clear the role of export handler. It seems that the
> export handler would also have to do exactly the same kind of thing as
> start=0 and rows=1,000,000. And that again means bad performance.

The standard search handlers must gather all of the information
(documents, etc) in the response into memory all at once, then send that
information to the entity that made the request.  This is why the rows
parameter defaults to 10.  By limiting the amount of information in a
response, that response is sent faster and consumes less memory.

The export handler works differently.  I haven't researched this, but I
*THINK* what it does is gather documents matching the query and sort
parameters a little bit at a time, write that response information out
to the HTTP/TCP socket, and then throw the source data away.  By
repeating this cycle many times, it can send millions of results without
consuming huge amounts of memory.  The HTTP standard supports this kind
of open-ended transfer of data (chunked transfer encoding in HTTP/1.1).
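
The gather/write/discard cycle described above can be modeled in a few
lines of Python.  This is only a sketch of the general streaming
pattern, not the export handler's actual implementation, which I have
not checked.  The function name, the batch size, and the
one-JSON-object-per-line output format are all assumptions made for
illustration.

```python
import json
from itertools import islice
from typing import Iterable, TextIO


def stream_in_batches(docs: Iterable[dict], out: TextIO,
                      batch_size: int = 1000) -> int:
    """Write docs to `out` a batch at a time; return how many were written.

    Only one batch is held in memory at any moment, so the peak memory
    use depends on batch_size rather than on the total result count.
    """
    source = iter(docs)
    written = 0
    while True:
        batch = list(islice(source, batch_size))  # gather a little bit
        if not batch:
            break
        for doc in batch:
            out.write(json.dumps(doc) + "\n")     # write it out
        written += len(batch)
        # Rebinding `batch` on the next pass discards the previous one.
    return written
```

Passing a generator as docs and a socket-backed file object as out
would keep memory use flat no matter how many results are sent.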

Thanks,
Shawn

