jackrabbit-users mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: dealing with large result sets
Date Tue, 10 Apr 2012 09:32:15 GMT
On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <mls@pooteeweet.org> wrote:
> Hi,
> Currently I see some big issues with queries that return large result sets. A lot of
> the work is not done inside Lucene, which will probably not be fixed soon (or maybe
> never inside 2.x). However, I think it's important to make some intermediate improvements.
> Here are some suggestions I have. I hope we can brainstorm together on some ideas that
> are feasible to implement in a shorter time period than waiting for Oak:
> 1) there should be a way to get a count
> This way, if I need to run a query that needs to be ordered, I can first check whether
> the count is too high to determine if I should even bother running the search. In most
> cases a search returning 100+ results means that whoever ran the search needs to narrow
> it down further.

The CPU time is not spent in ordering the results: that is done quite
fast in Lucene, unless you have millions of hits.

The problem with getting a correct count is authorization: the total
search index count itself is fast (if you avoid some known slow
searches). However, authorizing, say, 100k+ nodes when they are not
in the Jackrabbit caches is very expensive.

Either way, you get a correct count if you make sure that your
(XPath) search includes at least an order by clause. Then, to avoid
100k+ hits, make sure you also set a limit. For example, with a limit
of 501 you can show 50 pages of 10 hits, and if the count comes back
as 501 you state that there are 500+ hits.
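
To make that concrete, here is a minimal sketch of the limit-of-501
approach in plain JCR 2.0. The node type and ordering are just for
illustration, and 'session' is assumed to be a logged-in
javax.jcr.Session:

    import javax.jcr.NodeIterator;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;

    public class PagedSearch {
        private static final int PAGE_SIZE = 10;
        private static final int MAX_PAGES = 50;

        public static void search(Session session) throws Exception {
            QueryManager qm = session.getWorkspace().getQueryManager();
            // The order by clause makes the index compute a correct count,
            // and the limit caps the work at one hit more than we will show.
            Query q = qm.createQuery(
                    "//element(*, nt:unstructured) order by @jcr:score descending",
                    Query.XPATH);
            q.setLimit(PAGE_SIZE * MAX_PAGES + 1); // 501

            NodeIterator nodes = q.execute().getNodes();
            long size = nodes.getSize(); // capped at 501 by the limit
            if (size > PAGE_SIZE * MAX_PAGES) {
                System.out.println("500+ hits, please narrow the search down");
            } else {
                System.out.println(size + " hits");
            }
        }
    }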

We also wanted to get around this, so in our API we hooked in a
'getTotalSize()' that returns the unauthorized Lucene count.
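
That hook is part of our own API, but the same idea can be sketched
against Jackrabbit directly. This assumes a Jackrabbit version whose
query results implement org.apache.jackrabbit.api.query.JackrabbitQueryResult
(a Jackrabbit extension, not part of the JCR spec):

    import javax.jcr.query.QueryResult;
    import org.apache.jackrabbit.api.query.JackrabbitQueryResult;

    public final class TotalSize {
        private TotalSize() {
        }

        // Returns the raw, unauthorized Lucene hit count, or -1 if the
        // result implementation does not expose it.
        public static int getTotalSize(QueryResult result) {
            if (result instanceof JackrabbitQueryResult) {
                return ((JackrabbitQueryResult) result).getTotalSize();
            }
            return -1;
        }
    }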

> I guess the most sensible thing would be to simply offer a way to do SELECT COUNT(*)
> 2) a way to automatically stop long running queries

It is not just about 'long'. Some queries easily blow up and bring
your app to an OOM before they can be stopped. jcr:like is such a
thing, for example, as are range queries over many unique values.
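
To illustrate, two hypothetical XPath queries of this shape (the
property names are made up): the jcr:like with a leading wildcard
expands into a very large Lucene query, and the date range has to
enumerate every unique value inside the range:

    //element(*, nt:unstructured)[jcr:like(@title, '%foo%')]

    //element(*, mix:lastModified)[@jcr:lastModified > xs:dateTime('2011-01-01T00:00:00.000Z')]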

Regards, Ard

> It would be great if one could define a timeout for queries. If a query takes longer
> than X, it should just fail. This should be a global setting, but ideally it should be
> possible to override it on a per-query basis.
> 3) .. ?
> regards,
> Lukas Kahwe Smith
> mls@pooteeweet.org

Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
