jackrabbit-users mailing list archives

From Christian Stocker <christian.stoc...@liip.ch>
Subject Re: dealing with large result sets
Date Tue, 10 Apr 2012 09:42:56 GMT


On 10.04.12 11:32, Ard Schrijvers wrote:
> On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <mls@pooteeweet.org> wrote:
>> Hi,
>>
>> Currently I see some big issues with queries that return large result sets. A lot
>> of work is not done inside Lucene, which will probably not be fixed soon (or maybe
>> never inside 2.x). However, I think it's important to make some intermediate improvements.
>>
>> Here are some suggestions I have. I hope we can brainstorm together on some ideas
>> that are feasible to implement in a shorter time frame than waiting for Oak:
>>
>> 1) there should be a way to get a count
>>
>> This way, if I need to do a query that needs to be ordered, I can first check whether
>> the count is too high, to decide if I should even bother running the search. In most
>> cases a search returning 100+ results means that whoever ran the search needs to
>> narrow it down further.
> 
> The CPU time is not spent in ordering the results: that is done quite
> fast in Lucene, unless you have millions of hits.

I read the code and also read
https://issues.apache.org/jira/browse/JCR-2959, and it looks to me like
Jackrabbit always sorts the result set by itself, not in Lucene (or
maybe in addition to it). This makes it slow even when a limit is set,
because it first sorts all nodes (fetching them from the persistence
manager if necessary) and only then applies the limit. Maybe I have
missed something, but real-life tests showed exactly this behaviour.

> 
> The problem with getting a correct count is authorization: the total
> search index count is fast (if you avoid some known slow searches).
> However, authorizing for example 100k+ nodes, if they are not in the
> Jackrabbit caches, is very expensive.
> 
> Either way: you get a correct count if you make sure to include at
> least an order by clause in your (XPath) search. Then, to avoid 100k+
> hits, make sure you also set a limit. For example a limit of 501: you
> can then show 50 pages of 10 hits, and if the count is 501 you state
> that there are at least 500+ hits.

That's what we do now, but it doesn't help (as said above) if we have
thousands of results which have to be ordered first.
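
For reference, a minimal sketch of that limit-based approach using the
standard JCR 2.0 API (the session is assumed to exist; the path and
ordering property are made-up examples):

    import javax.jcr.NodeIterator;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;
    import javax.jcr.query.QueryResult;

    // 'session' is an existing javax.jcr.Session.
    QueryManager qm = session.getWorkspace().getQueryManager();
    Query query = qm.createQuery(
        "/jcr:root/content//element(*, nt:unstructured)"
            + " order by @jcr:created descending",
        Query.XPATH);
    query.setLimit(501);  // 50 pages of 10 hits, plus one to detect "500+"
    query.setOffset(0);   // first page
    QueryResult result = query.execute();
    NodeIterator nodes = result.getNodes();
    long size = nodes.getSize();  // at most 501; 501 means "more than 500 hits"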

> 
> We also wanted to get around this, so in our API we hooked in a
> 'getTotalSize()' which returns the unauthorized Lucene count.

That would help us a lot: since we currently don't use Jackrabbit's
ACLs, the Lucene count would be pretty much correct for our use case.
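
For what it's worth, a sketch of getting at that number from the
outside, assuming a Jackrabbit version whose Lucene query result
implementation exposes getTotalSize() (an internal class, not part of
the JCR API, so this is not portable):

    import javax.jcr.query.QueryResult;
    import org.apache.jackrabbit.core.query.lucene.QueryResultImpl;

    QueryResult result = query.execute();
    if (result instanceof QueryResultImpl) {
        // Raw Lucene hit count, before any access-control filtering.
        long total = ((QueryResultImpl) result).getTotalSize();
        System.out.println("approx. total hits: " + total);
    }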

chregu

> 
>>
>> I guess the most sensible thing would be to simply offer a way to do
>> SELECT COUNT(*) FROM ..
>>
>> 2) a way to automatically stop long running queries
> 
> It is not just about 'long'. Some queries easily blow up and bring
> your app to an OOM before they can be stopped. For example, jcr:like
> is such a thing, or range queries over many unique values.


> 
> Regards Ard
> 
>>
>> It would be great if one could define a timeout for queries. If a query takes longer
>> than X, it should just fail. This should be a global setting, but ideally it should
>> be possible to override it on a per-query basis.
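
A client-side approximation of such a timeout (JCR itself offers no
timeout setting, and cancelling the task does not abort the query
inside the repository, it only unblocks the caller; the 5-second value
is an arbitrary example):

    import java.util.concurrent.*;
    import javax.jcr.query.QueryResult;

    ExecutorService executor = Executors.newSingleThreadExecutor();
    // 'query' is an existing, effectively final javax.jcr.query.Query.
    Future<QueryResult> future = executor.submit(new Callable<QueryResult>() {
        public QueryResult call() throws Exception {
            return query.execute();
        }
    });
    try {
        QueryResult result = future.get(5, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
        future.cancel(true);  // best effort; the repository thread may keep running
    } catch (InterruptedException | ExecutionException e) {
        // handle or rethrow as appropriate for the application
    }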
>>
>> 3) .. ?
>>
>> regards,
>> Lukas Kahwe Smith
>> mls@pooteeweet.org
>>
>>
>>
> 
> 
> 

-- 
Liip AG  //  Feldstrasse 133 //  CH-8004 Zurich
Tel +41 43 500 39 81 // Mobile +41 76 561 88 60
www.liip.ch // blog.liip.ch // GnuPG 0x0748D5FE

