jackrabbit-users mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: dealing with large result sets
Date Tue, 10 Apr 2012 09:51:15 GMT
On Tue, Apr 10, 2012 at 11:42 AM, Christian Stocker
<christian.stocker@liip.ch> wrote:
>
>
> On 10.04.12 11:32, Ard Schrijvers wrote:
>> On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <mls@pooteeweet.org> wrote:
>>> Hi,
>>>
>>> Currently I see some big issues with queries that return large result sets. A
>>> lot of work is not done inside Lucene, which will probably not be fixed soon
>>> (or maybe never inside 2.x). However, I think it's important to make some
>>> intermediate improvements.
>>>
>>> Here are some suggestions I have. I hope we can brainstorm together on some
>>> ideas that are feasible to get implemented in a shorter time period than
>>> waiting for Oak:
>>>
>>> 1) there should be a way to get a count
>>>
>>> This way, if I need to do a query that needs to be ordered, I can first check
>>> whether the count is too high to determine if I should even bother running
>>> the search. In most cases a search leading to 100+ results means that whoever
>>> did the search needs to narrow it down further.
>>
>> The CPU time is not spent on ordering the results: that is done quite
>> fast in Lucene, unless you have millions of hits
>
> I read the code and also read this
> https://issues.apache.org/jira/browse/JCR-2959 and it looks to me that
> jackrabbit always sorts the result set by itself and not in lucene (or
> maybe additionally). This makes it slow even if you have a limit set,
> because it first sorts all nodes (fetching it from the PM if necessary),
> then does the limit. Maybe I have missed something but real life tests
> showed exactly this behaviour.

Ah, I don't know about that part: we have always stuck to XPath queries.
For XPath at least, I am quite sure that sorting is done in Lucene (more
precisely, in some Lucene extensions in jr, which are equally fast).

>
>>
>> The problem with getting a correct count is authorization: the total
>> search index count itself should be fast (if you avoid some known slow
>> searches). However, authorizing for example 100k+ nodes if they are
>> not in the jackrabbit caches is very expensive.
>>
>> Either way: you get a correct count if you make sure that your (XPath)
>> search includes at least an order by clause. Then, to avoid
>> 100k+ hits, make sure you also set a limit. For example, with a limit of
>> 501 you can show 50 pages of 10 hits, and if the count is 501
>> you state that there are 500+ hits
>
> That's what we do now, but it doesn't help (as said above) if we have
> thousands of results which have to be ordered first.

And is the second sort also slow? The first sort is slow with Lucene too,
as Lucene needs to load all the terms to sort on from the file system into
memory. However, subsequent searches are fast. We don't have problems
sorting result sets of a million hits.
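
The limit-of-501 approach from the quoted text above can be sketched roughly as follows. This is only an illustration: the class and method names are made up, and it assumes the size passed in comes from a query that had an order by clause and a limit of 501 set (in JCR 2.0 presumably via Query.setLimit(501) and the size reported by the result iterator, which can be -1 when the size is unknown).

```java
// Hypothetical helper, not part of Jackrabbit or the JCR API.
final class CappedHitCount {
    /** Number of hits the UI is willing to page through (50 pages of 10). */
    public static final int MAX_SHOWN = 500;

    /** Limit to set on the query: one more than MAX_SHOWN, so overflow is detectable. */
    public static final int QUERY_LIMIT = MAX_SHOWN + 1;

    /** Builds the label shown to the user from the size of the limited result. */
    public static String label(long size) {
        if (size < 0) {
            return "unknown number of hits"; // size not reported
        }
        if (size >= QUERY_LIMIT) {
            return MAX_SHOWN + "+ hits";     // limit was reached, real count is higher
        }
        return size + " hits";               // below the limit, so the count is exact
    }

    /** Number of result pages to offer, never more than MAX_SHOWN hits in total. */
    public static int pages(long size, int pageSize) {
        long shown = Math.min(Math.max(size, 0), MAX_SHOWN);
        return (int) ((shown + pageSize - 1) / pageSize); // round up
    }
}
```

With a page size of 10, a limited result of 501 rows would be shown as "500+ hits" across 50 pages, while anything below the limit is reported exactly.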

>
>>
>> We also wanted to get around this, so we hooked a 'getTotalSize()'
>> into our API, which returns the unauthorized Lucene count
>
> That would help us a lot, since we currently don't use the ACLs of
> Jackrabbit, so the lucene count would be pretty correct for our use case.

Yes, however, you would have to hook into jr itself to get this done

Regards Ard

>
> chregu
>
>>
>>>
>>> I guess the most sensible thing would be to simply offer a way to do
>>> SELECT COUNT(*) FROM ..
>>>
>>> 2) a way to automatically stop long running queries
>>
>> It is not just about 'long'. Some queries easily blow up, and bring
>> your app to an OOM before they can be stopped. For example jcr:like is
>> such a thing. Or range queries on many unique values
>
>
>>
>> Regards Ard
>>
>>>
>>> It would be great if one could define a timeout for queries. If a query takes
>>> longer than X, it should just fail. This should be a global setting, but
>>> ideally it should be possible to override this on a per query basis.
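
As a rough client-side approximation of the timeout idea, a query call can be wrapped in a Future and abandoned after a deadline. This is a minimal sketch using only java.util.concurrent; the class and method names are hypothetical, and nothing here is a Jackrabbit feature. Whether a running Jackrabbit query reacts to thread interruption is not established in this thread, and as noted in the reply quoted above, a timeout does not help with queries that exhaust memory before it fires.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs any blocking call (e.g. executing a query) with a deadline.
final class QueryTimeout {
    public static <T> T runWithTimeout(Callable<T> query, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        // Daemon worker thread, so an unresponsive query cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "query-worker");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(query);
        try {
            // Wait at most the given time for the result.
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupts the worker; this only stops work that checks for interruption.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A per-query override then amounts to passing a different timeout value for that call, with the global default kept in application configuration.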
>>>
>>> 3) .. ?
>>>
>>> regards,
>>> Lukas Kahwe Smith
>>> mls@pooteeweet.org
>>>
>>>
>>>
>>
>>
>>
>
> --
> Liip AG  //  Feldstrasse 133 //  CH-8004 Zurich
> Tel +41 43 500 39 81 // Mobile +41 76 561 88 60
> www.liip.ch // blog.liip.ch // GnuPG 0x0748D5FE
>



-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com
