jackrabbit-dev mailing list archives

From Christian Stocker <christian.stoc...@liip.ch>
Subject Fwd: Re: dealing with large result sets
Date Thu, 12 Apr 2012 06:24:20 GMT
Hi

Posting this to the dev list for further discussion.

The problem mentioned below has hit us hard several times lately: with
large result sets (with SQL2), Jackrabbit gets very slow.

I know it's on your radar and some work has been done already, e.g. in
https://issues.apache.org/jira/browse/JCR-2959
But there's still an issue with the use cases below.

I looked at the Jackrabbit code and there's obviously no easy fix
(otherwise there would already be a solution), but since it was fast with
the old (XPath/SQL1) queries (at least as long as there are no JOINs and
the like), I guess it would be technically possible somehow.

I also tried on our side to convert some trivial SQL2 queries into XPath
(via our PHP-based QOM), but that got pretty messy pretty fast ;) (by
hand it's of course easy, but I'd need something automated for our app)

All I basically need to know is: are there any plans to fix the issue of
slow (but "easy") queries, or do we have to try to solve the problem on
the client side somehow by converting our SQL2 queries to XPath for
some special cases?

And the last question: does the old SQL query "engine" use the same one
as XPath, meaning it has the same performance characteristics? Because if
we have to solve the problem on our client side for the time being, it
may be easier to copy and adjust our QOM2SQL class to output "old" SQL
rather than XPath (of course with fewer features than SQL2).

Thanks already and greetings

chregu

-------- Original Message --------
Subject: Re: dealing with large result sets
Date: Wed, 11 Apr 2012 09:46:36 +0200
From: Christian Stocker <christian.stocker@liip.ch>
Reply-To: users@jackrabbit.apache.org
Organization: Liip AG
To: users@jackrabbit.apache.org
CC: Ard Schrijvers <a.schrijvers@onehippo.com>

Ok, that made me wonder, so I did some short tests on my MacBook; there
are approx. 600'000 nodes matching those queries.

With XPath, without ordering:


<d:searchrequest xmlns:d="DAV:"
xmlns:dcr="http://www.day.com/jcr/webdav/1.0">
<dcr:xpath>

/jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article']

</dcr:xpath>
<d:limit>
     <d:nresults>10</d:nresults>
</d:limit>
</d:searchrequest>


1st run: 455 ms
2nd run:  42 ms

With XPath, with order by:

<d:searchrequest xmlns:d="DAV:"
xmlns:dcr="http://www.day.com/jcr/webdav/1.0">
<dcr:xpath>

/jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article']
order by @firstImportDate

</dcr:xpath>
<d:limit>
     <d:nresults>10</d:nresults>
</d:limit>
</d:searchrequest>


1st run: 2555 ms
2nd run:   16 ms


Those numbers seem to be reasonable.

With SQL2, without ordering:

<D:searchrequest xmlns:D="DAV:">
	<JCR-SQL2>
		<![CDATA[
		SELECT data.* FROM [nt:base] AS data WHERE data.[phpcr:class] =
'Own\ApiBundle\Document\Article'  AND  ISDESCENDANTNODE(data, '/article')
		]]>
	</JCR-SQL2>
	<D:limit>
		<D:nresults>10</D:nresults>
	</D:limit>
</D:searchrequest>

1st run: 2'006'634 ms (about 33 minutes)

From the log:

 SQL2 SELECT took 2004498 ms. selector: [nt:base] AS data, columns:
[data.jcr:primaryType], constraint: (data.[phpcr:class] =
'Own\ApiBundle\Document\Article') AND (ISDESCENDANTNODE(data,
[/article])), offset 0, limit 10
 SQL2 SORT took 1479 ms.
 SQL2 QUERY execute took 2006634 ms. native sort is false.


With those results, I didn't even try a 2nd time (caches are full
anyway) or with ordering.

Something seems to be quite wrong here. If you want more measurements,
just tell me.

Greetings

chregu





On 10.04.12 11:55, Christian Stocker wrote:
> 
> 
> On 10.04.12 11:51, Ard Schrijvers wrote:
>> On Tue, Apr 10, 2012 at 11:42 AM, Christian Stocker
>> <christian.stocker@liip.ch> wrote:
>>>
>>>
>>> On 10.04.12 11:32, Ard Schrijvers wrote:
>>>> On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <mls@pooteeweet.org>
wrote:
>>>>> Hi,
>>>>>
>>>>> Currently I see some big issues with queries that return large result
sets. A lot of work is not done inside Lucene, which will probably not be fixed soon (or maybe
never inside 2.x). However I think it's important to do some intermediate improvements.
>>>>>
>>>>> Here are some suggestions I have. I hope we can brainstorm together on
some ideas that are feasible to get implemented in a shorter time period than waiting for
Oak:
>>>>>
>>>>> 1) there should be a way to get a count
>>>>>
>>>>> This way if I need to do a query that needs to be ordered, I can first
check if the count is too high to determine if I should even bother running the search. Aka
in most cases a search leading to 100+ results means that whoever did the search needs to
further narrow it down.
>>>>
>>>> The CPU is not spent in ordering the results: That is done quite fast
>>>> in Lucene, unless you have millions of hits
>>>
>>> I read the code and also read this
>>> https://issues.apache.org/jira/browse/JCR-2959 and it looks to me that
>>> jackrabbit always sorts the result set by itself and not in lucene (or
>>> maybe additionally). This makes it slow even if you have a limit set,
>>> because it first sorts all nodes (fetching it from the PM if necessary),
>>> then does the limit. Maybe I have missed something but real life tests
>>> showed exactly this behaviour.
>>
>> Ah, I don't know about that part: We always stuck to xpath queries.
>> Sorting is done in Lucene (more precisely, in some Lucene extensions in
>> jr, but they are equally fast), at least for xpath, I am quite sure
> 
> Is the search part done differently in SQL2 and XPath? Can't remember ;)
> 
>>>> The problem with getting a correct count is authorization : This total
>>>> search index count is fast (if you try to avoid some known slow
>>>> searches). However, authorizing for example 100k+ nodes if they are
>>>> not in the jackrabbit caches is very expensive.
>>>>
>>>> Either way: You get a correct count if you make sure that you include
>>>> in your (xpath) search at least an order by clause. Then, to avoid
>>>> 100k + hits, make sure you also set a limit. For example a limit of
>>>> 501 : You can then show 50 pages of 10 hits, and if the count is 501
>>>> you state that there are at least 500+ hits
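The "limit 501" trick quoted here boils down to capping the displayed
count; a minimal sketch, where displayCount() is a hypothetical helper
(assuming the query was issued with a limit of pageSize * maxPages + 1):

```java
// Sketch of the "limit 501" trick: query with limit = pageSize * maxPages + 1;
// if the returned size exceeds the cap, show "500+" instead of an exact
// number. displayCount() is a hypothetical helper, not Jackrabbit API.
public class HitCount {

    static String displayCount(long resultSize, int pageSize, int maxPages) {
        long cap = (long) pageSize * maxPages; // e.g. 10 * 50 = 500
        return resultSize > cap ? cap + "+" : String.valueOf(resultSize);
    }

    public static void main(String[] args) {
        System.out.println(displayCount(501, 10, 50)); // prints: 500+
        System.out.println(displayCount(237, 10, 50)); // prints: 237
    }
}
```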
>>>
>>> That's what we do now, but it doesn't help (as said above) if we have
>>> thousands of results which have to be ordered first.
>>
>> And is the second sort also slow? The first sort is also slow with
>> Lucene, as Lucene needs to load all terms to sort on from the FS into
>> memory. However, consecutive searches are fast. We don't have problems
>> sorting result sets of a million hits
> 
> It definitely loaded all nodes from the PM before sorting them. The
> lucene part itself was fast enough, that wasn't the issue.
> 
>>
>>>
>>>>
>>>> We also wanted to get around this, so in our API we hooked in a
>>>> 'getTotalSize()' which returns the unauthorized Lucene count
>>>
>>> That would help us a lot, since we currently don't use the ACLs of
>>> Jackrabbit, so the Lucene count would be pretty much correct for our use case.
>>
>> Yes, however, you would have to hook into jr itself to get this done
> 
> Yep, saw that, that's somewhere deep in the code. That's why I didn't
> try to address that yet
> 
> chregu
> 
>>
>> Regards Ard
>>
>>>
>>> chregu
>>>
>>>>
>>>>>
>>>>> I guess the most sensible thing would be to simply offer a way to do
SELECT COUNT(*) FROM ..
>>>>>
>>>>> 2) a way to automatically stop long running queries
>>>>
>>>> It is not just about 'long'. Some queries easily blow up and bring
>>>> your app to an OOM before they can be stopped. For example jcr:like is
>>>> such a thing. Or range queries on many unique values
>>>
>>>
>>>>
>>>> Regards Ard
>>>>
>>>>>
>>>>> It would be great if one could define a timeout for queries. If a query
takes longer than X, it should just fail. This should be a global setting, but ideally it
should be possible to override this on a per-query basis.
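As far as I know, Jackrabbit 2.x has no built-in query timeout, but a
caller-side timeout can be sketched with a plain executor; note this only
bounds how long the caller waits, the query itself keeps running inside
the repository (callWithTimeout() is a hypothetical helper):

```java
import java.util.concurrent.*;

// Caller-side timeout sketch: run a task (e.g. query.execute()) on a worker
// thread and give up waiting after the given number of milliseconds. This
// does NOT stop the query inside the repository; it only unblocks the caller.
public class QueryTimeout {

    static <T> T callWithTimeout(Callable<T> task, long millis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // get() throws TimeoutException if the task is not done in time
            return pool.submit(task).get(millis, TimeUnit.MILLISECONDS);
        } finally {
            pool.shutdownNow(); // interrupt the worker thread
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast task completes normally...
        System.out.println(callWithTimeout(() -> "ok", 1000)); // prints: ok
        // ...while a slow one makes the caller fail with TimeoutException.
        try {
            callWithTimeout(() -> { Thread.sleep(5000); return "never"; }, 100);
        } catch (TimeoutException e) {
            System.out.println("query timed out"); // prints: query timed out
        }
    }
}
```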
>>>>>
>>>>> 3) .. ?
>>>>>
>>>>> regards,
>>>>> Lukas Kahwe Smith
>>>>> mls@pooteeweet.org
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Liip AG  //  Feldstrasse 133 //  CH-8004 Zurich
>>> Tel +41 43 500 39 81 // Mobile +41 76 561 88 60
>>> www.liip.ch // blog.liip.ch // GnuPG 0x0748D5FE
>>>
>>
>>
>>
> 
