From users-return-18760-apmail-jackrabbit-users-archive=jackrabbit.apache.org@jackrabbit.apache.org Wed Apr 11 07:47:14 2012 Return-Path: X-Original-To: apmail-jackrabbit-users-archive@minotaur.apache.org Delivered-To: apmail-jackrabbit-users-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id A1EFF9CF4 for ; Wed, 11 Apr 2012 07:47:14 +0000 (UTC) Received: (qmail 56309 invoked by uid 500); 11 Apr 2012 07:47:14 -0000 Delivered-To: apmail-jackrabbit-users-archive@jackrabbit.apache.org Received: (qmail 56008 invoked by uid 500); 11 Apr 2012 07:47:09 -0000 Mailing-List: contact users-help@jackrabbit.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: users@jackrabbit.apache.org Delivered-To: mailing list users@jackrabbit.apache.org Received: (qmail 55970 invoked by uid 99); 11 Apr 2012 07:47:08 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 11 Apr 2012 07:47:08 +0000 X-ASF-Spam-Status: No, hits=-2.3 required=5.0 tests=RCVD_IN_DNSWL_MED,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of christian.stocker@liip.ch designates 207.126.144.127 as permitted sender) Received: from [207.126.144.127] (HELO eu1sys200aog109.obsmtp.com) (207.126.144.127) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 11 Apr 2012 07:47:00 +0000 Received: from mail-we0-f172.google.com ([74.125.82.172]) (using TLSv1) by eu1sys200aob109.postini.com ([207.126.147.11]) with SMTP ID DSNKT4U24M4puCHjT9+1dmiMlyOCuXaUEWvp@postini.com; Wed, 11 Apr 2012 07:46:40 UTC Received: by werb10 with SMTP id b10so424605wer.3 for ; Wed, 11 Apr 2012 00:46:40 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=message-id:date:from:organization:user-agent:mime-version:to:cc :subject:references:in-reply-to:x-enigmail-version:openpgp :content-type:content-transfer-encoding:x-gm-message-state; bh=F1b2ArYG9+ZuQcdnMh3gW+GmabAAD6B6xiiiQKeXTrs=; b=YhIw3c83NS8S3ZayQ/Q96Rzq3xsY2JoM5/pxhV4GEtSzxP0jLSbctylzKzYbZ8JXGE agqoUFnAVwdWy7z4rq72Ccahmjp25CUpegpO8H5vDTMMWUwyxzoa9mj9wpzi+6P4FTly n7T/c3F9Plg4t2mfide/24pSgHhJd3RYmfAfod/c0WEODqiH0ENXmc+G7JphDGbhObdp qqkQ5xBcWYQmfFYEXQnUXgKKerYmtMwxJ3jxzbiXzucg8/VlQoR1KO8WoFN9wg258PSp 0yDQ7kBVeufcgSyK2JBqYOXMhUhhCXn3+7/WSyorat74khq+NR9fLo8wWGfwO5EeGwHl GpvQ== Received: by 10.180.102.102 with SMTP id fn6mr4104968wib.10.1334130399249; Wed, 11 Apr 2012 00:46:39 -0700 (PDT) Received: from air11.local (84-72-46-53.dclient.hispeed.ch. [84.72.46.53]) by mx.google.com with ESMTPS id ca3sm34448220wib.6.2012.04.11.00.46.37 (version=TLSv1/SSLv3 cipher=OTHER); Wed, 11 Apr 2012 00:46:38 -0700 (PDT) Message-ID: <4F8536DC.9080906@liip.ch> Date: Wed, 11 Apr 2012 09:46:36 +0200 From: Christian Stocker Organization: Liip AG User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20120327 Thunderbird/11.0.1 MIME-Version: 1.0 To: users@jackrabbit.apache.org CC: Ard Schrijvers Subject: Re: dealing with large result sets References: <1CAFB6FB-25EA-4938-B2D8-23A18FAC92F2@pooteeweet.org> <4F8400A0.5040402@liip.ch> <4F84038A.9090905@liip.ch> In-Reply-To: <4F84038A.9090905@liip.ch> X-Enigmail-Version: 1.4 OpenPGP: id=0748D5FE; url=http://gpg-keyserver.de/pks/lookup?op=get&search=0xC2BAFBC30748D5FE Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Gm-Message-State: ALoCoQmb0R9qneJWAd6X/ssMrgR24p6SnlIxvmP5m37cHBQDWM4IMs4J3wIkJZFh4Akgx/5LKkeM Ok, that made me wondering and I did some short tests on my macbook, there are approx. 600'000 nodes, which match those queries With xpath, without ordering /jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article'] 10 1st run: 455 ms 2nd run: 42 ms With xpath, with order by /jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article'] order by @firstImportDate 10 1st run: 2555 ms 2nd run: 16 ms Those numbers seem to be reasonable. With SQL2 without ordering: 10 1st run: 2'006'634 ms (33 minutes.) >From the log SQL2 SELECT took 2004498 ms. selector: [nt:base] AS data, columns: [data.jcr:primaryType], constraint: (data.[phpcr:class] = 'Own\ApiBundle\Document\Article') AND (ISDESCENDANTNODE(data, [/article])), offset 0, limit 10 SQL2 SORT took 1479 ms. SQL2 QUERY execute took 2006634 ms. native sort is false. With those results, I didn't even try a 2nd time (caches are full anyway) or with ordering. Something seems to be quite wrong here. If you want more measurements, just tell me Greetings chregu On 10.04.12 11:55, Christian Stocker wrote: > > > On 10.04.12 11:51, Ard Schrijvers wrote: >> On Tue, Apr 10, 2012 at 11:42 AM, Christian Stocker >> wrote: >>> >>> >>> On 10.04.12 11:32, Ard Schrijvers wrote: >>>> On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith wrote: >>>>> Hi, >>>>> >>>>> Currently I see some big issues with queries that return large result sets. A lot of work is not done inside Lucene, which will probably not be fixed soon (or maybe never inside 2.x). However I think its important to do some intermediate improvements. >>>>> >>>>> Here are some suggestions I have. I hope we can brainstorm together on some ideas that are feasible to get implemented in a shorter time period than waiting for Oak: >>>>> >>>>> 1) there should be a way to get a count >>>>> >>>>> This way if I need to do a query that needs to be ordered, I can first check if the count is too high to determine if I should even bother running the search. Aka in most cases a search leading to 100+ results means that who ever did the search needs to further narrow it down. >>>> >>>> The cpu is not spend in ordering the results: That is done quite fast >>>> in Lucene, unless you have millions of hits >>> >>> I read the code and also read this >>> https://issues.apache.org/jira/browse/JCR-2959 and it looks to me that >>> jackrabbit always sorts the result set by itself and not in lucene (or >>> maybe additionally). This makes it slow even if you have a limit set, >>> because it first sorts all nodes (fetching it from the PM if necessary), >>> then does the limit. Maybe I have missed something but real life tests >>> showed exactly this behaviour. >> >> Ah, I don't know about that part: We always sticked to xpath queries : >> Sorting is done in Lucene (more precisely, in some Lucene exensions in >> jr, but are equally fast) for at least xpath, I am quite sure > > Is the search part done differently in SQL2 and XPath? Can't remember ;) > >>>> The problem with getting a correct count is authorization : This total >>>> search index count should is fast (if you try to avoid some known slow >>>> searches). However, authorizing for example 100k+ nodes if they are >>>> not in the jackrabbit caches is very expensive. >>>> >>>> Either way: You get a correct count if you make sure that you include >>>> in your (xpath) search at least an order by clause. Then, to avoid >>>> 100k + hits, make sure you also set a limit. For example a limit of >>>> 501 : You can then show 50 pages of 10 hits, and if the count is 501 >>>> you state that there are at least 500+ hits >>> >>> That's what we do now, but it doesn't help (as said above) if we have >>> thousends of results which have to be ordered first. >> >> And the second sort is also slow? The first sort is also slow with >> Lucene, as Lucene needs to load all terms to sort on from FS in >> memory. However, consecutive searches are fast. We don't have problems >> for resultsets sorting for a million hits > > It definitively loaded all nodes from the PM before sorting it. The > lucene part itself was fast enough, that wasn't the issue. > >> >>> >>>> >>>> We also wanted to get around this, thus in our api hooked in a >>>> 'getTotalSize()' which returns the Lucene unauthorized count >>> >>> That would help us a lot, since we currently don't use the ACLs of >>> Jackrabbit, so the lucene count would be pretty correct for our use case. >> >> Yes, however, you would have to hook into jr itself to get this done > > Yep, saw that, that's somewhere deep in the code. That's why I didn't > try to adress that yet > > chregu > >> >> Regards Ard >> >>> >>> chregu >>> >>>> >>>>> >>>>> I guess the most sensible thing would be to simply offer a way to do SELECT COUNT(*) FROM .. >>>>> >>>>> 2) a way to automatically stop long running queries >>>> >>>> It is not just about 'long' . Some queries easily blow up, and bring >>>> you app to an OOM before they can be stopped. For example jcr:like is >>>> such a thing. Or range queries on many unique values >>> >>> >>>> >>>> Regards Ard >>>> >>>>> >>>>> It would be great if one could define a timeout for queries. If a query takes longer than X, it should just fail. This should be a global setting, but ideally it should be possible to override this on a per query basis. >>>>> >>>>> 3) .. ? >>>>> >>>>> regards, >>>>> Lukas Kahwe Smith >>>>> mls@pooteeweet.org >>>>> >>>>> >>>>> >>>> >>>> >>>> >>> >>> -- >>> Liip AG // Feldstrasse 133 // CH-8004 Zurich >>> Tel +41 43 500 39 81 // Mobile +41 76 561 88 60 >>> www.liip.ch // blog.liip.ch // GnuPG 0x0748D5FE >>> >> >> >> >