From: Simon Gaeremynck
Subject: Query totals - approximations.
Date: Thu, 1 Jul 2010 11:16:04 +0100
Message-Id: <4514E33C-3ECD-46EB-AE6D-3C6F7435E787@gmail.com>
To: users@jackrabbit.apache.org

First off, I know the question of whether it is possible to get an accurate count from query results has been asked many times before.
I know Jackrabbit only loads the next result when it really has to, which is fine since it gives a great performance boost.
I also know you can "trick/force" Jackrabbit into returning a total by adding a sort, but that's not really what we want.

So we thought we might take a Google approach where we say
"Displaying first 10 results of approximately 1400000."

Some more info about this:
To do this, we thought we could get the hit count from Lucene, get the first 10 nodes, keep a record of how many Lucene Documents we had to iterate over to get those first 10, and then do a very rudimentary approximation of how many nodes the user would be able to see for this query.

i.e.:
1. Lucene returns a total hit count of 1.523.145
2. We fetch the first 10 Nodes, which results in 452 Documents that needed to be processed but could not be used because the user doesn't have READ access.
3. Based on these 2 numbers we approximate that the user can see 3370 Nodes.
4. We round this number off to 3300 just to indicate that it's unlikely we guessed right.
5. The UI displays a message along the lines of:
"
Displaying page 1 of approximately 330
Showing 10 results per page.
"

Now I had a look at how Jackrabbit executes queries, and there seem to be 3 ways it gets the QueryHits (in JackrabbitIndexSearcher.evaluate):
- Check if it is a JackrabbitQuery and let the Query implementation deal with it.
- It is not a JackrabbitQuery and there is no sort required -- use LuceneQueryHits
- It is not a JackrabbitQuery and there is a sort required -- use SortedLuceneQueryHits

So far I've only been able to get the Lucene hit count from the SortedLuceneQueryHits, because it uses a TopFieldDocCollector and it's very simple to get it from there ^-^.
All the other ones use the same concept as the Node/Row-Iterators and only load the next one when asked.

(Note: I'm an absolute Lucene novice)
Maybe this question should be asked on the Lucene list rather than here, but is there a way to grab the hit count from a query (be it Jackrabbit or Lucene)?

The lack of an approximate result total really is a blocker for us.
Is the above idea doable or is it utter madness?

My apologies for this very long email.

Regards,
Simon
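
To make the arithmetic in steps 1 to 5 concrete, here is a minimal sketch in plain Java. It makes no Jackrabbit or Lucene calls. The class name ResultEstimator and its methods are hypothetical, made up for illustration. The extrapolation formula is an assumption, since the steps above don't spell one out: it scales the Lucene hit count by the fraction of scanned Documents that turned out to be readable.

```java
// Hypothetical helper, not part of Jackrabbit or Lucene.
final class ResultEstimator {

    private ResultEstimator() {}

    /**
     * Assumed formula: totalHits * returned / (returned + rejected).
     * 'returned' is the number of nodes handed to the user so far,
     * 'rejected' the number of Documents skipped for lack of READ access.
     */
    static long estimateVisible(long totalHits, long returned, long rejected) {
        if (totalHits < 0 || returned < 0 || rejected < 0) {
            throw new IllegalArgumentException("counts must not be negative");
        }
        long scanned = returned + rejected;
        if (scanned == 0) {
            return 0; // nothing sampled yet, so nothing to extrapolate from
        }
        if (scanned >= totalHits) {
            return returned; // every hit was inspected, the count is exact
        }
        return Math.round((double) totalHits * returned / scanned);
    }

    /** Rounds down to the given number of significant digits (1 to 18). */
    static long roundDown(long value, int significantDigits) {
        if (value < 0 || significantDigits < 1 || significantDigits > 18) {
            throw new IllegalArgumentException("unsupported arguments");
        }
        long limit = 1;
        for (int i = 0; i < significantDigits; i++) {
            limit *= 10;
        }
        long magnitude = 1;
        // Drop trailing digits until only 'significantDigits' remain.
        while (value / magnitude >= limit) {
            magnitude *= 10;
        }
        return value / magnitude * magnitude;
    }

    /** Number of pages needed to show 'results' items, rounding up. */
    static long pageCount(long results, int pageSize) {
        if (results < 0 || pageSize < 1) {
            throw new IllegalArgumentException("unsupported arguments");
        }
        return (results + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        // Made-up sample: 100000 hits, 10 readable nodes out of 100 scanned.
        long estimate = estimateVisible(100_000, 10, 90);
        long rounded = roundDown(estimate, 2);
        System.out.println("Displaying page 1 of approximately "
                + pageCount(rounded, 10));
        System.out.println("Showing 10 results per page.");
    }
}
```

Rounding down to two significant digits mirrors step 4 (3370 becomes 3300), and dividing by the page size gives the page count in step 5 (330 pages of 10). The estimate only gets better as more pages are fetched, so it could be recomputed on every page request.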