jackrabbit-users mailing list archives

From Simon Gaeremynck <gaeremyn...@gmail.com>
Subject Query totals - approximations.
Date Thu, 01 Jul 2010 10:16:04 GMT
First off, I know the question of whether it is possible to get an accurate count from
query results has been asked many times before.
I know Jackrabbit only loads the next result when it really has to, which is fine
since it gives a great performance boost.
I also know you can "trick/force" Jackrabbit into returning a total by adding a sort,
but that's not really what we want.

So we thought we might take a Google approach where we say
  "Displaying first 10 results of approximately 1400000."

Some more info about this:
Now, to do this we thought we could get the hit count from Lucene, get the first 10 nodes,
keep a record of how many Lucene Documents we had to iterate over to get those first 10
and then do a very rudimentary approximation of how many nodes the user would be able to see
for this query.
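
The bookkeeping described above could be sketched roughly as follows. This is not Jackrabbit code: FirstPageScan, the hits iterator of document ids and the canRead predicate are hypothetical stand-ins for the lazily loaded query hits and the READ access check.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical helper: pulls hits lazily and records how many had to be scanned. */
final class FirstPageScan {
    /** Document ids the user is allowed to read, in hit order. */
    final List<Integer> visibleDocs = new ArrayList<>();
    /** Number of hits consumed from the iterator, readable or not. */
    int scanned = 0;

    /**
     * Consumes hits until pageSize readable documents are found
     * or the iterator is exhausted.
     */
    static FirstPageScan fetch(Iterator<Integer> hits, IntPredicate canRead, int pageSize) {
        FirstPageScan scan = new FirstPageScan();
        while (scan.visibleDocs.size() < pageSize && hits.hasNext()) {
            int doc = hits.next();
            scan.scanned++;                 // count every document we touched
            if (canRead.test(doc)) {        // assumed stand-in for the ACL check
                scan.visibleDocs.add(doc);
            }
        }
        return scan;
    }
}
```

The two values that matter afterwards are visibleDocs.size() and scanned; their ratio is the fraction of hits the user was allowed to see on the first page.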

1.  Lucene returns a total hitcount of 1.523.145
2.  We fetch the first 10 Nodes, which requires iterating over 452 Documents because most
of them could not be used: the user doesn't have READ access.
3.  Based on these 2 numbers we approximate that the user can see 33700 Nodes
(1523145 * 10 / 452).
4.  We round this number off to 33000 just to indicate that it's unlikely we guessed right.
5.  The UI displays a message along the lines of:
          Displaying page 1 of approximately 3300
          Showing 10 results per page.
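
The arithmetic behind these steps is plain proportional scaling: the hit count multiplied by the fraction of scanned Documents that turned out to be readable (10 out of 452 here, which gives roughly 33700). A minimal sketch, assuming that simple model; ResultEstimate and its method names are made up for illustration and are not part of Jackrabbit or Lucene.

```java
/** Hypothetical helper for the "approximately N results" message. */
final class ResultEstimate {
    private ResultEstimate() {}

    /** Scales the raw hit count by the readable fraction seen so far (integer division). */
    static long estimateVisible(long totalHits, int visibleFound, int scanned) {
        if (scanned <= 0) {
            return 0; // nothing scanned yet, so there is nothing to extrapolate from
        }
        return totalHits * visibleFound / scanned;
    }

    /** Rounds down so that only the given number of leading digits is kept. */
    static long roundDown(long value, int significantDigits) {
        long limit = 1;
        for (int i = 0; i < significantDigits; i++) {
            limit *= 10;
        }
        long magnitude = 1;
        while (value / magnitude >= limit) {
            magnitude *= 10;
        }
        return value / magnitude * magnitude;
    }

    /** Number of pages needed to show the given number of results. */
    static long pageCount(long results, int pageSize) {
        return (results + pageSize - 1) / pageSize;
    }
}
```

With the numbers from the list, estimateVisible(1523145, 10, 452) yields 33697, roundDown(33697, 2) yields 33000, and pageCount(33000, 10) yields 3300. The estimate assumes the first page is representative of the whole result set, which may well not hold if readable nodes cluster at the top or bottom of the hits.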

Now I had a look at how Jackrabbit executes queries and there seem to be 3 ways it gets the
QueryHits (in JackrabbitIndexSearcher.evaluate)
- Check if it is a JackrabbitQuery and let the Query implementation deal with it.
- It is not a JackrabbitQuery and there is no sort required -- use LuceneQueryHits
- It is not a JackrabbitQuery and there is a sort required -- use SortedLuceneQueryHits

So far I've only been able to get the Lucene hit count from the SortedLuceneQueryHits because
it uses a TopFieldDocCollector and it's very simple to get it from there ^-^.
All the other ones use the same concept as the NodeIterator/RowIterator and only load the
next one when asked. (Note: I'm an absolute Lucene novice)
Maybe this question should be asked on the Lucene list rather than here, but is there a way
to grab the hitcount from a query? (be it Jackrabbit or Lucene)

Not having an approximate result total really is a blocker for us.
Is the above idea doable or is it utter madness?

My apologies for this very long email.
