jackrabbit-users mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: Query totals - approximations.
Date Thu, 01 Jul 2010 20:45:32 GMT
Hello Simon,

On Thu, Jul 1, 2010 at 12:16 PM, Simon Gaeremynck <gaeremyncks@gmail.com> wrote:
> First off I know the question has been asked many times before whether
>  it is possible to get an accurate count from query results.
> I know Jackrabbit only loads the next result when it really has to, which is fine
> since it gives a great performance boost.
> And I also know you can "trick/force" Jackrabbit to return a total by adding a sort in there but that's not really what we want.
>
> So we thought we might take a Google approach where we say
>  "Displaying first 10 results of approximately 1400000."

Note that Google, apart from the sheer amount of data, obviously has
quite an easy job: there is no authorization involved. If you are using
Gmail, check out the number of results they show you there: 'first 20
of hundreds' or 'first 20 of thousands'. Also note that Gmail is an
entirely domain-specific solution, where it should be easier to show
actual hit counts.

Obviously, I do not want to talk about Google, but I want to give you an
idea of the complexity: with a fine-grained access manager, an exact
authorized count can only be correct if you authorize every Lucene hit.
Extrapolating like you do below is imo really not a very good solution,
see below:

>
> Some more info about this:
> Now, to do this we thought we could get the hit count from Lucene, get the first 10 nodes,
> keep a record of how many Lucene Documents we had to iterate over to get those first 10
> and then do a very rudimentary approximation of how many nodes the user would be able to see for this query.
>
> ie:
> 1.  Lucene returns a total hitcount of 1.523.145
> 2.  We fetch the first 10 Nodes which results in 452 Documents that needed to be processed but could not be used because the user doesn't have READ access.
> 3.  Based on these 2 numbers we approximate that the user can see 3370 Nodes.
> 4.  We round this number off to 3300 just to indicate that it's unlikely we guessed right.
> 5.  The UI displays a message along the lines of:
>        "
>          Displaying page 1 of approximately 330
>          Showing 10 results per page.
>        "

Imo, you assume that access is evenly distributed over the repository.
I do not think this is a realistic assumption. It might hold in your
case, but it does not hold in general, so imo you certainly cannot
extrapolate like this.
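
To make the assumption concrete, here is a minimal sketch of the
extrapolation in plain Java. The class and method names
(UniformEstimate, estimateReadable) are made up for illustration and
are not part of Jackrabbit. It also assumes that the 452 rejected
documents come on top of the 10 accepted ones, which the message does
not spell out. Under that reading the formula gives roughly 33,000,
not the 3370 quoted above, so the original estimate may have been
computed differently.

```java
final class UniformEstimate {

    /**
     * Extrapolates the number of readable hits, assuming (unrealistically)
     * that readable documents are spread evenly over the Lucene result set.
     * All names here are hypothetical; this is not a Jackrabbit API.
     */
    static long estimateReadable(long luceneHits, long accepted, long rejected) {
        long scanned = accepted + rejected;
        if (scanned == 0) {
            return 0; // nothing scanned yet, so nothing to extrapolate from
        }
        return luceneHits * accepted / scanned;
    }

    public static void main(String[] args) {
        // Numbers from the example in this thread: 10 accepted, 452 rejected.
        System.out.println(estimateReadable(1_523_145L, 10, 452));

        // Same hit count, but the readable documents happen to be clustered
        // at the front of the result set: no rejections while filling page 1,
        // so the estimate claims that every single hit is readable.
        System.out.println(estimateReadable(1_523_145L, 10, 0));
    }
}
```

The second call shows the problem: the estimate depends entirely on
where the readable documents sit in the result order, not on how many
there really are.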

>
>
>
> Now I had a look at how Jackrabbit executes queries and there seem to be 3 ways it gets the QueryHits (in JackrabbitIndexSearcher.evaluate)
> - Check if it is a JackrabbitQuery and let the Query implementation deal with it.
> - It is not a JackrabbitQuery and there is no sort required -- use LuceneQueryHits
> - It is not a JackrabbitQuery and there is a sort required -- use SortedLuceneQueryHits
>
> So far I've only been able to get the Lucene hit count from the SortedLuceneQueryHits because it uses a TopFieldDocCollector and it's very simple to get it from there ^-^.
> All the other ones use the same concept as the Node/Row- Iterators and only load the next one when asked. (Note: I'm an absolute Lucene novice)
> Maybe this question should be asked on the Lucene list rather than here, but is there a way to grab the hitcount from a query? (be it Jackrabbit or Lucene)

Getting the total hit count from Lucene is really easy, but this is not
where the pain is. The pain is authorization. Fine-grained
authorization is not manageable to index. This is quite a general
issue between searching and authorization. Caching it is also quite
hard, as Lucene does not have stable ids. At Hippo we have an
access manager which acts on properties of documents. I was able to
write the access rules as Lucene queries, and used some extra indexing.
This way, instant authorized counting was achieved, which is
especially nice for faceted navigation, which is exposed over JCR as
virtual nodes. But this all is quite some work, and most likely not
feasible for you. However, I do understand your issue.

So, without wanting only to discourage you: what kind of access
manager do you have? Is it based on properties?

>
>
> Having an approximation of a result total really is a blocker for us.
> Is the above idea doable or is it utter madness?

Since we have not yet hooked into the Jackrabbit search count part, one
of our customers also faced this problem. In the end he agreed on the
following:

Showing 10 of more than 200 hits

We would limit (and thus authorize) the search to 200 hits. When the
user pages up to 200, you can increase the limit to, say, 1000.
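
A minimal sketch of such a capped count, in plain Java rather than
against the Jackrabbit API. The names CappedCount, countUpTo and label
are hypothetical. The iterator stands in for the query result and the
canRead predicate for the access manager check.

```java
import java.util.Iterator;
import java.util.function.Predicate;

final class CappedCount {

    /**
     * Counts readable hits but stops authorizing as soon as the cap is
     * exceeded. A return value of cap + 1 means "more than cap".
     */
    static <T> int countUpTo(Iterator<T> hits, Predicate<? super T> canRead, int cap) {
        int readable = 0;
        while (hits.hasNext() && readable <= cap) {
            if (canRead.test(hits.next())) {
                readable++;
            }
        }
        return readable;
    }

    /** Builds the message for the UI from the capped count. */
    static String label(int pageSize, int readable, int cap) {
        int shown = Math.min(pageSize, readable);
        if (readable > cap) {
            return "Showing " + shown + " of more than " + cap + " hits";
        }
        return "Showing " + shown + " of " + readable + " hits";
    }
}
```

The cost is bounded by the number of hits you need to authorize to
find cap + 1 readable ones, and when the count stays below the cap the
number shown is exact.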

Hope this helps a little,

Ard

>
> My apologies for this very long email.
>
>
> Regards,
> Simon
