jackrabbit-users mailing list archives

From Simon Gaeremynck <gaeremyn...@gmail.com>
Subject Re: Query totals - approximations.
Date Wed, 07 Jul 2010 08:44:18 GMT
Hi Ard,

My apologies for waiting so long to respond.

See response inline

On 1 Jul 2010, at 21:45, Ard Schrijvers wrote:

> hello Simon,
> On Thu, Jul 1, 2010 at 12:16 PM, Simon Gaeremynck <gaeremyncks@gmail.com> wrote:
>> First off I know the question has been asked many times before whether
>>  it is possible to get an accurate count from query results.
>> I know Jackrabbit only loads the next result when it really has to, which is fine
>> since it gives a great performance boost.
>> And I also know you can "trick/force" Jackrabbit to return a total by adding a sort
>> in there but that's not really what we want.
>> So we thought we might take a Google approach where we say
>>  "Displaying first 10 results of approximately 1400000."
> Note that Google has, apart from the number of data, obviously quite
> an easy job: there is no authorization involved. If you are using
> Gmail, check out the number of results they show you there: first 20
> of hundreds, or 'first 20 of thousands' : Also note, that gmail is an
> entirely domain specific solution, where, it should be easier to show
> actual hitcounts.
> Obviously, I do not want to talk about Google, but want to give you an
> idea about the complexity: Authorized exact counting when you have a
> fine grained accessmanager can only be shown correctly if you
> authorize every Lucene hit. Extrapolating like you do below is imo
> really not a very good solution, see below:

Could you elaborate on "authorizing every Lucene hit"?
AFAIK that is what Jackrabbit does?

>> Some more info about this:
>> Now, to do this we thought we could get the hit count from Lucene, get the first
>> 10 nodes,
>> keep a record of how many Lucene Documents we had to iterate over to get those first
>> and then do a very rudimentary approximation of how many nodes the user would be
>> able to see for this query.
>> ie:
>> 1.  Lucene returns a total hitcount of 1.523.145
>> 2.  We fetch the first 10 Nodes which results in 452 Documents that needed to be
>>     processed but could not be used because the user doesn't have READ access.
>> 3.  Based on these 2 numbers we approximate that the user can see 3370 Nodes.
>> 4.  We round this number off to 3300 just to indicate that it's unlikely we guessed
>> 5.  The UI displays a message along the lines of:
>>        "
>>          Displaying page 1 of approximately 330
>>          Showing 10 results per page.
>>        "
> imo, you assume that access is evenly scattered over the repository. I
> think this is not a realistic assumption. It might be in your case,
> but it is not very general. Imo, you certainly cannot extrapolate it
> like this.

Yes, I know, and it's far from perfect.
It is a start, however: at least we would be able to give the user some
indication (however poor it is).
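
To make the extrapolation concrete, here is a minimal sketch of the arithmetic.
The class and method names are made up for illustration (nothing here is
Jackrabbit or Lucene API), and it assumes read access is spread evenly over
the hits, which is exactly the assumption you are questioning. Taking the
figures from my example literally (10 readable out of 462 scanned), it yields
roughly 33,000, so the 3370 I quoted looks off by a factor of ten.

```java
// Hypothetical helper, not part of Jackrabbit or Lucene.
final class HitCountEstimate {

    /**
     * Extrapolates how many of the Lucene hits the user can read, based on
     * the ratio of granted to scanned documents seen while filling page one.
     * Assumes access is evenly distributed over the hits.
     */
    static long estimateReadable(long luceneHits, long granted, long denied) {
        long scanned = granted + denied;
        if (scanned == 0) {
            return 0; // nothing scanned, nothing to extrapolate from
        }
        return luceneHits * granted / scanned;
    }

    /** Rounds down to a multiple of step to signal that the value is a guess. */
    static long roundDown(long value, long step) {
        return value - (value % step);
    }

    public static void main(String[] args) {
        long estimate = estimateReadable(1_523_145L, 10, 452);
        System.out.println("approximately " + roundDown(estimate, 100) + " results");
    }
}
```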

>> Now I had a look at how Jackrabbit executes queries and there seem to be 3 ways it
>> gets the QueryHits (in JackrabbitIndexSearcher.evaluate)
>> - Check if it is a JackrabbitQuery and let the Query implementation deal with it.
>> - It is not a JackrabbitQuery and there is no sort required -- use LuceneQueryHits
>> - It is not a JackrabbitQuery and there is a sort required -- use SortedLuceneQueryHits
>> So far I've only been able to get the Lucene hit count from the SortedLuceneQueryHits
>> because it uses a TopFieldDocCollector and it's very simple to get it from there ^-^.
>> All the other ones use the same concept as the Node/Row- Iterators and only load
>> the next one when asked. (Note: I'm an absolute Lucene novice)
>> Maybe this question should be asked on the Lucene list rather than here, but is there
>> a way to grab the hitcount from a query? (be it Jackrabbit or Lucene)
> Getting total hitcount from lucene is really easy, but this is not
> where the pain is. It is about authorization. Fine grained
> authorization is not manageable to index. This is quite a general
> issue between searching and authorization. Caching it is also quite
> hard, as Lucene does not have stable ids. At Hippo we have an
> accessmanager which acts on properties of documents. I was able to
> write the access rules as Lucene queries, and used some extra indexing.
> This way, instant authorized counting was achieved, which is
> especially nice for faceted navigation, which is exposed over jcr as
> virtual nodes. But, this all is quite some work, and most likely not
> feasible for you. However, I do understand your issue.
> So, without trying only to discourage you, what kind of access
> manager do you have? Is it based on properties?

We use the default access manager in Jackrabbit plus some extensions of our own.
These extensions include dynamic ACEs,
e.g. a date-based ACE:
if currentTime < timeOnAceNode then user has jcr:read=none
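
As a minimal sketch of that rule (the class and method names are invented for
illustration; this is neither our actual code nor Jackrabbit API), the point
is that the outcome depends on the moment of evaluation, so it cannot simply
be written into the index ahead of time:

```java
// Hypothetical sketch of a date-based access control entry.
final class DateBasedAce {

    // Corresponds to "timeOnAceNode": read access is denied before this instant.
    private final long readableFromMillis;

    DateBasedAce(long readableFromMillis) {
        this.readableFromMillis = readableFromMillis;
    }

    /** Returns true if jcr:read is granted at the given time (epoch millis). */
    boolean grantsRead(long nowMillis) {
        // currentTime < timeOnAceNode means no read access; otherwise granted.
        return nowMillis >= readableFromMillis;
    }
}
```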

We do not know the full extent of these dynamic rules, as they are
hooked up to Drools and we allow admins/managers to write their
own custom rules.

>> Having an approximation of a result total really is a blocker for us.
>> Is the above idea doable or is it utter madness?
> As we did not yet hook into the jackrabbit search count part, some
> customer also faced this problem. He in the end agreed on the
> following:
> Showing 10 of more than 200 hits
> we would limit (and thus authorize) the search to 200. When you go to
> 200, you can increase the limit to, say, 1000

Can you elaborate on this?
We currently limit the number of nodes a person can retrieve through searching.
Are you doing the same then?
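
If I understand the idea correctly, it could be sketched roughly as below:
authorize hits only until the cap is reached and report "more than N" when
further readable hits exist. This is my reading of your description, not your
implementation; the names are hypothetical, and a plain Iterator and
Predicate stand in for the query hits and the access check.

```java
import java.util.Iterator;
import java.util.function.Predicate;

// Hypothetical sketch of capped, authorized counting.
final class CappedCount {

    /** Number of readable hits counted, and whether more exist past the cap. */
    static final class Result {
        final int count;
        final boolean more;

        Result(int count, boolean more) {
            this.count = count;
            this.more = more;
        }

        String label() {
            return more ? "more than " + count + " hits" : count + " hits";
        }
    }

    /** Counts readable hits, stopping as soon as one is found beyond the limit. */
    static <T> Result countUpTo(Iterator<T> hits, Predicate<? super T> canRead, int limit) {
        int count = 0;
        while (hits.hasNext()) {
            T hit = hits.next();
            if (!canRead.test(hit)) {
                continue; // not readable by this user, skip it
            }
            if (count == limit) {
                return new Result(count, true); // cap reached and more remain
            }
            count++;
        }
        return new Result(count, false); // exact count, below or at the cap
    }
}
```

Raising the limit (say from 200 to 1000) when the user pages up to the cap
would then just mean calling this again with the larger value.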

> Hope this helps a little,
> Ard
>> My apologies for this very long email.
>> Regards,
>> Simon
