jackrabbit-oak-dev mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
Date Thu, 20 Sep 2012 09:52:50 GMT

Sorry to chime in so late in this thread; I hope my remarks are still
welcome. I did read the entire thread. I won't reply inline, but will
just try to recap and explain how we got around it in the Hippo
repository. The problem is obvious:

*** How to efficiently get a correct count of total hits when
fine-grained authorization is involved ***

Regarding the remark in this thread, 'For example, it doesn't make
any sense to display "1045430 hits" if calculating this number takes
1.5 hours': I wholeheartedly agree, but our customers *never* agree.
They want the exact hit count, no matter what.

So, we tackled this at Hippo as follows.

1) Next to getSize() on the iterator, we also added getTotalSize(). I
don't like the name, because it is actually more something like
getTotalSizeWithoutCheckingACLs(). This method directly gives you
the number of hits from the backing search index. That one is fast, of
course. What is slow is authorizing potentially 1,000,000 hits,
because all those nodes need to be fetched from backing storage, etc.
However, most of our customers have an application that shows the
results for some site user below some folder, and the site user has
read access to the entire folder. There we just show
getTotalSizeWithoutCheckingACLs() as the total hits. Worst case, the
number is higher than the actual number the site user is allowed to
see.

2) We have our ACLs based on node properties. Hence, we have been able
to create an AuthorizationQuery, mapped directly to a cached Lucene
bitset. When a JCR session searches in our repository, we combine the
search with that session's cached authorization bitset. After changes
in the repository we need to reload these bitsets (on request), but
they are shared between all users that have the same authorization. It
is blisteringly fast that way, and results in correct authorized counts.
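
As an illustration of the idea in (2), here is a minimal sketch, not
the actual Hippo code. It uses java.util.BitSet as a stand-in for a
Lucene bitset, and the class name AuthorizedCount, the authorization
key and the loader function are all hypothetical. One bitset of
readable document ids is cached per distinct authorization and shared
by every user with that authorization; the authorized count is the
cardinality of the intersection with the query hits.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: names and structure are assumptions, not Hippo's API.
final class AuthorizedCount {
    // One bitset of readable document ids per distinct authorization.
    private final Map<String, BitSet> cache = new ConcurrentHashMap<>();
    // Assumed to evaluate the "authorization query" for a given key.
    private final Function<String, BitSet> loader;

    AuthorizedCount(Function<String, BitSet> loader) {
        this.loader = loader;
    }

    // After a repository change, drop the bitsets; they are rebuilt on request.
    void invalidate() {
        cache.clear();
    }

    // Number of query hits that the given authorization is allowed to read.
    int count(BitSet hits, String authorizationKey) {
        BitSet readable = cache.computeIfAbsent(authorizationKey, loader);
        BitSet authorized = (BitSet) hits.clone(); // do not modify the caller's hits
        authorized.and(readable);
        return authorized.cardinality();
    }
}
```

The expensive part (building the readable set) happens once per
authorization and per repository change, not once per hit.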

Now, I don't think (2) can be part of Oak, as it implies a certain ACL
model which is not generic enough. Quite a few ACL mappings of course
cannot be translated to a Lucene query. However, (1) should be very
issue, and already is a lot better. It is up to the developer then to
use getTotalSizeWithoutCheckingACLs() (and then a decent name :-) or
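
To make the trade-off in (1) concrete, here is a minimal in-memory
sketch. It is only an illustration under assumptions: the class name
HitCounts, the list of hit paths and the canRead predicate are
hypothetical and stand in for the search index and the per-node ACL
check.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the two counts; not an existing Oak or Hippo API.
final class HitCounts {
    private final List<String> indexHits;     // paths returned by the search index
    private final Predicate<String> canRead;  // assumed per-node access check

    HitCounts(List<String> indexHits, Predicate<String> canRead) {
        this.indexHits = indexHits;
        this.canRead = canRead;
    }

    // Fast: raw hit count from the index, no ACL checks. May overcount.
    long getTotalSize() {
        return indexHits.size();
    }

    // Exact but potentially slow: every hit goes through the access check.
    long getSize() {
        return indexHits.stream().filter(canRead).count();
    }
}
```

When the user can read everything below the searched folder, both
methods return the same number and only the cheap one needs to be
called.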

My 2 cents

Regards Ard

On Tue, Sep 11, 2012 at 12:08 PM, Jukka Zitting <jukka.zitting@gmail.com> wrote:
> Hi,
> [moving this to oak-dev@ for a broader discussion]
> On Tue, Sep 11, 2012 at 9:55 AM, Thomas Mueller (JIRA) <jira@apache.org> wrote:
>> [...] For compatibility with Jackrabbit 2.0, and for ease of use, it would be good
>> to have a clearly defined way to get the size of the result. [...]
> I've always found the -1 return value from getSize() incredibly
> annoying as it forces client code to use extra conditionals and go
> through extra hoops if the size turns out not to be available. There
> are basically three potential scenarios:
> 1. The client doesn't need to know the size, so it never calls getSize().
> 2. The client does need to know the size, so it calls getSize() and
> has to iterate through all results if getSize() returns -1.
> 3. The client could use the size (for UI, optimization, etc.), so it
> calls getSize() and ignores the result if it's -1.
> The main problem I have with the -1 return value is that case 2
> becomes really annoying to handle.
> Instead I'd propose the following design:
> * The getSize() method always returns the size, by buffering all
> results in memory if necessary.
> * A separate hasSize() method can be used to check if the size is
> quickly available (i.e. if getSize() will complete in O(1) time).
> With such a design the above cases become easier to handle:
> 1. The client doesn't need to know the size, so it never calls getSize().
> 2. The client does need to know the size, so it calls getSize().
> 3. The client could use the size (for UI, optimization, etc.), so it
> calls hasSize() and possibly follows up with getSize().
> PS. Note that an "estimated size" feature like seen in
> many public search engines ("results 1-10 of thousands") is really
> difficult to implement in a manner that's both efficient and secure.
> Public search engines can make such estimates efficiently since all
> their content is public and they thus don't need to worry about
> accidentally leaking sensitive information.
> BR,
> Jukka Zitting
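
The hasSize()/getSize() design proposed in the quoted message could
look roughly like the sketch below. This is only one possible reading
of the proposal: the class name SizedResult, the knownSize parameter
and the iterator() method are assumptions made for the example, and it
assumes getSize() is called before the client starts iterating.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the proposed contract; not an existing JCR or Oak API.
final class SizedResult<T> {
    private final Iterator<T> source;
    private final long knownSize; // negative if the source cannot report a size
    private List<T> buffer;       // filled only if getSize() has to count

    SizedResult(Iterator<T> source, long knownSize) {
        this.source = source;
        this.knownSize = knownSize;
    }

    // True if getSize() will complete in O(1) time.
    boolean hasSize() {
        return knownSize >= 0 || buffer != null;
    }

    // Always returns the size, buffering all results in memory if necessary.
    long getSize() {
        if (knownSize >= 0) {
            return knownSize;
        }
        if (buffer == null) {
            buffer = new ArrayList<>();
            while (source.hasNext()) {
                buffer.add(source.next());
            }
        }
        return buffer.size();
    }

    // Iterates over the buffered results if getSize() had to consume the source.
    Iterator<T> iterator() {
        return buffer != null ? buffer.iterator() : source;
    }
}
```

A client in case 3 would call hasSize() first and only follow up with
getSize() when it returns true; a client in case 2 calls getSize()
directly and accepts the buffering cost.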

