jackrabbit-users mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: Faceted Search Implementation
Date Wed, 25 Aug 2010 09:19:33 GMT
Hello Ian et al,

On Wed, Aug 25, 2010 at 10:43 AM, Ian Boston <ieb@tfd.co.uk> wrote:
>
> On 25 Aug 2010, at 07:55, Ard Schrijvers wrote:
>
>> Also note that the faceted navigation is exposed including an
>> authorization filter: thus, we expose faceted navigation with correct,
>> authorized counts, all blisteringly fast as it is all in Lucene.
>
> Ard,
> I am interested in the counting.
> Is this done by counting the number of results from a search, by maintaining an aggregate
> counter by events, or by adding a low-level Lucene class to generate the count?

It is the last of these: we have chosen to have access rules based on
properties on nodes. (Through an 'auto-derived' property that sets
the path on a node as well, we can also create access rules like
'nothing below this folder', but the actual access checking is still
based on a single property on a node.) We have been able to translate
the access rules for this access manager into Lucene queries (actually
very simple ones, and thus very fast ones).
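
To illustrate why a derived path property keeps such a rule simple, here
is a minimal sketch in plain Java. It is only an assumption about how
this could work, not our actual implementation: the class and method
names are hypothetical, and no Lucene or JCR API is used. Each node gets
the list of its ancestor paths as a derived property, so 'nothing below
this folder' becomes an exact match on one value instead of a prefix
comparison.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: derives ancestor paths so a folder rule is an exact-value check. */
final class PathRuleSketch {

    /** Returns the strict ancestors of an absolute path, root excluded, e.g. /a/b/c -> [/a, /a/b]. */
    static List<String> ancestors(String path) {
        List<String> result = new ArrayList<>();
        int idx = path.indexOf('/', 1);
        while (idx != -1) {
            result.add(path.substring(0, idx));
            idx = path.indexOf('/', idx + 1);
        }
        return result;
    }

    /** 'Nothing below this folder': true when the folder is one of the node's ancestors. */
    static boolean isBelow(String nodePath, String folder) {
        return ancestors(nodePath).contains(folder);
    }
}
```

With exact matching, a sibling such as /content/hrx is not mistaken for
something below /content/hr, which a naive string prefix test would get
wrong.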

So, what we have in a nutshell is:

1) When traversing the virtual tree structure of faceted navigation,
the 'fac nav query' grows with new key/value pairs: this is
translated into a Lucene query.
2) The Lucene query from (1) is combined with an authorization query
(which could be a cached BitSet as well, but we do not have
performance issues: I tested with > 300,000 documents exposed over
faceted navigation. It is pretty much instant, even for all kinds of
range queries).
3) I am just about to check in a demosuite/site that exposes (1) and
(2) as faceted navigation, with an extra filter that comes from one
of the Jackrabbit queries, like xpath, sql etc. We can expose any
Jackrabbit search over authorized faceted navigation with correct
counting. (With (3), however, we suffer from the notoriously slow range
queries in Jackrabbit, but this is something I can hopefully work on
in the coming year in the core of Jackrabbit.)
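
Steps (1) and (2) can be sketched roughly as follows. This is a
simplified stand-in in plain Java, not our Lucene code: documents are
just ints, the "index" is a map of java.util.BitSet per facet value, and
the class and method names (FacetCountSketch, drillDown, counts) are
hypothetical. The point it shows is that the count per facet value is
the cardinality of (facet query AND authorization set AND value), so no
hits need to be fetched.

```java
import java.util.BitSet;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** Simplified stand-in for an inverted index: facet -> value -> documents carrying that value. */
final class FacetCountSketch {

    private final Map<String, Map<String, BitSet>> index = new TreeMap<>();

    /** Registers that document docId has the given value for the given facet. */
    void add(int docId, String facet, String value) {
        index.computeIfAbsent(facet, f -> new TreeMap<>())
             .computeIfAbsent(value, v -> new BitSet())
             .set(docId);
    }

    /** Step 1: every selected key/value pair narrows the match set (logical AND). */
    BitSet drillDown(Map<String, String> selected, int maxDoc) {
        BitSet hits = new BitSet(maxDoc);
        hits.set(0, maxDoc); // start with all documents 0..maxDoc-1
        for (Map.Entry<String, String> e : selected.entrySet()) {
            BitSet docs = index.getOrDefault(e.getKey(), Collections.emptyMap()).get(e.getValue());
            if (docs == null) { // unknown facet or value: nothing matches
                hits.clear();
                break;
            }
            hits.and(docs);
        }
        return hits;
    }

    /** Step 2: AND with the authorized documents, then count each value of the next facet. */
    Map<String, Integer> counts(String facet, BitSet hits, BitSet authorized) {
        BitSet visible = (BitSet) hits.clone();
        visible.and(authorized);
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, BitSet> e : index.getOrDefault(facet, Collections.emptyMap()).entrySet()) {
            BitSet matching = (BitSet) e.getValue().clone();
            matching.and(visible);
            if (!matching.isEmpty()) {
                result.put(e.getKey(), matching.cardinality());
            }
        }
        return result;
    }
}
```

The authorized set plays the role of the cached BitSet mentioned in (2);
in this sketch it is simply passed in by the caller.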

The online demo here http://www.demo.onehippo.com/  has lots of
faceted stuff, which is just our jcr exposed faceted navigation. We
will include (3) shortly, to also show free text search in combination
with faceted navigation.

If you log in to the console at:

https://cms.demo.onehippo.com/console/ with admin06 / admin06 and
browse, for example, to:

/content/documents/hippogogreen/jobfacets

you can see the differently coloured folders: these are virtual jcr nodes.
We thus just fetch them over jcr. If you want to see the low-level jcr
properties, you can also go to

https://cms.demo.onehippo.com/repository/

same credentials. It is just another jcr view.

Obviously, as an admin you can destroy the demo: we flush the content
every 2 hours, but it is still appreciated if you do not completely break it
through the console :-)

>
> I have been looking at generating aggregate counts of facets on large datasets, and have
> not found a solution other than retrieving all the hits from a search. JR2.1 appears to be
> entirely lazy in its retrieval of results and hence there are no totals until the entire set
> is retrieved. That's fine for small result sets, but for large ones it's a killer. At the moment
> the best we can do is to count up to some number (e.g. 500) and beyond that say there are >
> 500. Is there a count(*) function in JCR queries?

There is no count(*). I stopped testing my faceted navigation
exposing facets over ranges after 300,000 documents: it kept being
fast, and it does not do any caching yet. I will add this when needed.
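
For comparison, the capped counting you describe (count up to a limit,
then report "> limit") can be sketched over any lazy iterator. This is a
minimal sketch with a hypothetical class name, using only java.util; a
JCR NodeIterator could be passed in since it is a java.util.Iterator. It
reads at most limit + 1 hits, so the cost is bounded but the exact total
stays unknown beyond the cap.

```java
import java.util.Iterator;

/** Hypothetical helper: counts lazily retrieved hits up to a cap. */
final class CappedCount {

    /** Returns the exact count when it is at most limit, otherwise "> limit". */
    static String describe(Iterator<?> hits, int limit) {
        int seen = 0;
        // Consume one hit more than the limit so "exactly limit" and "more" can be told apart.
        while (seen <= limit && hits.hasNext()) {
            hits.next();
            seen++;
        }
        return seen > limit ? "> " + limit : Integer.toString(seen);
    }
}
```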


>
> I don't think this is a problem specific to Jackrabbit; rather, it's a problem for any search
> index on an ACL'd data set where the range of ACL combinations is greater than the number of
> items in the set (i.e. cardinality of the inverted index is so great it's pointless indexing)

Yes, this is a general authorized searching issue. Some frameworks,
like Lucene Connectors Framework, index documents along with some
'authorisation tokens'. Afaics, when you do it at indexing time, this is
only possible when you have very stable ACLs which hardly ever change,
and a couple of 'authorisation groups' that everybody belongs
to. So, for example, for a shared filesystem in your company, I can
imagine that there are, say, 3 groups: management, managers and the
slaves. Now, indexing three extra tokens per document is easy. Before
querying the index from the LCF, you first ask the connector for a
token of the current user, et voila, you get authorised search from,
say, Solr. *But*, for more complex authorisation rules, or complex
ACLs, I do not see this as an option. However, I never asked the LCF
people how they see this.
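
The index-time token idea boils down to something like the sketch below.
It is plain Java with hypothetical names and a toy in-memory "index"; it
does not use the LCF or Solr APIs, and the group names are made up for
the example. Each document stores the groups allowed to read it, and a
search keeps only hits that share at least one token with the user.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy index that stores authorisation tokens next to each document at indexing time. */
final class TokenIndexSketch {

    private static final class Doc {
        final String text;
        final Set<String> tokens;

        Doc(String text, Set<String> tokens) {
            this.text = text;
            this.tokens = tokens;
        }
    }

    private final Map<String, Doc> docs = new LinkedHashMap<>();

    /** Indexing time: store the text together with the groups allowed to read it. */
    void index(String id, String text, String... allowedGroups) {
        docs.put(id, new Doc(text, new HashSet<>(Arrays.asList(allowedGroups))));
    }

    /** Query time: a hit must contain the term AND share a token with the user. */
    List<String> search(String term, Set<String> userTokens) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Doc> e : docs.entrySet()) {
            Doc d = e.getValue();
            boolean matches = d.text.contains(term);
            boolean allowed = !Collections.disjoint(d.tokens, userTokens);
            if (matches && allowed) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

The weakness is visible in index(): any ACL change means re-indexing
every affected document, which is why this only suits stable ACLs with a
few groups.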

Regards Ard

>
> Ian
