jackrabbit-users mailing list archives

From "sbarriba" <sbarr...@yahoo.co.uk>
Subject RE: JCR Query Result Caching
Date Mon, 22 Sep 2008 10:25:43 GMT
Hi Ard,
Having read through your email and
egarding-performance-td15028655.html in more detail with respect to:
 * <param name="respectDocumentOrder" value="false"/>

....if we have JCR node types that are explicitly "ordered", will making the
above change mean that all ordering is ignored? We have nodes with
same-name siblings which we need to be returned in the right order. Or is
it only an issue where there is no explicit ordering?

Has anyone actually indexed path information? We'd naively assumed limiting
queries using jcr:path was the best way to ensure performance with large
data sets.


-----Original Message-----
From: Ard Schrijvers [mailto:a.schrijvers@onehippo.com] 
Sent: 10 August 2008 11:14
To: users@jackrabbit.apache.org
Subject: RE: JCR Query Result Caching

Hello Shaun,

First of all let me point you to a set of tips I wrote some time ago
about performance for queries, see [1].

> Hi all,
> As our data set increases the overhead of executed JCR 
> queries is increasing. For example, we typically want to 
> display the top 3 latest BlogEntries on a page requiring 
> "select * from acme:BlogEntries where jcr:path like 
> '/home/myblog/%'". Profiling shows Lucene access to be a 
> hotspot under load. Noted that we can review our node 
> structure but ...

After reading the link above, you probably know where the bottleneck lies:
the path constraint '/home/myblog/%'. I am not sure what kind of numbers
of nodes you are talking about: 1,000 blog entries, or 1,000,000? (See
below for a workaround for your problem, which I am confident will solve
your issue.)
> Q1: Does JackRabbit provide any facilities to cache the 
> results of queries such that they can be shared by concurrent 
> sessions for a particular time to live?

What do you mean by the result of a query? Are you talking about the
Lucene result, or the JR QueryResult, for instance? Anyway, to answer:
Lucene has internal caching, and Jackrabbit has a cache for hierarchical
relations (which your queries need a lot, since you have
'/home/myblog/%'). Also note that before JR version 1.4 (from the top of
my head, so not 100% sure) this hierarchical cache was broken, so it also
depends on your version. Furthermore, I do not think caching the JR
QueryResult is a good idea: it might be session-dependent whether some
nodes are allowed in the result or not.

> As a query returns a set of JCR Nodes, which in turn are 
> session specific, I'm assuming caching query results is 
> tricky. Caching query results quickly brings us into the 
> realm of transactional semantics, isolation levels etc.

Yes, and you won't find your performance improvement here either
(obviously, it will be fast when cached, but that is not the way to go).
Furthermore, since nodes are fetched lazily, it is not the fetching of
nodes that is slow (unless you want thousands of results at once); it is
your hierarchical query.

> I'd be interested to hear any experiences in attempting to 
> cache JCR query results?

So by now you probably know that your bottleneck most likely lies within
the Lucene search. If you have <param name="respectDocumentOrder"
value="true"/> (see [1]), this might also be a big performance hit.
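For reference, that parameter sits on the SearchIndex element in each
workspace's workspace.xml; a fragment like the following (the index path
below is the stock default, adjust to your setup):

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- skip re-sorting Lucene hits into document order when the
       query itself specifies no ordering -->
  <param name="respectDocumentOrder" value="false"/>
</SearchIndex>
```

Note this only affects queries without an explicit 'order by'; an explicit
ordering in the query is still honored.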

So I think you have three options, two of which involve extending some
Jackrabbit code, which you might not like:

1) Extend the SearchIndex and cache certain Lucene queries (I'm not
exactly sure which and how; it might be coupled to the kind of queries
you are running).

2) During indexing, also index path information (extend NodeIndexer).
When searching for a simple path expression like /foo/bar//*, you can
then match it to one single Lucene term, which is blisteringly fast up to
millions of nodes. Realize, though, that you give up the almost-free
moving of nodes in Jackrabbit. It is a simple trade-off.
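The idea behind (2) can be illustrated without any Jackrabbit code: at
index time you store, for every node, each of its ancestor paths as a
separate term, so a descendant query like /foo/bar//* becomes one exact
term lookup instead of a per-result path check. A minimal stdlib-only
sketch (the Map stands in for a Lucene "ancestors" field; all names are
illustrative, not Jackrabbit API):

```java
import java.util.*;

public class PathIndexSketch {
    // ancestor path -> node paths; stands in for a Lucene term index
    private final Map<String, Set<String>> index = new HashMap<>();

    /** Index a node under every one of its ancestor paths. */
    public void addNode(String path) {
        String ancestor = path;
        while (true) {
            int slash = ancestor.lastIndexOf('/');
            if (slash <= 0) break;          // stop above the root
            ancestor = ancestor.substring(0, slash);
            index.computeIfAbsent(ancestor, k -> new TreeSet<>()).add(path);
        }
    }

    /** A descendant query /foo//* is now a single exact-term lookup. */
    public Set<String> descendantsOf(String path) {
        return index.getOrDefault(path, Collections.emptySet());
    }

    public static void main(String[] args) {
        PathIndexSketch idx = new PathIndexSketch();
        idx.addNode("/home/myblog/2008/entry1");
        idx.addNode("/home/myblog/2008/entry2");
        idx.addNode("/home/otherblog/entry3");
        System.out.println(idx.descendantsOf("/home/myblog"));
    }
}
```

The trade-off mentioned above is visible here too: moving a node means
re-deriving all its ancestor terms, i.e. reindexing the moved subtree.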

3) If you do not want any programming, then change your SQL/XPath query.
Basically, from [1] you should know what the problem is: if the 'where'
clause returns many results (and if you don't have a where clause at all,
that is every node), all results need to be checked against the path
constraint to decide whether they should be included or not. So, if you
can limit the initial set with some 'where' clause that returns fewer
hits, the query will become faster. You want the last three BlogEntries
added, right? So you most likely have a timestamp property. Now, suppose
on average 10 entries are added every day; then add to the 'where' clause
a constraint that says: only nodes where timestamp > lastweektimestamp.
Far fewer results will then need to be checked against their path
constraint. Still, results added last week from all over the repository
will be in the result after the 'where' clause, so if you also know that
blog entries are of some specific node type, add this to the 'where'
clause as well, to only include nodes of type 'blog'.
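As a concrete sketch of (3): compute a cutoff date and add it to the
'where' clause alongside the path constraint. The node type
acme:BlogEntries and the path come from your original query; the
'timestamp' property name and the TIMESTAMP literal syntax are
assumptions here (check the SQL dialect of your JR version):

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;

public class NarrowedQuery {
    /** Builds a JCR SQL query limited by path, node type and recency. */
    public static String blogQuery(Calendar cutoff) {
        // ISO-8601 with zone offset, as used in JCR date literals
        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX");
        String ts = iso.format(cutoff.getTime());
        return "select * from acme:BlogEntries"
             + " where jcr:path like '/home/myblog/%'"
             + " and timestamp > TIMESTAMP '" + ts + "'"
             + " order by timestamp desc";
    }

    public static void main(String[] args) {
        Calendar lastWeek = Calendar.getInstance();
        lastWeek.add(Calendar.DAY_OF_MONTH, -7);  // one week back
        System.out.println(blogQuery(lastWeek));
    }
}
```

The point is only that the timestamp and node-type constraints shrink the
candidate set before the expensive path check; the exact property names
are your own.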

I am quite sure that if you follow the ideas from point 3, your queries
will be more than fast enough for millions of nodes in the repository,
whereas the query you have now probably slows down after several tens of
thousands of nodes.

Hopefully this info helps,


[1] http://wiki.apache.org/jackrabbit/Performance

> Regards,
> Shaun
