jackrabbit-dev mailing list archives

From Christoph Kiehl <christ...@sulu3000.de>
Subject Re: Query Performance and Optimization
Date Tue, 13 Mar 2007 22:28:08 GMT
David Johnson wrote:

> Out of the Jackrabbit code,
> DescendantSelfAxisQuery.DescendantSelfAxisScorer.next()
> is now taking the most time while executing my query suite - taking 68% of
> the time, within it, calls to
> DescendantSelfAxisQuery.DescendantSelfAxisScorer.calculateSubHits() taking
> the majority of time (basically all of the time).  Then calls to
> BooleanScorer2.score(HitCollector) - back to Lucene code - is taking the
> majority of time.  If more specific profiling data is desired, please feel
> free to ask.  I can also share the profile data in the form of a Netbeans
> profile snapshot.

To my understanding, calculateSubHits() can be divided into two parts:

- The first part queries all nodes that are directly addressed by your xpath 
(for /foo/bar//* this will be /foo/bar[1], /foo/bar[2], ...). This query is 
quite fast in my experience.
- The second part does the actual work, i.e. the lucene query on the node 
attributes. I don't think there is much potential for improvement here unless 
you dig into lucene itself.

DescendantSelfAxisScorer.next() is a different story. This method takes the result 
of part two (subHits) and filters out all nodes that are neither part of the result 
of part one (contextHits) nor a descendant of one of the nodes in contextHits. To 
filter these nodes a lot of parent-child relations have to be resolved. I think 
there is some caching potential for contextHits here if you use the same base like 
/foo/bar//* for a lot of queries. But such a cache would only be valid for a 
particular IndexReader, that is to say it would only be beneficial if your 
repository is quite stable.
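To make the filtering and the proposed cache concrete, here is a minimal sketch in 
plain Java. This is not Jackrabbit code: the node hierarchy is modelled as a simple 
parent array, and ReaderKey is a hypothetical stand-in for the IndexReader that 
would serve as the cache key.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;

public class DescendantFilterSketch {

    // Hypothetical stand-in for the IndexReader; it only serves as the cache key.
    static class ReaderKey { }

    // contextHits cached per reader; a WeakHashMap drops the entry as soon as
    // the reader is garbage collected, e.g. after the index has changed.
    private static final Map<ReaderKey, BitSet> CONTEXT_CACHE = new WeakHashMap<>();

    /**
     * Keeps only those subHits that are context nodes themselves or descendants
     * of a context node. parent[i] holds the doc id of node i's parent, -1 for
     * the root.
     */
    static BitSet filter(ReaderKey reader, BitSet subHits, BitSet contextHits,
                         int[] parent) {
        BitSet ctx = CONTEXT_CACHE.computeIfAbsent(reader, r -> contextHits);
        BitSet result = new BitSet();
        for (int doc = subHits.nextSetBit(0); doc >= 0;
             doc = subHits.nextSetBit(doc + 1)) {
            // resolve parent-child relations by walking the ancestor-or-self axis
            for (int n = doc; n >= 0; n = parent[n]) {
                if (ctx.get(n)) {
                    result.set(doc);
                    break;
                }
            }
        }
        return result;
    }
}
```

As long as the same ReaderKey instance is passed in, the ancestor walk runs against 
the cached contextHits; once the reader is collected, the cache entry goes with it.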

I was digging a bit into Jackrabbit today and found another place where some 
caching provided a substantial performance gain for queries which check one 
attribute for more than one value (like /foo/*[@foo:bar='john' or 
@foo:bar='doe']). The BitSet in calculateDocFilter() is right now created twice 
for the query above. On large repositories this takes about 200ms per BitSet on 
my machine for a particular field. Caching these BitSets per IndexReader and 
field in a WeakHashMap with the IndexReader as the key gave me some real 
improvements. But this caching, too, is only beneficial for repositories that 
are not changing heavily, since changes lead to the creation of new IndexReaders 
and invalidate the cache.
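As a sketch of that BitSet cache (again plain Java, not the actual Jackrabbit 
code: ReaderKey is a hypothetical stand-in for the IndexReader, and the expensive 
~200ms filter computation is replaced by a caller-supplied function):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Supplier;

public class DocFilterCacheSketch {

    // Hypothetical stand-in for the IndexReader; only used as the outer cache key.
    static class ReaderKey { }

    // Weak on the reader: when the repository changes and a new IndexReader
    // replaces the old one, the stale BitSets are collected along with it.
    private static final Map<ReaderKey, Map<String, BitSet>> CACHE =
            new WeakHashMap<>();

    /** Returns the cached BitSet for (reader, field), computing it only once. */
    static BitSet docFilter(ReaderKey reader, String field,
                            Supplier<BitSet> compute) {
        return CACHE.computeIfAbsent(reader, r -> new HashMap<>())
                    .computeIfAbsent(field, f -> compute.get());
    }
}
```

With something like this in place, a query such as /foo/*[@foo:bar='john' or 
@foo:bar='doe'] would pay the cost of building the foo:bar filter once instead 
of twice.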

As both of the caches mentioned above rely heavily on IndexReader reuse, it would 
probably be better to have caches per index segment, as someone suggested in the 
thread about using Lucene filters, since segments are relatively stable.

That's what I've found out so far. I'll do some more research over the next few 
days, as we definitely need to improve query performance for our application.

I would like to hear some comments from the Jackrabbit gurus - and feel free to 
correct me, I just started ;)

