jackrabbit-dev mailing list archives

From Christoph Kiehl <christ...@sulu3000.de>
Subject Optimize search performance
Date Fri, 08 Jun 2007 10:50:15 GMT

Hi everyone,

I had a look at the search-related code over the last few days, because we need
better performance for range queries on date fields as well as for sorting by
date fields. These are my thoughts so far:

1. Wouldn't it make sense to skip the index for the "jcr:system" tree (which
is located at repository/index by default) if the query being executed doesn't
include items from the "jcr:system" tree?
Take for example a query like "my:app//element(*, foo:bar)". This query only 
searches for nodes located under "my:app" which excludes nodes from "jcr:system" 
and therefore doesn't need to search in the "jcr:system" index.
As the "jcr:system" tree might grow quite quickly if you create a lot of
versions, it might be worth excluding its index.
I'm not sure, though, how hard it would be to find out whether a query needs to
include the "jcr:system" index.
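
To make the idea concrete, here is a minimal sketch of such a check. It is not
Jackrabbit code: the class and method names are hypothetical, and it only looks
at the first path step of an XPath statement. Anything it cannot classify is
treated as "needs the jcr:system index", so it errs on the safe side.

```java
// Hypothetical helper, not part of Jackrabbit: decides from the leading path
// step of an XPath statement whether the "jcr:system" index can be skipped.
final class SystemIndexCheck {

    static boolean needsSystemIndex(String xpath) {
        String q = xpath.trim();
        // Jackrabbit XPath statements may start with the /jcr:root step.
        if (q.startsWith("/jcr:root")) {
            q = q.substring("/jcr:root".length());
        }
        // "//" right at the root selects descendants of every top-level node.
        if (q.startsWith("//")) {
            return true;
        }
        if (q.startsWith("/")) {
            q = q.substring(1);
        }
        // Isolate the first location step and drop a predicate, if any.
        int end = q.indexOf('/');
        String first = end < 0 ? q : q.substring(0, end);
        int predicate = first.indexOf('[');
        if (predicate >= 0) {
            first = first.substring(0, predicate);
        }
        // Wildcards and node tests such as element(...) may match jcr:system.
        if (first.isEmpty() || first.equals("*") || first.contains("(")) {
            return true;
        }
        // A concrete name: only jcr:system itself needs the system index.
        return first.equals("jcr:system");
    }

    public static void main(String[] args) {
        System.out.println(needsSystemIndex("my:app//element(*, foo:bar)"));
        System.out.println(needsSystemIndex("//element(*, nt:version)"));
    }
}
```

A real implementation would more likely inspect the parsed query tree than the
raw statement, but the decision rule would be the same.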

2. Lucene uses FieldCaches to speed up sorting and range queries, which is
exactly what we are after. Those FieldCaches are per IndexReader.
Jackrabbit uses an IndexSearcher which searches on a single IndexReader, which
is most likely an instance of CachingMultiReader. So whenever a search builds up
a FieldCache, that FieldCache instance is associated with this instance of
CachingMultiReader. Successive queries which operate on the same
CachingMultiReader get a tremendous speedup if they can reuse those associated
FieldCache instances.
The problem is that Jackrabbit creates a new CachingMultiReader _every time_ one
of the underlying indexes is modified. This means if you change just _one_ item
in the repository you will need to rebuild all those FieldCaches, because the
existing FieldCaches are associated with the old instance of CachingMultiReader.
This not only leads to slow search response times for queries which contain
range queries or are sorted by a field, but also to massive memory consumption
(depending on the size of your indexes), because there might be multiple
instances of CachingMultiReader in use if you have a scenario where a lot of
queries and item modifications are executed concurrently.
As far as I understand, the solution is to use a MultiSearcher which uses
multiple IndexReaders. Since most of the indexes are stable due to the merging
strategy, their FieldCaches could be reused for a much longer time.
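
The effect can be illustrated with a small simulation. This is a sketch only:
it uses neither Lucene nor Jackrabbit classes. Reader, FieldCacheSketch and the
document counts are made up, and the only behaviour taken from the description
above is that cached values are keyed by reader instance and that a new
top-level reader is created on every modification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;

// Toy model (hypothetical names): counts how many field values have to be
// loaded when the cache is keyed by reader instance.
final class FieldCacheSketch {

    /** Stand-in for an index reader; the cache is keyed on its identity. */
    static final class Reader {
        final int numDocs;
        final List<Reader> subReaders;

        Reader(int numDocs, List<Reader> subReaders) {
            this.numDocs = numDocs;
            this.subReaders = subReaders;
        }

        static Reader segment(int numDocs) {
            return new Reader(numDocs, new ArrayList<Reader>());
        }

        static Reader composite(List<Reader> segments) {
            int total = 0;
            for (Reader s : segments) {
                total += s.numDocs;
            }
            return new Reader(total, new ArrayList<Reader>(segments));
        }
    }

    private final Map<Reader, long[]> cache = new WeakHashMap<Reader, long[]>();
    long docsLoaded = 0;

    /** Returns the cached values for this reader, loading them on a miss. */
    long[] get(Reader reader) {
        long[] values = cache.get(reader);
        if (values == null) {
            values = new long[reader.numDocs];
            docsLoaded += reader.numDocs; // cost of filling the cache
            cache.put(reader, values);
        }
        return values;
    }

    /** Each round: one modification, then one sorted query. */
    static long simulate(boolean perSegment, int rounds) {
        FieldCacheSketch fieldCache = new FieldCacheSketch();
        List<Reader> stable = new ArrayList<Reader>();
        for (int i = 0; i < 3; i++) {
            stable.add(Reader.segment(1000)); // merged, rarely changing indexes
        }
        for (int round = 0; round < rounds; round++) {
            List<Reader> segments = new ArrayList<Reader>(stable);
            segments.add(Reader.segment(10)); // small index replaced by the change
            Reader top = Reader.composite(segments); // new top-level reader
            if (perSegment) {
                for (Reader s : top.subReaders) {
                    fieldCache.get(s); // stable segments hit the cache
                }
            } else {
                fieldCache.get(top); // always a miss: new instance
            }
        }
        return fieldCache.docsLoaded;
    }

    public static void main(String[] args) {
        System.out.println(simulate(false, 3)); // prints 9030
        System.out.println(simulate(true, 3));  // prints 3030
    }
}
```

With the cache keyed on the top-level reader, every round reloads all 3010
values. Keyed per segment, the three stable segments are loaded once and only
the small changing segment is reloaded.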

I quickly tried to modify SearchIndex to use a MultiSearcher with multiple
IndexReaders wrapped by IndexSearchers, but wasn't successful because somewhere
in DescendantSelfAxisWeight the index readers are required to implement
HierarchyResolver, which ReadOnlyIndexReader doesn't.

So I thought I might ask you what you think about those two ideas before
spending too much time walking down the wrong path ;)

Cheers,
Christoph

