jackrabbit-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marcel Reutegger <marcel.reuteg...@gmx.net>
Subject Re: Query Performance and Optimization
Date Tue, 06 Mar 2007 09:26:17 GMT
Hi David,

David Johnson wrote:
> Yes, I am using Jackrabbit 1.2.x and I am not seeing that dramatic of a
> difference between 1.1.x and the 1.2.x, although I have not done a direct
> comparison between the two with the same query suite.

please note that you have to change the configuration to get the performance 
improvement mentioned by jukka. See: http://issues.apache.org/jira/browse/JCR-651

> It looks like adding
> ordering and or large range queries can significantly impact the query
> time.

I've also noticed this performance hit, but so far didn't find a way to improve 
the situation.

For the interested readers, here are the details:

sorting on custom fields in lucene is imo not very well integrated into the core 
because it was added in a later version (1.4). in addition lucene primarily 
focuses on and is optimized for sorting by relevance.

lucene is also not very well suited for incremental index updates, it is 
optimized for large batch updates and then assumes that an index reader is kept 
open for a long time. to accommodate this limitation, jackrabbit uses its own 
index segments to keep index readers open as long as possible. this allows 
lucene to use internal caches and speeds up index updates.

in certain cases however lucene caches are bound to the top level multi reader 
that spans the jackrabbit index segments. this is unfortunate because this multi 
reader will change whenever the index is updated and as an effect will 
invalidate caches that have a dependency to this multi index.

on of the them is the ScoreDocComparator created in FieldSortedHitQueue which is 
cached using a dependency to the top level multi reader. A ScoreDocComparator 
for relevance sorting is trivial and light weight, but a ScoreDocComparator on a 
field is quite heavy weight and is therefore cached in lucene.

some time ago I thought jackrabbit could use one ScoreDocComparator per index 
segment and keep the comparators on index segments but that requires an even 
more heavy weight version of the comparator that keeps the values of the index 
terms.

> I would be very interested in working on and/or looking into
> optimization strategies and/or experiments.  While I can puzzle out the
> code, and the structure of the query syntax tree, any 
> pointers/documentation
> would be very much appreciated.

any help is very much appreciated. currently there is only one document that 
gives an overview over the query implementation [1] plus the javadoc in the 
sources of course.

if you have any questions, feel free to ask on the dev list.

> In the case of an order by query and large range queries, it looks like
> significant time is spent in the
> org.apache.jackrabbit.core.query.lucene.SharedFiledSortComparator
> class, and that the result set is being walked in order to make sure that
> each result matches the query parameters and probably for sorting purposes.
> In the best case it would be good to have a pre-sorted index so that sorted
> results could be returned much more quickly.  Similarly, the ability to
> define indexes on node properties could be used to speed results.  Both of
> these types of indexes could be modeled and probably implemented similarly
> to well-known database indexing schemes.

jackrabbit already indexes properties using lucene. that is, the lucene query 
handler in jackrabbit already has the ability to enumerate documents (or nodes 
respectively) in sorted order for a certain property. however, the problem here 
is, that lucene initially orders documents using document numbers while it 
calculates the matching documents. This is different to database systems where 
the dbms already uses presorted sets according to the order by clause. so, I'm 
not sure how to optimize sorting in lucene in a dbms way.

hmm, maybe we could use scoring, but I'm not sure how reliable this would 
work... instead of using a regular score how well a document matches against a 
certain term we could implements a score value according to the term value and 
the sort order. this would much better leverage the internal structure and 
workings of lucene. anyone tried this before?

> Finally, there are also optimizations - more tailored to JSR-170 needs -
> that could be implemented with Lucene filters - specifically Node Type 
> query
> constraints - e.g., select * from my:mixin - the Filter could be defined to
> limit query searches / results to only those nodes that fulfill the 
> specific
> node type constraint.  These filters could be defined for each custom node
> type defined in the system and pre-calculated.

again filters are one of the components in a lucene query that have a dependency 
on the top level index reader. because in jackrabbit the index reader changes 
with every update the filter will have to be re-calculated with every update. 
unless we find a way to create and use filters per index segments.

I'm not sure if this would even give much of a performance improvement, because 
node type constraints are basically simple term queries that are very fast in 
lucene because those are the building blocks of more complex queries.

> I have not worked with Lucene Filters before, so I am not familiar with
> their speed - it might be interesting to get input from someone that has
> more experience with the run time behavior of Filters.  Nevertheless, the
> Lucene docs mention Filters as a method to get around large range queries
> that inevitably break the 1024 Lucene query term limit.

filters in lucene are fast once you have them, but usually expensive to 
calculate. because you need to re-calculate them again after you changed the 
index it is important to minimize the number of time you do that. as mentioned 
above I think using filters only makes sense if we can use one filter per index 
segment and only re-calculate those filters for the youngest segment or for 
merged segments.

> Further information on the Query Syntax Tree that is used by the
> LuceneQueryBuilder - and "safe" ways to modify it  i.e., I would rather 
> only
> modify the Query Syntax Tree and continue to use the LuceneQueryBuilder for
> most of the query processing - would be appreciated.

why do you want to modify the query tree?

regards
  marcel

[1] http://jackrabbit.apache.org/doc/arch/operate/query.html

Mime
View raw message