jackrabbit-dev mailing list archives

From Marcel Reutegger <marcel.reuteg...@gmx.net>
Subject Re: Query Performance and Optimization
Date Fri, 09 Mar 2007 14:25:32 GMT
David Johnson wrote:
> In my last tests, I think I have done this - through parameters in the
> repository.xml file and recreating the entire repository.  Nevertheless, I
> did not see that significant of a speed change in query response.  
> Perhaps I
> wasn't using a small enough resultFetchSize (128)?  I also set
> respectDocumentOrder to false.  Other suggestions?

no, those two parameters are the ones that affect performance most notably.
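
for reference, a SearchIndex configuration in repository.xml setting those two 
parameters might look roughly like this (the values are only examples):

     <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
       <param name="path" value="${wsp.home}/index"/>
       <!-- fetch result nodes lazily in smaller batches -->
       <param name="resultFetchSize" value="50"/>
       <!-- skip re-ordering results into document order -->
       <param name="respectDocumentOrder" value="false"/>
     </SearchIndex>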

> In my tests so far, I am not making any changes to the repository while
> running the queries.  Could this be considered a best case scenario - i.e.,
> the Lucene indexes are not being updated?

yes, this should ensure that caching in Lucene is used wherever possible, 
although there might be bugs that prevent this, such as this one:

http://svn.apache.org/viewvc?view=rev&revision=506908

which prevented the re-use of SharedFieldSortComparator even when nothing changed 
between two query execution calls. You might want to check whether this patch 
improves the situation for you.

> What would be the expected
> performance change if I have ongoing updates while querying the system?

depending on the query, the performance will be either just a bit slower or 
significantly slower. E.g. an order by on a property that is present on nearly 
every node will significantly slow down the query (-> see my previous post about 
Lucene's ability to cache the SharedFieldSortComparator).
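
for illustration, a query along these lines (the ordering property is just an 
example) runs into exactly that sorting cost:

     // any javax.jcr.Session will do; the ordering property is illustrative
     javax.jcr.query.QueryManager qm = session.getWorkspace().getQueryManager();
     javax.jcr.query.Query q = qm.createQuery(
         "//element(*, nt:base) order by @jcr:lastModified descending",
         javax.jcr.query.Query.XPATH);
     javax.jcr.query.QueryResult result = q.execute();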

> So far the experiments that I have done with Lucene filters and Jackrabbit
> have been disappointing.  I essentially used a QueryFilter and passed that
> to the IndexSearcher in SearchIndex.executeQuery.  I created filters for 
> one
> sub-part of a query - NodeType matching created in public Object
> visit(NodeTypeQueryNode node, Object data) of the LuceneQueryBuilder class.
> Rather than adding its part of the query to the larger Lucene query, I had
> the function create a QueryFilter using the sub-part of the Lucene query
> that would have been created by the function, and then returning null
> instead of the query.  The filter was then later combined with the rest of
> the query in IndexSearcher.  Finally, the filter was only created once, and
> added to a filter map, so that it could be reused for queries against the
> same nodetype.

I still think that caching the documents that match a node type will not help 
much, because those are simple term queries, which are very efficient in Lucene.
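
to illustrate why: a node type constraint ends up as a single term lookup in the 
inverted index (the field name and value encoding below are only illustrative, 
not Jackrabbit's actual internal encoding):

     // a plain TermQuery - about as cheap as a lucene query gets
     org.apache.lucene.search.Query ntQuery =
         new org.apache.lucene.search.TermQuery(
             new org.apache.lucene.index.Term("jcr:primaryType", "nt:unstructured"));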

> I also noticed that a new IndexSearcher was created for each query 
> processed
> - is this breaking the Lucene cache of filters?

No, Lucene uses the index reader as the key for its caches, e.g. in
QueryFilter.bits():

     synchronized (cache) {  // check cache
       BitSet cached = (BitSet) cache.get(reader);
       if (cached != null) {
         return cached;
       }
     }


> This is what I was hoping for, although I am wondering if the new
> IndexSearcher that is created for each query execution is eliminating the
> filter cache?

afaics all filter caches are tied to the index reader and not the searcher instance.
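
to make that concrete, a minimal sketch (Lucene-style API; the index path is 
illustrative):

     // the reader is opened once and shared, the searchers are throw-away
     org.apache.lucene.index.IndexReader reader =
         org.apache.lucene.index.IndexReader.open("/path/to/index");
     org.apache.lucene.search.IndexSearcher s1 =
         new org.apache.lucene.search.IndexSearcher(reader);
     org.apache.lucene.search.IndexSearcher s2 =
         new org.apache.lucene.search.IndexSearcher(reader);
     // a QueryFilter's bits() computed while searching with s1 is found again in
     // the cache when searching with s2, because the key is 'reader', not the searcher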

> I needed to "modify" the query tree so that the parts I "optimized" (i.e.,
> removed) wouldn't create additional Lucene query terms.  It seems that
> having the visitor function return null effectively removes any terms that
> may have been created by that part of the visitor - in LuceneQueryBuilder.
> Is this correct?

I think writing your own customized LuceneQueryBuilder is easier than modifying 
the query tree.

> In my explorations, I have noticed that a range query e.g., (myDate >
> startDate AND myDate < endDate) seem to be translated into two Lucene range
> queries that are ANDed together that looks like - ((startDate < myDate <
> MAX_DATE) AND (MIN_DATE < myDate < endDate)).  I am guessing that Lucene
> calculates each of the two range queries in isolation, and then ANDs the
> results together - in a sense forcing the walk of the entire document set.

that's correct.

> Just from the look of it, it seems inefficient, although I am not sure how
> much better it would be to translate the query to a single range query -
> this transformation would also require a little analysis on the Query 
> Syntax
> Tree - or a post optimization on the Lucene query.

yes, this would be very useful and would probably improve performance 
significantly in certain cases. The LuceneQueryBuilder should probably do this 
optimization by analyzing the query tree.
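
a rough sketch of that rewrite (Lucene 2.x API; the field name and the encoded 
date values are illustrative, and the open ends are shown as null where the 
current builder effectively uses MIN/MAX bounds):

     String start = "2007-01-01";   // illustrative index-encoded values
     String end = "2007-03-09";

     // what is generated today: two half-open ranges ANDed together
     org.apache.lucene.search.BooleanQuery combined =
         new org.apache.lucene.search.BooleanQuery();
     combined.add(new org.apache.lucene.search.RangeQuery(
             new org.apache.lucene.index.Term("myDate", start), null, false),
         org.apache.lucene.search.BooleanClause.Occur.MUST);
     combined.add(new org.apache.lucene.search.RangeQuery(
             null, new org.apache.lucene.index.Term("myDate", end), false),
         org.apache.lucene.search.BooleanClause.Occur.MUST);

     // what the builder could emit instead: one range with both bounds
     org.apache.lucene.search.Query single = new org.apache.lucene.search.RangeQuery(
         new org.apache.lucene.index.Term("myDate", start),
         new org.apache.lucene.index.Term("myDate", end), false);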

> Finally, some ideas on indexing.  It seems that there are two or perhaps
> three choices that could be made for improving indexing:
> 
> 1) Augment the already existing Lucene index with additional information to
> speed certain types of queries - e.g., range queries.  For example, it 
> might
> be possible to index each of the "bytes" of a date or long and optimize the
> query using this additional information.

Can you please elaborate on how this would work?

> 2) Create an external indexing structure for nodetypes and fields that 
> would
> mirror similar structures in a database.  Again this data could be used to
> optimize range queries as well as "sorted" results.

I'm not sure how you would connect those two structures, because Lucene uses 
ephemeral document numbers:

"IndexReader: For efficiency, in this API documents are often referred to via 
document numbers, non-negative integers which each name a unique document in the 
index. These document numbers are ephemeral--they may change as documents are 
added to and deleted from an index. Clients should thus not rely on a given 
document having the same number between sessions."

> 3) Use the database to provide the indexing structures.

To me this seems to be a very interesting option, though it requires 
considerable effort.

> Short term, I would like to see if there is any low hanging fruit in #1 
> that
> could be used to significantly speed up several types of queries.  I would
> also like to create some "theoretical" benchmarks to shoot for - i.e., what
> is the fastest that we could expect to access a node by UUID?  through a
> range query?  matching a particular node type and value?  Since database
> technology is mature and well optimized, this might be a good place to look
> for "the best that would could ever expect" as far as query speed goes and
> then compare Jackrabbit's results (query processing time).  I am not
> interested in matching DB speeds one-to-one, although I think it can be a
> useful benchmark that might get us thinking about why certain mismatches
> occur.

Agreed. Such benchmarks would be very useful.

regards
  marcel
