jackrabbit-dev mailing list archives

From "David Johnson" <dbjohnso...@gmail.com>
Subject Re: Query Performance and Optimization
Date Fri, 09 Mar 2007 19:30:09 GMT
-- snip --

> yes, this should ensure that caching in lucene is used wherever possible.
> Even though there might be bugs that prevent this. Just like this one:
>
> http://svn.apache.org/viewvc?view=rev&revision=506908
>
> which prevented the re-use of SharedFieldSortComparator even if nothing
> changed between two query execution calls. You might want to check if
> this patch improves the situation for you.



That "patch" actually helped a fair amount - from 60 seconds to run 100
queries down to around 40 seconds.  Not too bad.  Any reason why this hasn't
made it into the current releases?


> > What would be the expected performance change if I have ongoing updates
> > while querying the system?
>
> depending on the query the performance will be just a bit slower or
> significantly slower. e.g. an order by in your query on a property that
> exists on nearly every node will significantly slow down the query
> (-> see my previous post about lucene's ability to cache the
> SharedFieldSortComparator).
>
> > So far the experiments that I have done with Lucene filters and
> > Jackrabbit have been disappointing.  I essentially used a QueryFilter
> > and passed that to the IndexSearcher in SearchIndex.executeQuery.  I
> > created filters for one sub-part of a query - NodeType matching created
> > in public Object visit(NodeTypeQueryNode node, Object data) of the
> > LuceneQueryBuilder class.  Rather than adding its part of the query to
> > the larger Lucene query, I had the function create a QueryFilter using
> > the sub-part of the Lucene query that would have been created by the
> > function, and then return null instead of the query.  The filter was
> > then later combined with the rest of the query in IndexSearcher.
> > Finally, the filter was only created once, and added to a filter map,
> > so that it could be reused for queries against the same nodetype.
>
> I still think that caching the documents that match a node type will not
> help much because those are simple term queries, which are very efficient
> in lucene.



Yes, it didn't seem to help that much.
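
For reference, the filter map I mentioned looked roughly like this (a
simplified sketch - "NODETYPE" is a stand-in for whatever field name the
index actually uses for the node type):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.QueryFilter;
    import org.apache.lucene.search.TermQuery;

    public class NodeTypeFilterCache {

        // nodetype name -> Filter; reusing the same Filter instance across
        // queries is what makes QueryFilter's internal BitSet cache (keyed
        // by IndexReader) effective, as long as the reader stays stable
        private final Map filters = new HashMap();

        public synchronized Filter getFilter(String nodeTypeName) {
            Filter f = (Filter) filters.get(nodeTypeName);
            if (f == null) {
                f = new QueryFilter(
                        new TermQuery(new Term("NODETYPE", nodeTypeName)));
                filters.put(nodeTypeName, f);
            }
            return f;
        }
    }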

> > I also noticed that a new IndexSearcher was created for each query
> > processed - is this breaking the Lucene cache of filters?
>
> No, lucene uses the index reader as the key for its caches. e.g. in
> QueryFilter.bits():
>
>      synchronized (cache) {  // check cache
>        BitSet cached = (BitSet) cache.get(reader);
>        if (cached != null) {
>          return cached;
>        }
>      }
>
> > This is what I was hoping for, although I am wondering if the new
> > IndexSearcher that is created for each query execution is eliminating
> > the filter cache?
>
> afaics all filter caches are tied to the index reader and not the
> searcher instance.
>
> > I needed to "modify" the query tree so that the parts I "optimized"
> > (i.e., removed) wouldn't create additional Lucene query terms.  It
> > seems that having the visitor function return null effectively removes
> > any terms that may have been created by that part of the visitor - in
> > LuceneQueryBuilder.  Is this correct?
>
> I think writing your own customized LuceneQueryBuilder is easier than
> modifying the query tree.



Ok, I will give that a try.


> > In my explorations, I have noticed that a range query, e.g., (myDate >
> > startDate AND myDate < endDate), seems to be translated into two Lucene
> > range queries that are ANDed together, which looks like ((startDate <
> > myDate < MAX_DATE) AND (MIN_DATE < myDate < endDate)).  I am guessing
> > that Lucene calculates each of the two range queries in isolation, and
> > then ANDs the results together - in a sense forcing a walk of the
> > entire document set.
>
> that's correct.
>
> > Just from the look of it, it seems inefficient, although I am not sure
> > how much better it would be to translate the query to a single range
> > query - this transformation would also require a little analysis on the
> > Query Syntax Tree - or a post optimization on the Lucene query.
>
> yes, this would be very useful and probably improve the performance
> significantly for certain cases. the LuceneQueryBuilder should probably
> do this optimization by analyzing the query tree.



I will see if I can puzzle out a solution to this scenario, and modify the
LuceneQueryBuilder code.
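
As a starting point, a post optimization on the Lucene query might look
something like the sketch below.  It assumes both range queries carry
explicit bounds (as in the MIN_DATE/MAX_DATE pattern above) and that the
field's terms sort lexicographically, the way date strings do; a real
version would also have to handle open-ended ranges.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RangeQuery;

    public class RangeQueryMerger {

        // Collapses ((startDate < myDate < MAX_DATE) AND
        // (MIN_DATE < myDate < endDate)) into a single
        // (startDate < myDate < endDate) range query.
        public static Query merge(BooleanQuery bq) {
            BooleanClause[] c = bq.getClauses();
            if (c.length == 2
                    && c[0].getOccur() == BooleanClause.Occur.MUST
                    && c[1].getOccur() == BooleanClause.Occur.MUST
                    && c[0].getQuery() instanceof RangeQuery
                    && c[1].getQuery() instanceof RangeQuery) {
                RangeQuery a = (RangeQuery) c[0].getQuery();
                RangeQuery b = (RangeQuery) c[1].getQuery();
                // assumes both bounds are present on both queries
                if (a.getLowerTerm().field().equals(b.getLowerTerm().field())
                        && a.isInclusive() == b.isInclusive()) {
                    // keep the tighter bound on each side
                    Term lo = a.getLowerTerm().text()
                            .compareTo(b.getLowerTerm().text()) >= 0
                            ? a.getLowerTerm() : b.getLowerTerm();
                    Term hi = a.getUpperTerm().text()
                            .compareTo(b.getUpperTerm().text()) <= 0
                            ? a.getUpperTerm() : b.getUpperTerm();
                    return new RangeQuery(lo, hi, a.isInclusive());
                }
            }
            return bq;
        }
    }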


> > Finally, some ideas on indexing.  It seems that there are two or
> > perhaps three choices that could be made for improving indexing:
> >
> > 1) Augment the already existing Lucene index with additional
> > information to speed certain types of queries - e.g., range queries.
> > For example, it might be possible to index each of the "bytes" of a
> > date or long and optimize the query using this additional information.
>
> Can you please elaborate on how this would work?



I think I was again focusing on range queries and on giving Lucene some way
of filtering out subsets of the document set, so that the whole document set
wouldn't have to be walked.  For a date range query the from and to dates
would most likely share some set of most significant bytes - those bytes
could be passed to Lucene as a direct match, thereby reducing the subset of
the collection that would be walked.  If the range query issue is fixed,
this "optimization" would be unnecessary.  Nevertheless, I still wonder if
there is additional information that could be stored in Lucene to augment
the index and improve query processing.
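
To make the idea a bit more concrete, here is one possible shape for it:
index coarser-resolution terms next to the full date term, so a cheap exact
match can narrow the candidate set before the range query runs.  The field
names are made up for the example.

    import java.util.Date;

    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class DateFields {

        // Indexes the full-resolution date plus a coarse "most significant
        // bytes" term. A range that falls inside one month could then AND
        // a cheap (and filter-cacheable) TermQuery on "myDate.month" with
        // the fine-grained RangeQuery on "myDate", so the range only has
        // to discriminate among that month's documents.
        public static Document create(Date myDate) {
            Document doc = new Document();
            doc.add(new Field("myDate",
                    DateTools.dateToString(myDate,
                            DateTools.Resolution.MILLISECOND),
                    Field.Store.NO, Field.Index.UN_TOKENIZED));
            doc.add(new Field("myDate.month",
                    DateTools.dateToString(myDate,
                            DateTools.Resolution.MONTH),
                    Field.Store.NO, Field.Index.UN_TOKENIZED));
            return doc;
        }
    }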


> > 2) Create an external indexing structure for nodetypes and fields that
> > would mirror similar structures in a database.  Again, this data could
> > be used to optimize range queries as well as "sorted" results.
>
> I'm not sure how you would connect those two structures because lucene
> uses ephemeral document numbers:
>
> "IndexReader: For efficiency, in this API documents are often referred
> to via document numbers, non-negative integers which each name a unique
> document in the index. These document numbers are ephemeral--they may
> change as documents are added to and deleted from an index. Clients
> should thus not rely on a given document having the same number between
> sessions."


In this case I was considering using the node UUID as the cross-index join
parameter.  Still, there is the problem of combining the results from two
different indexes.
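
Since the document numbers are only valid per reader anyway, the join could
resolve UUIDs from the external index into a BitSet against the current
IndexReader - roughly like this, with "UUID" standing in for the real field
name:

    import java.io.IOException;
    import java.util.BitSet;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    public class UuidJoin {

        // Maps node UUIDs (from the external index) to document numbers
        // that are only valid for this reader, sidestepping the fact that
        // lucene's doc numbers change between sessions.
        public static BitSet toBits(IndexReader reader, List uuids)
                throws IOException {
            BitSet bits = new BitSet(reader.maxDoc());
            TermDocs td = reader.termDocs();
            try {
                for (Iterator it = uuids.iterator(); it.hasNext();) {
                    td.seek(new Term("UUID", (String) it.next()));
                    while (td.next()) {
                        bits.set(td.doc());
                    }
                }
            } finally {
                td.close();
            }
            return bits;
        }
    }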

> > 3) Use the database to provide the indexing structures.
>
> To me this seems to be a very interesting option, though it requires
> considerable effort.



Yes, I agree, this is an interesting option, and it does seem that it would
take a fair amount of effort.  Your comments on the user list in this same
thread seem like a start to the thought process needed.  I am not very
familiar with the details of the persistence manager (PM), although I do
think that bringing data storage and indexing together will help improve
query processing speed, as well as help with some of the data integrity
issues that have been discussed in other threads.

Over the weekend, I will see if I can come up with a solution to the range
query issue discussed above.

-Dave
