Message-ID: <4f95e0110703021130p47c7353am5722aba05970498@mail.gmail.com>
Date: Fri, 2 Mar 2007 11:30:28 -0800
From: "David Johnson"
To: dev@jackrabbit.apache.org
Subject: Re: Query Performance and Optimization
In-Reply-To: <510143ac0703020158s4f600935hbb094eb13cd4102c@mail.gmail.com>
References: <4f95e0110702272149l3b409fd4h936a381868f5fbc9@mail.gmail.com> <510143ac0703020158s4f600935hbb094eb13cd4102c@mail.gmail.com>

Hi Jukka,

Thanks for the reply. Yes, I am using Jackrabbit 1.2.x, and I am not seeing a dramatic difference between 1.1.x and 1.2.x, although I have not done a direct comparison between the two with the same query suite. It looks like adding ordering and/or large range queries can significantly increase the query time.

I would be very interested in working on or looking into optimization strategies and experiments. While I can puzzle out the code and the structure of the query syntax tree, any pointers or documentation would be very much appreciated.

In the case of order by queries and large range queries, it looks like significant time is spent in the org.apache.jackrabbit.core.query.lucene.SharedFieldSortComparator class, and that the result set is being walked to make sure that each result matches the query parameters, and probably also for sorting purposes.

In the best case it would be good to have a pre-sorted index so that sorted results could be returned much more quickly. Similarly, the ability to define indexes on node properties could be used to speed up queries. Both of these types of indexes could be modeled, and probably implemented, along the lines of well-known database indexing schemes.

Finally, there are also optimizations, more tailored to JSR-170 needs, that could be implemented with Lucene Filters, specifically for node type query constraints. For example, for "select * from my:mixin" a Filter could limit the search and its results to only those nodes that fulfill the node type constraint. These filters could be defined for each custom node type in the system and pre-calculated. I have not worked with Lucene Filters before, so I am not familiar with their speed; it would be interesting to get input from someone who has more experience with the run-time behavior of Filters. Nevertheless, the Lucene docs mention Filters as a way to get around large range queries that inevitably break the 1024 Lucene query term limit.

Further information on the query syntax tree that is used by the LuceneQueryBuilder, and on "safe" ways to modify it, would be appreciated. I would rather modify only the query syntax tree and continue to use the LuceneQueryBuilder for most of the query processing.

-Dave

On 3/2/07, Jukka Zitting wrote:
>
> Hi,
>
> On 2/28/07, David Johnson wrote:
> > "select * from Column where jcr:path like 'Gossip/ColumnName/Columns/%' and
> > status <> 'hidden' order by publishDate desc" takes 500 ms to execute - this
> > is just the execution time, I am not actually using or accessing the
> > NodeIterator.
>
> Are you using Jackrabbit 1.2.x? Jackrabbit 1.2 uses lazy loading of
> query results, which should considerably reduce query execution time
> by moving the effort to the resulting Node- or RowIterator.
>
> In general my rule of thumb so far has been to use the query feature
> when you want a narrow selection of nodes from a large source set, and
> to use explicit traversal with filtering when the expected result set
> includes a considerable percentage of the source set. Optimally the
> query feature should in all cases be at least equal to traversal speed
> plus a small constant query parsing and setup overhead. I don't think
> we are there yet.
>
> > Digging into the internals of Jackrabbit, we have noticed that there is an
> > implementation of RangeQuery that essentially walks the results if the # of
> > query terms is greater than what Lucene can handle. Reading the Lucene
> > documentation, it looks like Filters are the recommended method of
> > implementing "large" range queries, and also seem like a natural for
> > matching node types - i.e., select * from Column
>
> I'm not too familiar with Lucene details to comment on whether Filters
> would cover everything we need. It would be great if you're interested
> in pursuing such alternatives!
>
> > Is there any ongoing work on query optimization and performance. We would
> > be very interested in such work, including offering any help that we can.
>
> Not apart from the recent lazy loading improvements. Any help would be
> much appreciated.
>
> BR,
>
> Jukka Zitting
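
A rough illustration of the Filter idea discussed in this thread, as a minimal sketch in plain Java. It is not Jackrabbit or Lucene code: java.util.BitSet stands in for the bit set that a Lucene Filter produces, and the FilterSketch class, its methods, the document numbers, and the sample dates are all hypothetical. The range filter ORs together the documents of every term inside the range, so no per-term query clauses are built, and a precomputed node type bit set is then intersected with the result.

```java
import java.util.BitSet;
import java.util.TreeMap;

/** Hypothetical sketch of bit-set filtering; not Jackrabbit or Lucene code. */
class FilterSketch {
    // Sorted term dictionary for one property: term value -> documents containing it.
    private final TreeMap<String, BitSet> terms = new TreeMap<>();
    private final int maxDoc;

    FilterSketch(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    /** Records that document number doc has the given property value. */
    void add(int doc, String value) {
        terms.computeIfAbsent(value, v -> new BitSet(maxDoc)).set(doc);
    }

    /**
     * Range filter: walks the sorted terms in [lower, upper] and ORs their
     * document sets together. The number of terms in the range does not
     * matter, because no query clause is created per term.
     */
    BitSet range(String lower, String upper) {
        BitSet result = new BitSet(maxDoc);
        for (BitSet docs : terms.subMap(lower, true, upper, true).values()) {
            result.or(docs);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] publishDates = {
            "2007-01-05", "2007-02-10", "2007-02-20",
            "2007-03-01", "2006-12-31", "2007-02-15"
        };
        FilterSketch publishDate = new FilterSketch(publishDates.length);
        for (int doc = 0; doc < publishDates.length; doc++) {
            publishDate.add(doc, publishDates[doc]);
        }

        // Precomputed node type filter (assumed): documents 1 to 4 are of the wanted type.
        BitSet nodeType = new BitSet(publishDates.length);
        nodeType.set(1, 5); // end index is exclusive

        // Documents published in February 2007: 1, 2 and 5.
        BitSet hits = publishDate.range("2007-02-01", "2007-02-28");
        // Keep only the documents that also satisfy the node type constraint.
        hits.and(nodeType);

        System.out.println(hits); // prints {1, 2}
    }
}
```

Whether this pays off in practice would have to be measured against a real index; the sketch only shows that both constraints reduce to cheap bit set operations once the sets exist.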