Subject: RE: Re: Search performance : MultiIndex
Date: Mon, 29 Oct 2007 11:42:18 +0100
From: "Ard Schrijvers"
Reply-To: dev@jackrabbit.apache.org

> Christoph Kiehl wrote:
> But as you mentioned in your previous mail there are some
> problematic queries which are way too slow like ChildAxisQuery
> or DescendantSelfAxisQuery.
> All queries that need to read
> lucene documents instead of just using a query get pretty
> slow with large repositories. But I didn't see a way yet how
> to substantially improve performance while using lucene. I
> even thought of using some other kind of indexing since lucene
...
> Internally we use a specific mixin for our documents as a
> workaround. This way I can avoid ChildAxisQueries and the
> like. I just query for "//element(*, foo-mix:document)[...]"
> for example. But that is just a dirty workaround.

This argument holds for my solution as well :-(

> I would really like to find a solution to those problems.
> Maybe we should use some additional kind of index for
> resolving parent-child relations. Do you have any ideas yet
> how to improve performance in those areas?

AFAICS, if we want to solve it within lucene by querying, we will have a trade-off between fast searching and fast moving of nodes (I'll get back to this one).

Currently, we are building a layer on top of JackRabbit that, amongst many other things, at least needs to be able to:

1) port legacy code which had slide as repository
2) show all documents/nodes through faceted navigation

Since we have quite a few large projects running with slide as repository, and since we use a custom slide/lucene index to be able to search fast, I need some queries in JackRabbit to be much faster than currently possible. Obviously, since (2) must be implemented, almost every call to JackRabbit will be a search.

A very basic search that we run hundreds of times for legacy projects with slide would be:

/documents/en/news//*[@modificationDate] order by @modificationDate

Typically, a news folder contains tens of thousands of items, and this query is not feasible with the current JackRabbit impl (at least, my experience is that for > 10.000 docs this query takes multiple seconds, while I need the result in < 50ms; 50 is really the max IMO).
Now, for some queries that I control exactly (so I know I won't have queries like /documents/en[1]/news[1] or documents/en[@myprop]/news or documents/*/news, but only queries that look like /nodename/nodename/nodename/**[......]), I chose to translate the initial part to something like:

TermQuery(new Term(FieldNames.INITIAL_PATH, path))

where for example path='/documents/en/news'

Obviously, this only works when I index a node's path in some lucene field. So a node with path /documents/en/news/2007/10/14/item.xml would have a lucene Field that contains the terms

'/documents/en/news/2007/10/14/item.xml'
'/documents/en/news/2007/10/14'
'/documents/en/news/2007/10'
'/documents/en/news/2007'
'/documents/en/news'
'/documents/en'
'/documents'

Obviously, this results in a very fast, simple lucene search for 'give me all nodes starting with path x', because it is just 1 simple TermQuery. The major disadvantage is that it is now very costly to move a node, because this requires re-indexing the tree below that node. Also, I can only use it for queries with a basic 'start-path', though it might be enhanced to support '*' and /nodename[@someprop].

Bottom line, I haven't found the holy grail either, but at least I have responses within ms for hundreds of thousands of nodes :-) I am not sure if there is a solution for fast searching for DescendantSelfAxisQuery and at the same time fast moving of nodes. I chose to be able to search fast, and hope people won't be moving the node directly under the root too many times :-)

Regards Ard

>
> Cheers,
> Christoph
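
For illustration, the path-term indexing described in the reply above could be sketched roughly as follows. This is plain Java and deliberately does not use the Lucene or JackRabbit APIs: the class AncestorPathTerms, the methods termsFor and index, and the in-memory map that stands in for the lucene field are all hypothetical names made up for this example, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: an in-memory stand-in for a lucene field that
// holds a node's own path plus all of its ancestor paths as terms.
final class AncestorPathTerms {

    // Returns the path itself followed by each ancestor path,
    // stopping before the root "/".
    static List<String> termsFor(String path) {
        List<String> terms = new ArrayList<>();
        int end = path.length();
        while (end > 0) {
            terms.add(path.substring(0, end));
            // Cut at the previous '/' to get the parent path.
            end = path.lastIndexOf('/', end - 1);
        }
        return terms;
    }

    // Builds term -> node paths, a toy version of an inverted index.
    static Map<String, Set<String>> index(List<String> nodePaths) {
        Map<String, Set<String>> postings = new HashMap<>();
        for (String node : nodePaths) {
            for (String term : termsFor(node)) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(node);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> postings = index(List.of(
                "/documents/en/news/2007/10/14/item.xml",
                "/documents/en/about.xml"));
        // One map lookup plays the role of the single TermQuery.
        System.out.println(termsFor("/documents/en/news/2007/10/14/item.xml"));
        System.out.println(postings.get("/documents/en/news"));
    }
}
```

The lookup for a start path is a single key access, which mirrors why one TermQuery is enough. The cost shows up on a move: every node below the moved one has a different term list afterwards, so all of them would have to be re-indexed.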