Mailing-List: contact dev-help@jackrabbit.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@jackrabbit.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Subject: RE:  Re: Search performance  : MultiIndex
Date: Mon, 29 Oct 2007 11:42:18 +0100
Message-ID: <F8E386B54CE3E6408F3A32ABB9A7908A347D09@hai02.hippointern.lan>
In-Reply-To: <fg49ph$gg8$1@ger.gmane.org>
Thread-Topic: Re: Search performance  : MultiIndex
Thread-Index: AcgaD8hysLCz/+UHQquVWb9eJSW17gAA1j3Q
References: <F8E386B54CE3E6408F3A32ABB9A7908A347CEC@hai02.hippointern.lan>
 <F8E386B54CE3E6408F3A32ABB9A7908A347CFD@hai02.hippointern.lan>
 <fg49ph$gg8$1@ger.gmane.org>
From: "Ard Schrijvers" <a.schrijvers@hippo.nl>
To: <dev@jackrabbit.apache.org>


> Christoph Kiehl wrote:
> But as you mentioned in your previous mail there are some
> problematic queries which are way too slow like ChildAxisQuery
> or DescendantSelfAxisQuery. All queries that need to read
> lucene documents instead of just using a query get pretty
> slow with large repositories. But I didn't see a way yet how
> to substantially improve performance while using lucene. I
> even thought of using some other kind of indexing since lucene ...
> Internally we use a specific mixin for our documents as a
> workaround. This way I can avoid ChildAxisQueries and the
> like. I just query for "//element(*, foo-mix:document)[...]"
> for example. But that is just a dirty workaround.

This argument holds for my solution as well :-(

> I would really like to find a solution to those problems.
> Maybe we should use some additional kind of index for
> resolving parent-child relations. Do you have any ideas yet
> how to improve performance in those areas?

AFAICS, if we want to solve this within Lucene by querying, we will
have a trade-off between fast searching and fast moving of nodes
(I'll come back to this below).

Currently, we are building a layer on top of Jackrabbit that, amongst
many other things, at least needs to be able to:

1) port legacy code which had Slide as repository
2) show all documents/nodes through faceted navigation

Since we have quite a few large projects running with Slide as
repository, and since we use a custom Slide/Lucene index to be able to
search fast, I need some queries in Jackrabbit to be much faster than
currently possible. Obviously, since (2) must be implemented, almost
every call to Jackrabbit will be a search. A very basic search that
occurs hundreds of times in our legacy Slide projects would be:

/documents/en/news//*[@modificationDate] order by @modificationDate

Typically, a news folder contains tens of thousands of items, and this
query is not feasible with the current Jackrabbit impl (at least, my
experience is that for > 10,000 docs this query takes multiple seconds,
while I need the result in < 50 ms; 50 is really the max IMO).

Now, for some queries that I control exactly, I know I won't have
queries like /documents/en[1]/news[1] or documents/en[@myprop]/news or
documents/*/news, but only queries that look like
/nodename/nodename/nodename/**[......]. For those, I chose to translate
the initial part to something like:

TermQuery(new Term(FieldNames.INITIAL_PATH, path)) where for example
path='/documents/en/news'

Obviously, this only works when I index a node's path in some Lucene
field. So a node with path /documents/en/news/2007/10/14/item.xml

would have a Lucene Field that contains the terms

'/documents/en/news/2007/10/14/item.xml'
'/documents/en/news/2007/10/14'
'/documents/en/news/2007/10'
'/documents/en/news/2007'
'/documents/en/news'
'/documents/en'
'/documents'
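
As a rough sketch of how such a term list could be produced (plain Java,
no Lucene involved; the class and method names are made up for
illustration and are not part of Jackrabbit), one can strip the last
path segment repeatedly until only the top-level node is left:

```java
import java.util.ArrayList;
import java.util.List;

public class AncestorPaths {

    // Returns the path itself followed by each of its ancestor paths,
    // deepest first. The root "/" is not emitted, matching the list above.
    public static List<String> ancestorPaths(String path) {
        List<String> terms = new ArrayList<>();
        String current = path;
        while (current.length() > 1) {
            terms.add(current);
            int slash = current.lastIndexOf('/');
            if (slash <= 0) {
                break; // reached the top-level node, nothing left to strip
            }
            current = current.substring(0, slash);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints one term per line for the example node from this mail.
        for (String term : ancestorPaths("/documents/en/news/2007/10/14/item.xml")) {
            System.out.println(term);
        }
    }
}
```

Each returned string would then be added as an untokenized term to the
same field of the node's Lucene document.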

Obviously, this results in a very fast, simple Lucene search for 'give
me all nodes starting with path x', because it is just one simple
TermQuery. The major disadvantage is that it is now very costly to move
a node, because this requires re-indexing the whole tree below that
node. Also, I can only use it for queries with a basic 'start-path',
though it might be enhanced to support '*' and /nodename[@someprop].
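
To make the trade-off concrete, here is a toy in-memory stand-in for the
index (a plain Java map instead of Lucene; all names are hypothetical
and this is not how Jackrabbit stores anything). Searching a start path
is one map lookup, while moving a node touches every node below it:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class PathIndexSketch {

    // term (a path prefix) -> paths of all nodes indexed under that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index a node under its own path and under every ancestor path.
    public void index(String nodePath) {
        String current = nodePath;
        while (current.length() > 1) {
            postings.computeIfAbsent(current, k -> new TreeSet<>()).add(nodePath);
            int slash = current.lastIndexOf('/');
            if (slash <= 0) {
                break;
            }
            current = current.substring(0, slash);
        }
    }

    // Drop a node from every posting list it appears in.
    public void remove(String nodePath) {
        for (Set<String> nodes : postings.values()) {
            nodes.remove(nodePath);
        }
    }

    // "All nodes starting with path x": a single lookup, like one TermQuery.
    public Set<String> search(String startPath) {
        return new TreeSet<>(postings.getOrDefault(startPath, Collections.emptySet()));
    }

    // Moving a node re-indexes each node in its subtree.
    // Returns how many nodes had to be re-indexed.
    public int move(String from, String to) {
        Set<String> affected = search(from);
        for (String oldPath : affected) {
            remove(oldPath);
            index(to + oldPath.substring(from.length()));
        }
        return affected.size();
    }

    public static void main(String[] args) {
        PathIndexSketch idx = new PathIndexSketch();
        idx.index("/documents/en/news/2007/a.xml");
        idx.index("/documents/en/news/2007/b.xml");
        System.out.println(idx.search("/documents/en/news"));
        System.out.println("re-indexed: " + idx.move("/documents/en/news", "/archive/news"));
    }
}
```

The count returned by move() grows with the size of the subtree, which
is why moving a node close to the root is the expensive case.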

Bottom line, I haven't found the holy grail either, but at least I have
responses within ms for hundreds of thousands of nodes :-) I am not sure
if there is a solution for fast searching with DescendantSelfAxisQuery
and at the same time fast moving of nodes. I chose to be able to search
fast, and hope people won't be moving the node directly under the root
too many times :-)

Regards Ard

>
> Cheers,
> Christoph
>
>