From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject RE: Re: Search performance : MultiIndex
Date Mon, 29 Oct 2007 10:42:18 GMT

> Christoph Kiehl wrote:
> But as you mentioned in your previous mail there are some 
> problematic queries which are way too slow like ChildAxisQuery 
> or DescendantSelfAxisQuery. All queries that need to read 
> lucene documents instead of just using a query get pretty 
> slow with large repositories. But I didn't see a way yet how 
> to substantially improve performance while using lucene. I 
> even thought of using some other kind of indexing since lucene ...
> Internally we use a specific mixin for our documents as a 
> workaround. This way I can avoid ChildAxisQueries and the 
> like. I just query for "//element(*, foo-mix:document)[...]" 
> for example. But that is just a dirty workaround.

This argument holds for my solution as well :-(

> I would really like to find a solution to those problems. 
> Maybe we should use some additional kind of index for 
> resolving parent-child relations. Do you have any ideas yet 
> how to improve performance in those areas?

AFAICS, if we want to solve it within lucene by querying, we will
have a trade-off between "fast searching" and "fast moving of nodes"
(I'll get back to this one).

Currently, we are building a layer on top of JackRabbit that amongst
many other things at least needs to be able to:

1) port legacy code which had slide as repository
2) show all documents/nodes through faceted navigation 

Since we have quite a few large projects running with slide as
repository, and since we use a custom slide/lucene index to be able to
search fast, I need some queries in JackRabbit to be much faster than
currently possible. Obviously, since (2) must be implemented, almost
every call to JackRabbit will be a search. A very basic search that
occurs hundreds of times in our legacy slide projects would be:

/documents/en/news//*[@modificationDate] order by @modificationDate

Typically, a news folder contains tens of thousands of items, and this
query is not feasible with the current JackRabbit impl (at least, my
experience is that for > 10,000 docs this query takes multiple seconds,
while I need the result in < 50 ms; 50 is really the max IMO).

Now, there are some queries that I control exactly, so I know I won't
get queries like /documents/en[1]/news[1] or
documents/en[@myprop]/news or documents/*/news, but only queries that
look like /nodename/nodename/nodename/**[......]. For those queries I
chose to translate the initial part to something like:

TermQuery(new Term(FieldNames.INITIAL_PATH, path)) where for example

Obviously, this only works when I index a node's path in some lucene
field. So a node with path /documents/en/news/2007/10/14/item.xml

would have lucene Field that contains the terms


Obviously, this results in a very fast, simple lucene search for 'give me
all nodes starting with path x', because it is just 1 simple TermQuery.
The major disadvantage is that it is now very costly to move a node,
because this requires re-indexing the tree below that node. Also, I can
only use it for queries with a basic 'start-path', though it might be
enhanced to support '*' and /nodename[@someprop]. 
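
To make the trade-off concrete, here is a minimal sketch in plain Java,
without Lucene. It assumes (the message does not spell this out) that the
path field holds one term per ancestor-or-self path of the node, so that
"all nodes under path x" becomes a single term lookup. The class
InitialPathIndex and its methods are hypothetical names used only for
illustration; they are not Jackrabbit or Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy in-memory stand-in for one index field holding path terms (illustrative only). */
class InitialPathIndex {
    // term (an ancestor-or-self path) -> paths of the nodes indexed under that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Assumed term scheme: "/a/b/c" yields "/a", "/a/b" and "/a/b/c". */
    static List<String> pathTerms(String nodePath) {
        List<String> terms = new ArrayList<>();
        int slash = nodePath.indexOf('/', 1);
        while (slash != -1) {
            terms.add(nodePath.substring(0, slash));
            slash = nodePath.indexOf('/', slash + 1);
        }
        terms.add(nodePath);
        return terms;
    }

    /** Index a node under every one of its path terms. */
    void add(String nodePath) {
        for (String term : pathTerms(nodePath)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(nodePath);
        }
    }

    /** Drop a node from every posting list it was indexed under. */
    void remove(String nodePath) {
        for (String term : pathTerms(nodePath)) {
            Set<String> nodes = postings.get(term);
            if (nodes != null) {
                nodes.remove(nodePath);
            }
        }
    }

    /** Counterpart of the single term query: one map lookup, no tree walk. */
    Set<String> nodesUnder(String startPath) {
        return new TreeSet<>(postings.getOrDefault(startPath, Collections.emptySet()));
    }

    /** Moving a subtree re-indexes every indexed node in it; returns how many were touched. */
    int move(String from, String to) {
        Set<String> affected = nodesUnder(from); // copy, so the index can be modified below
        for (String oldPath : affected) {
            remove(oldPath);
            add(to + oldPath.substring(from.length()));
        }
        return affected.size();
    }
}
```

In this sketch nodesUnder("/documents/en/news") costs the same no matter
how deep the subtree is, while move("/documents/en", ...) has to touch
every indexed node below /documents/en. That is the search-versus-move
trade-off described above.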

Bottom line, I haven't found the holy grail either, but at least I have
responses within ms for hundreds of thousands of nodes :-) I am not sure
if there is a solution for fast searching for DescendantSelfAxisQuery
and at the same time fast moving of nodes. I chose to be able to search
fast, and hope people won't be moving the nodes directly under the root
too many times :-) 

Regards Ard

> Cheers,
> Christoph

