jackrabbit-dev mailing list archives

From Marcel Reutegger <marcel.reuteg...@gmx.net>
Subject Re: Lucene index
Date Fri, 20 Apr 2007 07:30:52 GMT
Hi James,

James Hang wrote:
> After spending some time running Jackrabbit in debug mode, I noticed some
> peculiar behavior in the Lucene SearchIndex implementation.  
> When indexing of a Node occurs via the AbstractIndex.addDocument() method,
> the Lucene Document object being indexed seems to contain all the indexed
> fields, i.e. all the properties of the node, the extracted fulltext terms,
> etc.  
> However, during a search operation, on the call to
> SearchIndex.executeQuery(), the Document objects being returned from the
> search only contain some of the indexed fields.  In fact for all of the
> Document objects, only these 5 fields are present:
> _:UUID
> _:PROPERTIES[0] "3:versionHistory"
> _:PROPERTIES[1] "3:baseVersion"
> _:PROPERTIES[2] "3:predecessor"
> I know that Jackrabbit only really needs the _:UUID field so that it can
> look up the Node, so is it stripping out the other fields at some point? 

The other properties you see in the Lucene document are all reference 
properties. Those must also be *stored* in the index so that they can be 
resolved in a jcr:deref() statement.

> We've noticed that for large result sets (1000+ nodes), the performance can
> drag because each Node lookup requires at least one database query.  Since
> we are only interested in data contained in the Lucene index, it would be
> nice if we would get that data from the index and not have to go through the
> Jackrabbit PM at all.

This is because the index does not store the property values (in Lucene 
terminology) but only uses them to build an inverted index. Funnily enough, 
the values are actually present in the index, but you cannot get them using 
a simple call.
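
To illustrate the difference between indexed and stored, here is a toy model in plain Java. It is not Lucene code: the class ToyInvertedIndex and all of its methods are made up for this example, and a real Lucene index is organised differently in detail. It only shows why a value can be searchable without being cheaply retrievable per document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model (hypothetical, not Lucene) of indexed vs. stored fields. */
class ToyInvertedIndex {
    // term ("field:value") -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    // per-document field values; only fields flagged as stored end up here
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();

    void add(int docId, String field, String value, boolean store) {
        postings.computeIfAbsent(field + ":" + value, k -> new TreeSet<>()).add(docId);
        if (store) {
            stored.computeIfAbsent(docId, k -> new HashMap<>()).put(field, value);
        }
    }

    /** Cheap: a single lookup in the term dictionary. */
    Set<Integer> search(String field, String value) {
        Set<Integer> hits = postings.get(field + ":" + value);
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    /** Cheap, but returns only the fields that were stored. */
    Map<String, String> document(int docId) {
        Map<String, String> fields = stored.get(docId);
        return fields == null ? Collections.<String, String>emptyMap() : fields;
    }

    /** The expensive route: scan all terms to recover one document's values. */
    List<String> valuesByScanningTerms(int docId, String field) {
        List<String> values = new ArrayList<>();
        String prefix = field + ":";
        for (Map.Entry<String, Set<Integer>> e : postings.entrySet()) {
            if (e.getKey().startsWith(prefix) && e.getValue().contains(docId)) {
                values.add(e.getKey().substring(prefix.length()));
            }
        }
        return values;
    }
}
```

In this model a field added with store=false can be found by search(), but document() will not return it. The value is still sitting in the term dictionary, yet getting it back means walking the terms.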

In addition, the presence of the jcr:path property in the query result row 
forces us to use node information to resolve the path (be it from the 
persistence manager or from a cache within Jackrabbit). In Jackrabbit the path 
of a node is always calculated and never stored literally with it.
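
As a rough illustration of why a calculated path needs node information, the following plain Java sketch walks parent references up to the root. ToyPathResolver and NodeInfo are hypothetical names for this example and do not correspond to Jackrabbit classes; the lookups counter stands in for accesses to the persistence manager or a cache.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy model (hypothetical, not Jackrabbit code): a path is derived, not stored. */
class ToyPathResolver {
    /** Minimal node record: its name and the id of its parent (null for the root). */
    static final class NodeInfo {
        final String name;
        final String parentId;
        NodeInfo(String name, String parentId) {
            this.name = name;
            this.parentId = parentId;
        }
    }

    private final Map<String, NodeInfo> nodes = new HashMap<>();
    int lookups = 0; // counts node lookups, i.e. what a PM or cache would have to serve

    void put(String id, String name, String parentId) {
        nodes.put(id, new NodeInfo(name, parentId));
    }

    /** Builds the path by walking from the node up to the root. */
    String pathOf(String id) {
        Deque<String> names = new ArrayDeque<>();
        String current = id;
        while (current != null) {
            NodeInfo info = nodes.get(current);
            lookups++;
            if (info == null) {
                throw new IllegalArgumentException("unknown node: " + current);
            }
            if (info.parentId != null) {
                names.push(info.name); // push to the front so ancestors come first
            }
            current = info.parentId;
        }
        if (names.isEmpty()) {
            return "/";
        }
        StringBuilder path = new StringBuilder();
        for (String name : names) {
            path.append('/').append(name);
        }
        return path.toString();
    }
}
```

Each path costs one lookup per ancestor in this model, which is the kind of work the UUID alone cannot avoid.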

> Does anyone know if this is possible?

With the current implementation this is not possible, but it is feasible to 
implement it.

The required changes to Jackrabbit would be:

- Also store the values of all the other properties within the index. Because 
this makes the document instances retrieved from the index much heavier, we 
would have to move to Lucene 2.1, which supports lazy loading of document fields.
- A query result row would then use the values from the index, if available. 
Whether property values are stored in the index should be configurable.
- Calculate the values of the jcr:path column only when requested.
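
The changes listed above could be sketched roughly like this in plain Java. This is not the Jackrabbit implementation and uses neither its API nor Lucene's: ToyResultRow, pmLookup and pathResolver are names invented for the example, and the two counters exist only to make the accesses visible.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

/** Toy result row (hypothetical): prefers index values, resolves jcr:path lazily. */
class ToyResultRow {
    private final Map<String, String> storedFields;   // values stored in the index document
    private final Function<String, String> pmLookup;  // fallback: property name -> value via the PM
    private final Supplier<String> pathResolver;      // computes the path only when asked
    private String cachedPath;
    int pmAccesses = 0;        // how often the fallback was used
    int pathComputations = 0;  // how often the path was actually calculated

    ToyResultRow(Map<String, String> storedFields,
                 Function<String, String> pmLookup,
                 Supplier<String> pathResolver) {
        this.storedFields = storedFields;
        this.pmLookup = pmLookup;
        this.pathResolver = pathResolver;
    }

    String getValue(String column) {
        if ("jcr:path".equals(column)) {
            if (cachedPath == null) {
                cachedPath = pathResolver.get(); // calculated on first request only
                pathComputations++;
            }
            return cachedPath;
        }
        String fromIndex = storedFields.get(column);
        if (fromIndex != null) {
            return fromIndex; // served from the index, no PM access
        }
        pmAccesses++;
        return pmLookup.apply(column); // value not stored in the index
    }
}
```

A row whose requested columns are all stored in the index, and whose jcr:path is never asked for, would leave both counters at zero in this model.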

With those changes, at least the RowIterator result representation could work 
without a single access to the PM.

Ah, well. Since I've put all of that in an email, I may as well create a Jira issue ;)


