jackrabbit-dev mailing list archives

From James Hang <jh...@bea.com>
Subject Re: Lucene index
Date Fri, 20 Apr 2007 22:45:33 GMT

Thanks Marcel,

Storing the properties in the index at index time did the trick.  Thanks!

Definitely would love to see those enhancements you suggested.   Being able
to retrieve results from a search without having to go to the PM would be a
big performance boost for us.

James



Marcel Reutegger wrote:
> 
> Hi James,
> 
> James Hang wrote:
>> After spending some time running Jackrabbit in debug mode, I noticed some
>> peculiar behavior in the Lucene SearchIndex implementation.  
>> 
>> When indexing of a Node occurs via the AbstractIndex.addDocument() method,
>> the Lucene Document object being indexed seems to contain all the indexed
>> fields, i.e. all the properties of the node, the extracted fulltext terms,
>> etc.
>> 
>> However, during a search operation, on the call to
>> SearchIndex.executeQuery(), the Document objects being returned from the
>> search contain only some of the indexed fields.  In fact, for all of the
>> Document objects, only these 5 fields are present:
>> 
>> _:UUID
>> _:PARENT
>> _:PROPERTIES[0] "3:versionHistory"
>> _:PROPERTIES[1] "3:baseVersion"
>> _:PROPERTIES[2] "3:predecessor"
>> 
>> I know that Jackrabbit only really needs the _:UUID field so that it can
>> look up the Node, so is it stripping out the other fields at some point? 
> 
> The other properties you see in the lucene document are all reference 
> properties. Those must also be *stored* in the index to resolve them in a 
> jcr:deref() statement.
> 
>> We've noticed that for large result sets (1000+ nodes), performance can
>> drag because each Node lookup requires at least one database query.
>> Since we are only interested in data contained in the Lucene index, it
>> would be nice if we could get that data from the index and not have to go
>> through the Jackrabbit PM at all.
> 
> This is because the index does not store the property values (in Lucene
> terminology) but only uses them to create an inverted index. Funnily
> enough, the values are actually present in the index, but you cannot get
> them using a simple call.
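
To make the indexed-versus-stored distinction concrete, here is a minimal
sketch in plain Java. It is a toy model, not Lucene's or Jackrabbit's API:
the class ToyIndex and its methods are hypothetical names chosen for
illustration. Every field is searchable through the inverted index, but only
fields explicitly marked as stored come back when a document is retrieved.

```java
import java.util.*;

/** Toy model of indexed vs. stored fields (hypothetical; not Lucene's API). */
class ToyIndex {
    // Inverted index: "field:value" term -> ids of documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Stored fields: document id -> field values kept for retrieval.
    private final List<Map<String, String>> stored = new ArrayList<>();

    /** Adds a document. All fields become searchable; only storedFields are kept. */
    int add(Map<String, String> fields, Set<String> storedFields) {
        int docId = stored.size();
        Map<String, String> kept = new TreeMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String term = e.getKey() + ":" + e.getValue();
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            if (storedFields.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        stored.add(kept);
        return docId;
    }

    /** Returns the ids of documents whose field has exactly this value. */
    Set<Integer> search(String field, String value) {
        return postings.getOrDefault(field + ":" + value, Collections.emptySet());
    }

    /** Returns only the stored fields of a document, as a search hit would. */
    Map<String, String> document(int docId) {
        return stored.get(docId);
    }
}
```

In this model a document can be found by a field that is absent from the
retrieved document, which matches the behavior described above.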
> 
> In addition, the presence of the jcr:path property in the query result row
> forces us to use node information to resolve the path (be it from the
> persistence manager or from a cache within Jackrabbit). In Jackrabbit, the
> path of a node is always calculated and not stored literally with it.
> 
>> Does anyone know if this is possible?
> 
> With the current implementation this is not possible, but it is feasible
> to implement it.
> 
> The required changes to Jackrabbit would be:
> 
> - Also store the values of all the other properties within the index.
>   Because this makes the document instances retrieved from the index much
>   heavier, we would have to move to Lucene 2.1, which supports lazy
>   loading of document fields.
> - A query result row would then use the values from the index, if
>   available. Whether property values are stored in the index should be
>   configurable.
> - Calculate the values of the jcr:path column only when requested.
> 
> With those changes, at least the RowIterator result representation could
> work without a single access to the PM.
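
The proposed row behavior could be sketched roughly as follows. This is an
illustration in plain Java under stated assumptions, not Jackrabbit code:
IndexBackedRow, pmLookup and pathResolver are hypothetical names. The row
answers from values stored in the index when they exist, falls back to a
persistence manager lookup otherwise, and computes jcr:path only on first
request.

```java
import java.util.*;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical result row that prefers index-stored values over PM access. */
class IndexBackedRow {
    private final Map<String, String> storedValues;   // values stored in the index
    private final Function<String, String> pmLookup;  // fallback: load from the PM
    private final Supplier<String> pathResolver;      // computes the node path
    private String path;                              // cached after first request

    IndexBackedRow(Map<String, String> storedValues,
                   Function<String, String> pmLookup,
                   Supplier<String> pathResolver) {
        this.storedValues = storedValues;
        this.pmLookup = pmLookup;
        this.pathResolver = pathResolver;
    }

    /** Returns the value of a column, touching the PM only when necessary. */
    String getValue(String column) {
        if ("jcr:path".equals(column)) {
            if (path == null) {
                path = pathResolver.get(); // calculated only when requested
            }
            return path;
        }
        String value = storedValues.get(column);
        return value != null ? value : pmLookup.apply(column);
    }
}
```

With this shape, a caller that only reads stored columns never triggers the
fallback or the path calculation.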
> 
> Ah, well. Since I've put all that in an email, I may as well create a Jira
> issue ;)
> 
> http://issues.apache.org/jira/browse/JCR-855
> 
> regards
>   marcel
> 
> 

-- 
View this message in context: http://www.nabble.com/Lucene-index-tf3604049.html#a10111578
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.

