jackrabbit-users mailing list archives

From Marcel Reutegger <marcel.reuteg...@gmx.net>
Subject Re: Some questions on Jackrabbit performance with large data sets
Date Tue, 06 Mar 2007 13:08:21 GMT
Sriram Narayanan wrote:
> 1. What are the lessons learned by various community members on using 
> Derby ?

What I've heard from others experimenting with different setups is that Derby over a 
network is quite slow. I didn't run any tests myself, but Derby seems to be the best 
choice when used in embedded mode; if you need a standalone database server, you 
should consider another database.
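For reference, a sketch of what an embedded-Derby setup looks like in a Jackrabbit persistence manager configuration (workspace.xml / repository.xml). The parameter values are illustrative, not recommendations:

```xml
<PersistenceManager class="org.apache.jackrabbit.core.persistence.db.DerbyPersistenceManager">
  <!-- embedded mode: Derby runs inside the repository's JVM, no network hop -->
  <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
  <param name="schemaObjectPrefix" value="${wsp.name}_"/>
</PersistenceManager>
```

With a standalone server you would instead point the url at `jdbc:derby://host:port/...` via the network client driver, which is where the slowdown mentioned above comes in.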

> 2. Would you recommend using Oracle to using Derby for such large
> amounts of data ?

From what I've seen so far, both scale well with large amounts of data.

> 3. Are there ways to speed up lucene searches ?

1) There are configuration parameters that affect query performance:
	a) respectDocumentOrder
	b) resultFetchSize
    See [1] for details on those parameters.

2) Some query features are more expensive than others, so you may be able to 
speed up searches by rephrasing your query statements.
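As a sketch of point 1), those two parameters go on the SearchIndex element in each workspace's workspace.xml; the values shown here are illustrative, not tuned recommendations:

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- skip re-sorting results into document order (cheaper for large result sets) -->
  <param name="respectDocumentOrder" value="false"/>
  <!-- how many result nodes to resolve per fetch -->
  <param name="resultFetchSize" value="50"/>
</SearchIndex>
```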
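To give one hypothetical example of point 2): an XPath query anchored to a known subtree is usually cheaper than a descendant search starting at the root, because fewer candidate nodes have to be resolved and filtered. The path /content/customerA and the nt:file node type below are made-up names for illustration:

```java
// Two XPath statements that could return the same nodes for one customer.
// BROAD searches the whole workspace; SCOPED narrows the search to one subtree.
public class QueryRephrase {
    // Expensive: descendant-or-self from the root considers every nt:file node.
    static final String BROAD =
        "//element(*, nt:file)[jcr:contains(., 'invoice')]";
    // Cheaper: anchoring the path limits candidates to one customer's tree.
    static final String SCOPED =
        "/jcr:root/content/customerA//element(*, nt:file)[jcr:contains(., 'invoice')]";

    public static void main(String[] args) {
        System.out.println(BROAD);
        System.out.println(SCOPED);
    }
}
```

Both statements would be executed the same way through the JCR QueryManager; only the statement text changes.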

> 4. Are lucene searches affected by such large indexes ?

Access rights are checked at the very end of query execution, which will probably 
affect your queries negatively. Because each customer's access rights are limited to 
its own subtree, most query result nodes are rejected by access control in that last 
stage of query execution. If we assume 250 customers, each with access only to its 
own tree, then on average 99.6% of the query result nodes are rejected by access 
control.
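The 99.6% figure is just 1 - 1/250; a quick check:

```java
// Worked example of the rejection rate above: with 250 customers and
// per-customer access control, a session keeps on average only 1 of
// every 250 result nodes, so the rest are filtered out.
public class RejectionRate {
    public static void main(String[] args) {
        int customers = 250;
        double rejectedPct = (1.0 - 1.0 / customers) * 100;
        System.out.printf("%.1f%% rejected%n", rejectedPct); // prints 99.6% rejected
    }
}
```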

> 5. Would it be better for us to split the repository into smaller ones
> and to then have smaller lucene indexes ?

If each customer has access only to its own tree, I would definitely create one 
workspace per customer. This will result in:

- smaller indexes
- faster queries, because only a small number of intermediate result nodes are 
rejected by access control
- the option to configure an idle time after which unused workspaces are shut 
down (this saves resources)
- better concurrency, because an update in one workspace does not affect other 
workspaces
- per-customer database backups

> 6. For such large data, would Embedded Derby or Network derby be
> suitable to the task ?

As mentioned before, I think Derby does its job best when it runs embedded.


