jackrabbit-dev mailing list archives

From Marcel Reutegger <marcel.reuteg...@gmx.net>
Subject Re: large repository
Date Tue, 25 Oct 2005 07:08:00 GMT
Hi John,

js@neasys.com wrote:
> I have tried jcr/jackrabbit and like it.
> Next I would like to push jackrabbit to its limit:
> load in as many items as possible. I would appreciate help on
> a few configuration/tuning issues:
> (1) which persistent manager to use?

in a recent test I imported over a million wikipedia articles which 
resulted in about 6 million items. no versioning, btw.

my configuration is:
dell latitude d505
db-persistence using derby
256m heap

at the beginning the time to add an article was about 5ms.
towards the end of the load the time to add an article was stable at 
about 50ms.

some other figures:
db size: 2 GB
index size: 300 MB

> (2) what parameters to tune?

I can give you some advice on configuring the index: the default config 
will cause lucene to create segments of 100 nodes, which will be merged 
as soon as 10 segments exist. when doing a bulk load you should set 
the parameter minMergeDocs to a higher value, e.g. 1000. this will create 
segments of 1000 nodes, and will be more efficient.
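
as a rough sketch, the setting would go into the SearchIndex element of the 
workspace configuration in repository.xml. the class name and the path value 
below are assumptions based on the sample configuration, so check them 
against the repository.xml that ships with your version:

```xml
<!-- sketch only: SearchIndex element inside the Workspace element -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <!-- directory where the lucene index files are stored (assumed default) -->
  <param name="path" value="${wsp.home}/index"/>
  <!-- raised from the default of 100 for the bulk load -->
  <param name="minMergeDocs" value="1000"/>
</SearchIndex>
```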

> (3) will multiple workspaces help?

IMO this might help, if you run into scalability issues with the 
persistence manager you are using.

> (4) any other things to watch for?

use separate disks for the index and workspace data.
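
one way to do this, as far as I can tell, is to point the path parameter of 
the SearchIndex element at a directory on the second disk. the mount point 
below is purely hypothetical, and each workspace needs its own index 
directory:

```xml
<!-- sketch only: index on a different disk than the workspace data -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <!-- /mnt/disk2 is a hypothetical mount point of the second disk -->
  <param name="path" value="/mnt/disk2/jackrabbit/index"/>
</SearchIndex>
```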

> My host has 4GB ram and a few TB diskspace.
> Also, any doc describing all possible elements in repository.xml?

the sample repository.xml file in src/conf contains an inline dtd that 
contains some documentation.

> And if SearchIndex can be turned off?

yes, this is possible. you simply omit the SearchIndex element in the 
configuration. though, I would be very interested to see how well the 
index works with your data.

