jackrabbit-users mailing list archives

From Lorenzo Dini <Lorenzo.D...@cern.ch>
Subject Some performance questions about Jackrabbit
Date Fri, 01 Feb 2008 12:46:27 GMT
Hi Everybody,

I have been using Jackrabbit for almost 1 year now and I have some (a 
lot of :-) questions about the cost of operations performed on the 
repository. I am trying to optimize performance, and knowing what 
operations are really done underneath helps with the tuning :-)

I hope somebody will answer, and I hope these questions will also help 
other people use JR in the best way.

I am sure these questions apply to any deployment model, but I am just 
describing my case.

------------------------------------------------------------------------
Basically I have a Tomcat WAR with Axis and REST web services as 
front-end that use the in-process Jackrabbit to read and store lots of 
GB of data.

Currently I am using Jackrabbit 1.3.1 directly embedded in a custom 
application that runs on Tomcat.

I am using the LocalFileSystem and still the SimpleDbPersistenceManager 
(with MySQL on the same machine). Since I have lots of binaries in the 
repository, it was impossible to move to the BundlePersistenceManager 
before JR 1.4: the DataStore was not yet in a release and I could not 
afford to store the binaries in the DB.

I have 2 workspaces:
Workspace 1 ------
Nodes: 22338 (about 20 nodes added per day)
Properties: 242239
Blobs: 13558 files - 48 GB of storage (stored on an AFS server with a 
symlink in the /blobs directory; no removals, just a few megabytes 
added per day)

Workspace 2 ------
Nodes: 122605 (about 5000 nodes removed and another 5000 added per day)
Properties: 1276972
Blobs: 23842 files - 38 GB of storage (local file system; about 3 GB of 
old data removed and another 3 GB of new data added per day)

As suggested, I have an (almost) balanced tree structure with depth 8; 
there are no more than 100 children per node, usually no more than 20.
------------------------------------------------------------------------

QUESTIONS:

Session

1) What is the behavior when two sessions operate at the same time?
While one session is open reading from the repository and, at the same 
time, another session is writing to the repository and saving with 
node.save() or session.save(), are the changes kept in memory until the 
reading session is closed, or do they become visible to the reading 
session? How does this work for nodes already in memory before the 
change, and for nodes that are not in memory and must be read from the 
persistence layer after the change?

2) How expensive is it to create a new Session through a login? Is it 
better to keep sessions in a pool or just create them every time? 
Currently I keep one session per workspace and return it for read-only 
access (whenever write access is requested, I create a new Session and 
remove all the read-only sessions from the pool, which means I never 
call logout until a read-only session is closed because write access 
was requested). Should I instead return one session per request, 
regardless of usage, and always log it out?

3) What happens if a session is garbage collected without logout() 
having been called?

4) Since I am not using the JR security, I have implemented my own 
AccessManager and LoginModule classes that just return true and perform 
the minimal operations needed to allow everything. This causes an error 
in JR 1.4 at login() time.

Does the basic security provided by JR (SimpleAccessManager and 
SimpleLoginModule) add overhead for security checks? If it does not, I 
will move back to them for better maintainability.


IO

5) Are the InputStreams returned by getProperty("...").getStream() 
FileInputStreams or BufferedInputStreams? If they are not buffered, I 
would wrap them in a BufferedInputStream to try to improve the IO.
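The wrapping mentioned in question 5 is cheap and safe regardless of what getStream() actually returns; a minimal sketch (the 64 KB buffer size is an arbitrary example value, and a ByteArrayInputStream stands in for the property's stream):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedWrap {
    // Wrap an arbitrary InputStream (e.g. the one returned by
    // getProperty("...").getStream()) in a BufferedInputStream,
    // unless it is already buffered.
    static InputStream buffered(InputStream in) {
        if (in instanceof BufferedInputStream) {
            return in;                        // avoid double buffering
        }
        return new BufferedInputStream(in, 64 * 1024);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "binary property content".getBytes();
        try (InputStream in = buffered(new ByteArrayInputStream(data))) {
            System.out.println((char) in.read());  // prints 'b'
        }
    }
}
```

Buffering mainly pays off when the caller does many small reads; for bulk copies with a large byte[] it makes little difference.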

6) How much does the MySqlBundlePersistenceManager improve performance 
on average? My bottleneck is always Java at 100% processor time, never 
MySQL, which uses no more than 5-10%. Will the BundlePM lower Java's 
processor usage?

7) Is there any tool to get a readable version of the serialized node 
stored in the DB?

Backup

8) What is the difference, if any, in performance between:

new SysViewSAXEventGenerator(node, false, true, th).serialize();

and

session.exportSystemView(node.getPath(), ch, true, false);

and is there a way to spread the backup over a longer time so that it 
does not use all the available resources?

9) What happens if, during the backup (which for me takes more than 1 
hour per workspace using the commands in question 8), a lot of 
modifications are performed by other sessions?

10) Since it does not make sense to export a 90 GB XML file with the 
binaries inside, right now I perform a backup by exporting the XML 
without binaries.

Importing it would overwrite all the binaries with new zero-sized files.

To restore, I change the blobs location, import the XML, and then move 
the blobs location back to the original storage in order to remap the 
binaries. Since the node UUIDs do not change, it works.

Do you have a better way to do this? I think the problem is the same 
when using a DataStore.

11) I am planning to move to JR 1.4, but it costs a lot in terms of 
migrating the whole storage to the new DataStore format.

Since the DataStore uses an MD5 hash and no longer the node UUID, I 
cannot map the file structure generated by the blobs back.

The only way is to create a script that converts the blobs structure to 
the new DataStore structure, but for this I need a mapping from node 
UUID to MD5.

Is there a way to know the file URL from the Node instance?

If so, I could create a script that moves each file from the format 
No/de/UUID/propertyname.bin to the new format Fi/le/md5/...
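The digest side of the question-11 migration script could look like the sketch below: read each blob under its old UUID path, digest the content, and build the fan-out path. The MD5 algorithm and the three-level, two-character directory layout are assumptions taken from the email's Fi/le/md5 example; the actual DataStore implementation should be checked, since it may use a different digest (e.g. SHA-1) and layout.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BlobPathMapper {

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Digest the blob content; the UUID -> digest mapping comes from
     *  reading each blob file found under its old No/de/UUID path. */
    static String digestOf(byte[] content, String algorithm)
            throws NoSuchAlgorithmException {
        return hex(MessageDigest.getInstance(algorithm).digest(content));
    }

    /** Build a fan-out path like "d4/1d/8c/d41d8c..." from the digest;
     *  the two-character, three-level layout is an assumption. */
    static String fanOutPath(String digest) {
        return digest.substring(0, 2) + "/"
             + digest.substring(2, 4) + "/"
             + digest.substring(4, 6) + "/"
             + digest;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] blob = "example blob content".getBytes();
        System.out.println(fanOutPath(digestOf(blob, "MD5")));
    }
}
```

With this, the migration reduces to walking the old blob tree, digesting each file, and renaming it into the computed path.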


Indexing and Searching

12) How much of an improvement does specifying indexing rules give? I 
mainly use the name property for searching, plus a few others. Would 
marking these properties as high priority speed things up a lot? I 
think most of the time is spent not on the Lucene query itself but on 
loading and sorting the nodes.

13) When exactly are the nodes loaded from the DB by the QueryEngine?
What happens during query.execute()?
What happens during query.getNodes()? How many nodes are read from the DB?
When (and how) is the sorting done?
What happens during iterator.nextNode()?

14) How does the sorting work, since it cannot be done by the DB? Is it 
done by Lucene, or are all the nodes simply sorted with 
Collections.sort? That would mean all nodes must be loaded before 
returning the first one, even if you only need the first N. How can 
this be sped up?
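For the "only need the first N" concern in question 14, the general technique is a bounded heap, which keeps at most N candidates instead of sorting the whole result set. This illustrates the technique only; it is not a claim about what Jackrabbit's query engine actually does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopN {

    /** Return the n smallest elements of items, in ascending order,
     *  without sorting or materializing the full input. */
    static List<Integer> smallestN(Iterable<Integer> items, int n) {
        // Max-heap of the n best candidates seen so far.
        PriorityQueue<Integer> heap =
                new PriorityQueue<>(n, Comparator.reverseOrder());
        for (Integer item : items) {
            if (heap.size() < n) {
                heap.add(item);
            } else if (item < heap.peek()) {
                heap.poll();        // drop the current worst candidate
                heap.add(item);
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Comparator.naturalOrder());
        return result;
    }

    public static void main(String[] args) {
        List<Integer> data = List.of(9, 1, 7, 3, 8, 2, 6);
        System.out.println(smallestN(data, 3));  // [1, 2, 3]
    }
}
```

This is O(total * log N) instead of O(total * log total), and memory stays bounded by N, which matters most when "total" is a large query result.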

15) Is there any change in JR 1.4? I saw it is possible to limit the 
number of entries returned and to set an offset; how does this work 
with sorting?

16) If I need a specific subnode with a particular property, is it 
faster to list all the subnodes with node.getNodes() and pick the right 
one, or to run a Lucene query? I imagine it depends on the number of 
subnodes, but for approximately 20 subnodes I would expect the query 
overhead to make getNodes() the faster option.

NodeTypeDefinition

17) I use a quite complex node type definition, without references as 
suggested (I use strings and call getNodeByUUID()). How much overhead 
does this definition add for type checking? I could enable it during 
development and testing and disable it in production.

I hope these are not ALL stupid questions; my apologies if some or most 
of them have already been discussed before I joined the mailing list.

Lorenzo Dini


-- 
*Lorenzo Dini*

CERN - European Organization for Nuclear Research
Information Technology Department
CH-1211 Geneva 23

Building 28 - Office 1-007
Phone: +41 (0) 22 7674384
Fax: +41 (0) 22 7668847
E-mail: Lorenzo.Dini@cern.ch
