Mailing-List: contact users-help@jackrabbit.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@jackrabbit.apache.org
Message-ID: <90a8d1c00802010738s96ba414vafb28b81da6ffe5f@mail.gmail.com>
Date: Fri, 1 Feb 2008 16:38:09 +0100
From: "Stefan Guggisberg" <stefan.guggisberg@gmail.com>
To: users@jackrabbit.apache.org
Subject: Re: Some performance questions about Jackrabbit
In-Reply-To: <47A314A3.3000501@cern.ch>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <47A314A3.3000501@cern.ch>

hi lorenzo

i'll try to answer some of your questions in-line...

On Feb 1, 2008 1:46 PM, Lorenzo Dini <Lorenzo.Dini@cern.ch> wrote:
> Hi Everybody,
>
> I have been using Jackrabbit for almost 1 year now and I have some (a
> lot of :-) questions about the cost of operations performed on the
> repository because I am trying to optimize the performance and knowing
> what are the real operations done underneath helps for the tuning :-)
>
> I hope somebody will answer and I hope this question will also help
> other people in using JR in the best way.
>
> I am sure these questions apply to any deployment model, but I am just
> describing my case.
>
> ------------------------------------------------------------------------
> Basically I have a Tomcat WAR with an Axis and a REST Webservices as
> front-end that use the in-thread JackRabbit to read-store lot of GB of data.
>
> Currently I am using the Jackrabbit 1.3.1 directly embedded in a
> custom application that runs on Tomcat.
>
> I am using LocalFileSystemPersistenceManager and I am still using the
> SimpleDBPersistenceManager (with MySql in the same machine) because
> since I have lots of binaries in it, it was impossible to move to the
> BundlePersistenceManager before JR 1.4 since the DataStore was not in
> the release and I could not afford to store the binaries in the DB.
>
> I have 2 workspaces:
> Workspace 1------
> Nodes:  22338   (added about 20 nodes per day)
> Properties: 242239
> Blobs:  13558 files - 48 GB of storage (is stored in a AFS server with a
> softlink in the /blobs directory, no removal, just few megabytes added
> per day)
>
>
> Workspace 2------
> Nodes: 122605 (removed about 5000 nodes and added other 5000 nodes per day)
> Properties: 1276972
> Blobs: 23842 - 38 GB of storage (local file system, about 3GB of old
> data removed per day and other 3GB of new data added)
>
> As suggested, I have a (almost) balanced tree structure with depth 8,
> there are not more than 100 children per node, usually no more than 20.
> ------------------------------------------------------------------------
>
> QUESTIONS:
>
> Session
>
> 1) How is the behavior when there are two session operating at the same
> time?
> Whenever a session is open reading from the repository, and at the same
> time another session is writing in the repository and saving with
> node.save() or session.save(), are the changes cached in memory until
> the read only session is closed or the changes are visible in the
> read-only session? How does it work with nodes already in memory before
> the change? and with the nodes that are not in memory and must be read
> from the persistence after the change?

since you're using an in-proc jackrabbit instance, all changes saved by
session A will be instantly reflected in session B's state, i.e. a Node instance
read by session B will be instantly updated to reflect the changes saved
by session A.

the only situation in which a Node/Property instance can become 'stale' is
when session B has made transient (unsaved) changes and the same items
have been modified (saved) by another session in the meantime.

>
> 2) How much is the cost to create a new Session through a login?? Is it
> better to store them in a pool or just create them every time? Currently
> I store 1 session per workspace and return them in case of read-only
> access (whenever a write-access connection is requested, I generate a
> new Session and remove all the read-only from the pool, that means I
> never do the logout until the read-only is closed when a write-access is
> requested..) Shall I return 1 session per request no matter of the usage
> and always logout them?

JCR sessions are not light-weight, so it usually makes sense to
use some pooling mechanism.
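
a minimal sketch of such a pool, using only the JDK. the class name
SimplePool and its methods are made up for illustration and are not part
of jackrabbit or the JCR API; T stands in for javax.jcr.Session, and the
factory is assumed to wrap your repository.login(...) call:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

/**
 * Hypothetical fixed-size pool. T stands in for javax.jcr.Session and the
 * factory for a call to repository.login(...); both are assumptions.
 */
final class SimplePool<T> {
    private final BlockingQueue<T> idle;

    SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // pay the creation cost once, up front
        }
    }

    /** Blocks until an instance is free, then hands it to the caller. */
    T borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting for a pooled instance", e);
        }
    }

    /** Returns an instance; throws IllegalStateException if the pool is already full. */
    void giveBack(T item) {
        idle.add(item);
    }

    /** Number of instances currently idle in the pool. */
    int available() {
        return idle.size();
    }
}
```

with real sessions you would probably want to call session.refresh(false)
before giving a session back, so that transient changes don't leak to the
next borrower, and hand each session to only one thread at a time.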

>
> 3) What happens if a session is garbage collected without a logout has
> been executed?
>
> 4) Since I am not using the JR security, I have implemented my own
> classes for AccessManager and LoginModule that just return true and
> perform the minimal operation to allow anything. This cause an error in
> JR 1.4 at login() time.
>
> Is the basic security provided by JR (SimpleAccessManager and
> SimpleLoginModule) add overhead for security checks? In case it does
> not, I will move back to them for better maintenance.

SimpleAccessManager and SimpleLoginModule are just dummy
implementations. using them shouldn't incur any significant performance
overhead.

>
>
> IO
>
> 5) Are the InputStream returned by getProperty("...").getStream()
> FileInputStream or BufferedInputStream? In case I would wrap them with a
> BufferedInputStream to try to improve the IO.

assuming we're talking about binary properties, Property.getStream()
returns either a ByteArrayInputStream or a FileInputStream, depending on
the size of the data.

while it doesn't make much sense in the former case, wrapping a FileInputStream
with a BufferedInputStream should generally improve read performance.
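
a small sketch of that wrapping, using only the JDK. the helper class
StreamWrapping and the 64 KB buffer size are assumptions for illustration,
not something jackrabbit provides:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;

/** Hypothetical helper; the name and the buffer size are illustrative only. */
final class StreamWrapping {
    private static final int BUFFER_SIZE = 64 * 1024; // assumed value, tune as needed

    private StreamWrapping() {
    }

    /**
     * Returns the stream unchanged if it is memory-backed or already
     * buffered; otherwise wraps it in a BufferedInputStream.
     */
    static InputStream buffered(InputStream in) {
        if (in instanceof ByteArrayInputStream || in instanceof BufferedInputStream) {
            return in; // an extra buffer would only add copying
        }
        return new BufferedInputStream(in, BUFFER_SIZE);
    }
}

// intended use with JCR (not compiled here, needs the javax.jcr API):
//   InputStream in = StreamWrapping.buffered(node.getProperty("jcr:data").getStream());
```

the instanceof check simply skips the wrapper in the ByteArrayInputStream
case, where it would not help.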

>
> 6) How much the MySqlBundlePersistenceManager in average improves the
> performances?? My bottleneck is always 100% of processor time with JAVA
> and never MySql that is using not more than 5-10%, will the BundlePM
> lower down the usage of processor by Java?
>
> 7) Is there any tool to get a readable version of the serialized node
> stored in the DB?

not yet, you would have to write such a tool ;)

>
> Backup
>
> 8) What is the difference, if any, in performance between:
>
> new SysViewSAXEventGenerator(node, false, true, th).serialize();
>
> and
>
> session.exportSystemView(node.getPath(), ch, true, false);
>
> and is there a way to spread the backup in a longer time in order not to
> use all the available resources?
>
> 9) What happens if during the backup (that for me takes more than 1 hour
> per workspace doing the commands in question 8) a lot of modifications
> are performed by other sessions?
>
> 10) Since it does not make sense to export a 90 GB XML file with the
> binaries inside, right now, to perform a backup, I am exporting the XML
> without binaries.
>
> Importing it will overwrite all the binaries with new files 0 sized.
>
> To restore it, I am changing the blobs location, import the xml, and
> then move back the blobs location to the original storage in order to
> remap the binaries. Since the node UUID do not change, it works.
>
> Do you have a better way to do this? The problem is the same using a
> DataStore I think.
>
> 11) I am planning to move to JR 1.4 but it costs a lot in terms of
> migration of the whole storage to the new DataStore format.
>
> Since the DataStore uses the md5 and not anymore the node UUID I cannot
> replace back the file structure generated by the Blobs.
>
> The only way is to create a script to change the blobs structure to the
> new DataStore structure but for this I need a mapping Node UUID -> md5
>
> Is there a way to know the file url from the Node instance??
>
> If so, I could create a script that changes a specific file from the
> format No/de/UUID/propertyname.bin to the new format Fi/le/md5/...
>
>
> Indexing and Searching
>
> 12) How much is the improvement of specifying the indexing rules? I
> mainly use the name property for searching and a few others... Would
> setting these properties as prioritized speed things up a lot? I think
> that most of the time is spent not on the Lucene query itself but in
> loading and sorting the nodes.
>
> 13) When exactly the nodes are loaded from the DB by the QueryEngine?
> What's happening during query.execute()?
> What's during query.getNodes()? how many nodes are read from the DB?
> When (and how) the sorting is done?
> What's during iterator.nextNode()
>
> 14) How does the sorting work, since it cannot be done by the DB? Is it
> done by Lucene? Or are all the nodes simply sorted using Collections.sort?
> That would mean that all nodes must be loaded before returning the first
> one, even if you need only the first N. How can this be sped up?
>
> 15) Is there any change in JR 1.4? I saw it is possible to limit the
> entries returned and the offset, how this work with sorting?
>
> 16) In case I need a specific subnode with a particular property, is it
> faster to list all the subnodes using node.getNodes() and pick the
> right one, or to do a Lucene query? I imagine it depends on the
> number of subnodes, but for approximately 20 subnodes the overhead of
> Lucene exceeds that of getNodes()
>
> NodeTypeDefinition
>
> 17) I use a quite complex nodetypedefinition, without references as
> suggested (I use strings and do the getNodeByUUID()). How much overhead
> this definition has in checking the types? I could enable it during
> development and testing and disable it in production.

IMO there shouldn't be any significant performance impact when using complex
node types. however, i'd like to encourage you to perform some tests
(and share the results;).

>
> I hope they are not ALL stupid questions, my apologies if some or most
> of them have been already discussed before I joined the mailing list.

don't worry, your questions are very much appreciated!

cheers
stefan

>
> Lorenzo Dini
>
>
> --
> *Lorenzo Dini*
>
> CERN - European Organization for Nuclear Research
> Information Technology Department
> CH-1211 Geneva 23
>
> Building 28 - Office 1-007
> Phone: +41 (0) 22 7674384
> Fax: +41 (0) 22 7668847
> E-mail: Lorenzo.Dini@cern.ch
>