jackrabbit-users mailing list archives

From "Stefan Kurla" <stefan.ku...@gmail.com>
Subject Re: importing jackrabbit into jackrabbit
Date Thu, 26 Apr 2007 17:40:23 GMT
> Stefan Kurla wrote:
> > As far as the file nodetype is concerned, this is a custom nodetype
> > which has 4 references per file imported and currently, all the
> > references are made to the same UUID since we are testing, this could
> > change in the future.
>
> this may be the time consuming factor. whenever a reference is added that points
> to a node N the complete set of references pointing to N is re-written to the
> persistence manager. with increasing number of references to N this will slow
> down your import. is there a reason why all files point to the same node?
>
Imagine that N is the admin node and the admin has access to every
file in the system. That could be one reason why all the nodes would
hold references to N. This happens when you keep the security
structure inside your workspace.

If this is the case, would it be wise to take the security out of the
workspace and store the security information in a separate DB or
workspace?


> > Any tips or ideas? I will update the results of the test. Right now I
> > have imported 1K out of 12K files and the import time has gone up to 4
> > seconds per file. Is this normal? Remember since I am importing the
> > jackrabbit SVN all files are put under one nt:folder which is
> > "jackrabbit". This is a pretty normal case of about 12K files and only
> > 78MB. We have plans of a 1TB repository.
>
> I did a quick test with an adapted version of
> http://svn.apache.org/repos/asf/jackrabbit/trunk/jackrabbit-core/src/test/java/org/apache/jackrabbit/core/query/TextExtractorTest.java
> that saves changes whenever 100 files have been imported.
>
> I used the svn export of jackrabbit/trunk (~3000 files in ~900 folders)
>
> configuration:
> - jackrabbit in-process
> - o.a.j.c.persistence.db.DerbyPersistenceManager (externalBlobs = false)
> - text extractors: pdf, xml and plain text
>
> test result:
>
> Imported 2978 files in 50484 ms.
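The quoted test saves after every 100 imported files. A minimal sketch of that batching pattern is below; it is not the code of the adapted TextExtractorTest. The `Saver` interface is a hypothetical stand-in for a call such as `javax.jcr.Session.save()`, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class BatchedImport {
    /** Hypothetical stand-in for persisting pending changes, e.g. Session.save(). */
    interface Saver {
        void save();
    }

    /**
     * Walks over the files and calls save() once per full batch,
     * plus once more for a trailing partial batch.
     * Returns the number of save() calls made.
     */
    static int importAll(List<String> files, int batchSize, Saver saver) {
        int pending = 0;
        int saves = 0;
        for (String file : files) {
            // A real import would add a file node for 'file' here (assumption).
            pending++;
            if (pending == batchSize) {
                saver.save();
                saves++;
                pending = 0;
            }
        }
        if (pending > 0) {
            // Flush whatever is left over after the last full batch.
            saver.save();
            saves++;
        }
        return saves;
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 2978; i++) {
            files.add("file-" + i);
        }
        int saves = importAll(files, 100, () -> { });
        System.out.println(saves + " saves for " + files.size() + " files");
    }
}
```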

I ran the test case from a main class, with the repository accessed
over RMI on localhost and connected to MySQL, also on localhost.
The test set was 2226 files (136MB) and the import took 397 seconds.
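To put the figures in this thread on one scale, the sketch below converts them to milliseconds per file: roughly 17 ms for the in-process Derby run, roughly 178 ms for the RMI and MySQL run, against the 4 seconds per file seen in the original import. The runs used different file sets and setups, so this is only a rough comparison. The class and method names are illustrative.

```java
class ImportTimings {
    /** Average import time per file, in milliseconds. */
    static double msPerFile(int files, long totalMs) {
        return (double) totalMs / files;
    }

    public static void main(String[] args) {
        // Figures taken from the thread; labels are informal.
        System.out.printf("in-process, Derby: %.1f ms/file%n", msPerFile(2978, 50_484L));
        System.out.printf("RMI + MySQL:       %.1f ms/file%n", msPerFile(2226, 397_000L));
        System.out.printf("original import:   %.1f ms/file%n", 4_000.0);
    }
}
```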

Will try this test case with my code and node type structure now. Will
keep this thread updated.

I would appreciate your thoughts on references though, because one of
the biggest strengths of JSR-170 is the ability to store references.
Imagine a situation where I have a nodetype called docType whose value
is either the string pdf or word. Say 80% of my documents are word
documents. Then the word docType node would be the target of
references from 80% of all documents in my repository. If my
repository holds 100,000 files, then 80,000 nodes reference that one
docType node.

If what you say is correct, that with every new reference the complete
set of references is rewritten, then obviously this is a bottleneck.

Should such a situation be avoided?
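A toy model of the behaviour described above, assuming that each new reference to N causes the complete set of references to N to be written again. This is not Jackrabbit's persistence code; it only counts entries under that assumption. With n references the total is n(n+1)/2, so the cost grows quadratically. The class name and the sample sizes (4 references per file for 1K and 12K files, and the 80,000 docType case) are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class ReferenceRewriteCost {
    /**
     * Counts how many reference entries are written in total when
     * referenceCount references to the same node N are added one by one
     * and the full set is rewritten on every addition (assumed behaviour).
     */
    static long entriesWritten(int referenceCount) {
        List<Integer> referencesToN = new ArrayList<>();
        long written = 0;
        for (int source = 0; source < referenceCount; source++) {
            referencesToN.add(source);
            // The whole set, including the new entry, is written again.
            written += referencesToN.size();
        }
        return written;
    }

    public static void main(String[] args) {
        for (int n : new int[] {4_000, 48_000, 80_000}) {
            System.out.println(n + " references -> "
                    + entriesWritten(n) + " entries written");
        }
    }
}
```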

Thanks.
>
> regards
>   marcel
>
