jackrabbit-users mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Jackrabbit performance when adding many documents to repository.
Date Tue, 12 Dec 2006 10:55:36 GMT
Hi,

On 12/12/06, David Moss <mossdo@googlemail.com> wrote:
> On 12/11/06, Jukka Zitting <jukka.zitting@gmail.com> wrote:
> > The save() operation is expensive, but so is letting the transient
> > space grow too large. The best way to do bulk imports for now is to
> > save() the transient changes every now and then, for example once
> > every 100 added nodes. This should give you a nice performance boost.
>
> Hmm, I've tried this but there didn't seem to be a noticeable difference.  I
> wonder if it's simply the additional cost of indexing the documents that
> takes so long?

What kind of documents are you using? What index filter components
have you enabled?
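
For reference, the batching suggested above could look something like
the following minimal sketch; "session", "parent", and "count" are
assumed from context, and the node names, node type, and batch size
of 100 are purely illustrative:

    // Bulk-import loop that calls save() once per 100 added nodes
    // instead of keeping everything in the transient space.
    for (int i = 0; i < count; i++) {
        parent.addNode("node" + i, "nt:unstructured");
        if ((i + 1) % 100 == 0) {
            session.save(); // persist and clear the transient space
        }
    }
    session.save(); // persist any remaining changes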

> > Note also that the RMI layer is not a very efficient way to access the
> > repository. For best performance with bulk operations over the RMI
> > layer I would definitely recommend using the XML import/export
> > operations since they simply stream the XML data over the network.
>
> Thanks.  How would I go about doing this?  I need to be able to add
> non-XML documents to the repository in a way that allows them to be
> indexed and searched through Lucene.  Can I simply wrap the document
> binary in a minimal XML document?

The potential speedup depends on the size of the binary documents. The
normal mechanism for adding a binary document would be:

    // Assuming "session" is a logged-in Session and "stream" is an
    // InputStream over the file contents; the MIME type is illustrative.
    Node parent = ...;
    Node file = parent.addNode("filename", "nt:file");
    Node resource = file.addNode("jcr:content", "nt:resource");
    resource.setProperty("jcr:lastModified", Calendar.getInstance());
    resource.setProperty("jcr:mimeType", "application/octet-stream");
    resource.setProperty("jcr:data", stream);
    session.save();

With JCR-RMI this causes the overhead of at least six network
roundtrips (one per remote method invocation), so the relative
performance loss depends heavily on the size of the binary file. If
the file is smaller than 10kB, the roundtrip overhead is considerable;
if it is larger than 100kB, I would imagine network throughput becomes
the limiting factor.

The XML serialization trick helps when you are importing a large
number of small documents or non-binary properties. Serialize your
content using either the system view or the document view format;
binary properties are serialized as base64-encoded strings.
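
For example, using the standard exportSystemView() and importXML()
calls on javax.jcr.Session (the paths and file name below are just
placeholders):

    // Export a subtree in the system view format; with
    // skipBinary=false, binary properties are included as base64.
    OutputStream out = new FileOutputStream("export.xml");
    session.exportSystemView("/some/path", out, false, false);
    out.close();

    // Import on the target repository; over JCR-RMI the XML is
    // streamed in a single remote call instead of one call per
    // node and property.
    InputStream in = new FileInputStream("export.xml");
    session.importXML("/target/path", in,
            ImportUUIDBehavior.IMPORT_UUID_CREATE_NEW);
    session.save();
    in.close();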

> Would interfacing with the repository via WebDAV be a better solution?

Possibly, since imports over WebDAV are typically more coarse-grained
(i.e. the file properties are transferred in one request along with
the file contents). However, if you have a large number of small
files, then the XML trick might still be faster.
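
To illustrate the difference, a single WebDAV PUT from plain Java
could look like this sketch; the URL assumes Jackrabbit's standard
WebDAV webapp, and authentication and error handling are omitted:

    // One HTTP PUT carries the whole file body in a single request;
    // "bytes" is assumed to hold the file contents.
    URL url = new URL("http://localhost:8080/repository/default/docs/file.pdf");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setRequestProperty("Content-Type", "application/pdf");
    conn.setDoOutput(true);
    OutputStream out = conn.getOutputStream();
    out.write(bytes);
    out.close();
    int status = conn.getResponseCode(); // expect 201 Created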

BR,

Jukka Zitting
