jackrabbit-users mailing list archives

From "David Moss" <mos...@googlemail.com>
Subject Re: Jackrabbit performance when adding many documents to repository.
Date Tue, 12 Dec 2006 11:24:50 GMT
Hi, thanks for the quick response.

On 12/12/06, Jukka Zitting <jukka.zitting@gmail.com> wrote:
> Hi,
> On 12/12/06, David Moss <mossdo@googlemail.com> wrote:
> > On 12/11/06, Jukka Zitting <jukka.zitting@gmail.com> wrote:
> > > The save() operation is expensive but so is having a too large
> > > transient space. The best way to do bulk imports for now is to save()
> > > the transient changes every now and then, like once every 100 added
> > > nodes. This should give you a nice performance boost.
> >
> > Hmm, I've tried this but there didn't seem to be a noticeable difference.  I
> > wonder if it's simply the additional cost of indexing the documents that
> > takes so long?
> What kind of documents are you using? What index filter components
> have you enabled?

My documents vary and could really be any type, but include xls, pdf, doc,
html, plain text.
I've enabled all the text filters (I think):

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
            <param name="path" value="${wsp.home}/index"/>
            <param name="textFilterClasses" value="
            org.apache.jackrabbit.core.query.OpenOfficeTextFilter" />

A quick test with this setting removed from the config files brings the
insertion time for my 1500 documents back down to ~4 minutes, so it looks
like the indexing is to blame.

I'm currently only testing with a single pdf document, so perhaps the pdf
indexer is just particularly slow and results would be different with other
document types.

Is there any way that this can be done in the background!?

> > > Note also that the RMI layer is not a very efficient way to access the
> > > repository. For best performance with bulk operations over the RMI
> > > layer I would definitely recommend using the XML import/export
> > > operations since they simply stream the XML data over the network.
> >
> > Thanks.  How would I go about doing this?  I need to be able to add non-xml
> > documents to the repository in a way that allows them to be indexed and
> > searched through Lucene.  Can I simply wrap the document binary in a minimal
> > xml document?
> The potential speedup depends on the size of the binary documents. The
> normal mechanism for adding a binary document would be:
>     Node parent = ...;
>     Node file = parent.addNode("filename", "nt:file");
>     Node resource = file.addNode("jcr:content", "nt:resource");
>     resource.setProperty("jcr:lastModified", ...);
>     resource.setProperty("jcr:mimeType", ...);
>     resource.setProperty("jcr:data", ...);
>     session.save();
> With JCR-RMI this causes the overhead of at least 6 network roundtrips
> (once per each method invocation), so the relative performance loss
> depends heavily on the size of the binary file. If the file is <10kB
> in size, then the overhead is considerable; and if the size is >100kB,
> then I would imagine the network throughput being the limiting factor.
> The XML serialization trick helps when you are importing a large
> number of small documents or non-binary properties. See the system
> view or document view formats and serialize your content using either
> one of them. Binary properties are serialized as base64 encoded
> strings.
> > Would interfacing with the repository via WebDAV be a better solution?
> Possibly, since imports over WebDAV are typically more coarse-grained
> (i.e. the file properties are transferred in one request along with
> the file contents). However, if you have a large number of small
> files, then the XML trick might still be faster.
> BR,
> Jukka Zitting

This all makes sense.  Actually, for the most part I'm connecting to the
repository from the same machine that the rmi server is running on (but not
exclusively!), so network roundtrips and throughput should be negligible
factors.  My primary reason for using RMI at all is that I need to access
the repository from multiple VMs.
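
For reference, here is a rough sketch of how the XML wrapping described above
might look: one nt:file node with its jcr:content child, serialized in the
system view format with the binary as a base64 string. The class and method
names (SystemViewSketch, toSystemView) are made up for illustration, the
example uses java.util.Base64 from newer JDKs, and I have not run the output
through an actual import, so treat it as an assumption-laden starting point
rather than a tested recipe.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative sketch only: builds a JCR system view XML fragment for a single
// nt:file node. Class and method names are hypothetical, not Jackrabbit API.
class SystemViewSketch {

    // Namespace URIs defined by JCR 1.0 for the sv, jcr and nt prefixes.
    static final String SV_NS = "http://www.jcp.org/jcr/sv/1.0";
    static final String JCR_NS = "http://www.jcp.org/jcr/1.0";
    static final String NT_NS = "http://www.jcp.org/jcr/nt/1.0";

    // Escape the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // One single-valued property in system view form.
    static String property(String name, String type, String value) {
        return "<sv:property sv:name=\"" + escape(name) + "\" sv:type=\"" + type + "\">"
             + "<sv:value>" + escape(value) + "</sv:value></sv:property>";
    }

    // Mirrors the addNode/setProperty sequence quoted above, but as one XML string.
    static String toSystemView(String fileName, String mimeType,
                               String lastModified, byte[] data) {
        String base64 = Base64.getEncoder().encodeToString(data);
        return "<sv:node xmlns:sv=\"" + SV_NS + "\" xmlns:jcr=\"" + JCR_NS
             + "\" xmlns:nt=\"" + NT_NS + "\" sv:name=\"" + escape(fileName) + "\">"
             + property("jcr:primaryType", "Name", "nt:file")
             + "<sv:node sv:name=\"jcr:content\">"
             + property("jcr:primaryType", "Name", "nt:resource")
             + property("jcr:mimeType", "String", mimeType)
             + property("jcr:lastModified", "Date", lastModified)
             + property("jcr:data", "Binary", base64)
             + "</sv:node></sv:node>";
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(
            toSystemView("hello.txt", "text/plain", "2006-12-12T11:24:50.000Z", data));
    }
}
```

The resulting string would presumably then be handed to
Session.importXML(parentAbsPath, inputStream, uuidBehavior) in a single call,
which is where the saving in roundtrips would come from. Many files could be
nested under one enclosing sv:node to batch them into one import.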

Thanks again.

