jackrabbit-users mailing list archives

From "Alexander Klimetschek" <aklim...@day.com>
Subject Re: jackrabbit's xml import overhead
Date Thu, 10 Jul 2008 20:03:22 GMT
On Thu, Jul 10, 2008 at 7:47 PM, rokham <somebodyiknow@gmail.com> wrote:
>
> Hi all,
>
> I'm trying to decide between the following two options and haven't been
> able to find an answer through Google searches.
>
> I am writing an application which requires importing MANY, LARGE (not
> exactly sure how many or how big yet...) XML files into Jackrabbit. My
> concerns are twofold:
>
> 1. I need to use Lucene to index these XML files, and I want to be able to
> run XPath queries on this data (the hierarchy of my content is important).
>
> 2. I want the process of importing (which will happen frequently) to be as
> fast as possible.

Ok, I know we always want our systems to handle a big pile of data as
fast as possible, with full indexing and fast queries, and without
using a lot of disk space, but in the end you have to make
tradeoffs ;-). XML brings a lot of inefficiencies with it, and JCR is
not a pure XML database...
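
For reference, the standard import mentioned in option 1 below goes
through Session.importXML(). A minimal sketch (the parent path and
file name are placeholders, and the parent node must already exist):

import java.io.FileInputStream;
import java.io.InputStream;
import javax.jcr.ImportUUIDBehavior;
import javax.jcr.Session;

public class XmlImport {

    // Document view import of one XML file below an existing
    // parent node (here "/imported", a placeholder path).
    public static void importFile(Session session, String fileName)
            throws Exception {
        InputStream in = new FileInputStream(fileName);
        try {
            session.importXML("/imported", in,
                    ImportUUIDBehavior.IMPORT_UUID_CREATE_NEW);
            session.save();
        } finally {
            in.close();
        }
    }
}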


> I am not sure which of my two solutions would work.
>
> solutions:
>
> 1. Import all the XML files using Jackrabbit's XML import API.
>  - This keeps the structure of the XML content, but it's presumably slow.
> I'm not sure what the overhead is. I wonder if anyone has done any
> profiling for Jackrabbit 1.4. Are there tweaks that can make this process
> faster?
>
> 2. Import all the XML files' content as plain strings.
>  - I believe this will prevent Lucene/Jackrabbit from being aware of the
> hierarchy of the data, but I'm NOT sure.

Correct: if you import them as plain strings or even as binary streams
(as nt:file, for example), they will only be indexed as full-text.
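
A minimal sketch of storing a file that way, as an nt:file node
(names are placeholders):

import java.io.InputStream;
import java.util.Calendar;
import javax.jcr.Node;

public class FileStore {

    // Stores the XML as an opaque binary; the element hierarchy is
    // not mapped to JCR nodes, so only full-text search will see it.
    public static void storeFile(Node parent, String name, InputStream in)
            throws Exception {
        Node file = parent.addNode(name, "nt:file");
        Node content = file.addNode("jcr:content", "nt:resource");
        content.setProperty("jcr:mimeType", "text/xml");
        content.setProperty("jcr:lastModified", Calendar.getInstance());
        content.setProperty("jcr:data", in);
        parent.getSession().save();
    }
}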

> Would the imports be faster in this
> case?

Yes.

> Would they be a lot faster?

Using a DataStore for binary content, it is probably much faster,
especially when the XML structure is large and deep and would
otherwise create a large JCR hierarchy.
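
(The DataStore is configured in repository.xml; a minimal sketch
using the FileDataStore, where the path value is just an example:)

<DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
    <param name="path" value="${rep.home}/repository/datastore"/>
    <param name="minRecordLength" value="100"/>
</DataStore>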

> Would searching the content be as accurate
> as the first scenario?

That depends on the type of queries. But if you really need the
hierarchy information, the second option is not the way to go.

In general, experience has shown that using JCR as a plain XML
database is not very efficient, since XML typically has a lot of
redundant elements in its structure which can be better expressed with
JCR's model. For example, XML data structures often look like this:

<element>
     <key>value</key>
     <key2>othervalue</key2>
</element>

which maps to far too many JCR nodes with the standard document view
XML import:

+ element
    + key
        - cdata = value
    + key2
        - cdata = othervalue

A better JCR model would be:

+ element
    - key = value
    - key2 = othervalue
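
With that model, a hierarchy-aware query stays simple; a sketch using
the names from the example above:

import javax.jcr.NodeIterator;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

public class ElementQuery {

    // Finds all "element" nodes whose "key" property equals "value".
    // This only works when the XML structure is mapped to nodes and
    // properties, not when it is stored as a plain string or binary.
    public static NodeIterator findElements(Session session)
            throws Exception {
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery(
                "//element[@key = 'value']", Query.XPATH);
        return query.execute().getNodes();
    }
}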

Jackrabbit is very efficient when you have a good node/property
ratio, i.e. without many nodes that have no properties at all or only
one property, since it stores a node and all its properties as one
bundle (using the recommended bundle persistence managers).
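
A bundle persistence manager is configured per workspace in
repository.xml; a sketch using the Derby variant (parameter values
are just examples):

<PersistenceManager class="org.apache.jackrabbit.core.persistence.bundle.DerbyPersistenceManager">
    <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
    <param name="schemaObjectPrefix" value="${wsp.name}_"/>
</PersistenceManager>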

So the best approach for you is to think about a good mapping from
the XML data structure to a JCR model and then write a custom
importer (which is not difficult).
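
A minimal sketch of such an importer, assuming the flat
<key>value</key> structure from the example above (a real importer
would also have to handle attributes, mixed content, name
collisions, etc.):

import java.io.File;
import javax.jcr.Node;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class CustomImporter {

    // Maps an XML element to one JCR node; leaf child elements
    // become properties instead of extra nodes.
    public static void importElement(Node parent, Element elem)
            throws Exception {
        Node node = parent.addNode(elem.getTagName());
        NodeList children = elem.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                Element child = (Element) children.item(i);
                if (child.getElementsByTagName("*").getLength() == 0) {
                    // leaf element -> property
                    node.setProperty(child.getTagName(),
                            child.getTextContent());
                } else {
                    importElement(node, child);
                }
            }
        }
    }

    public static void importFile(Node parent, String fileName)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File(fileName));
        importElement(parent, doc.getDocumentElement());
        parent.getSession().save();
    }
}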

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com
