jackrabbit-users mailing list archives

From "Stefan Guggisberg" <stefan.guggisb...@gmail.com>
Subject Re: importing XML file performance
Date Fri, 13 Apr 2007 14:12:03 GMT
hi alan,
sorry for the late response, it took me a while to figure out what's
causing the bad performance you experienced.

i performed a couple of tests on my machine (2ghz macbook). i used the test
data you provided me (a 17mb document view format xml file consisting of
40 nodes and 291 properties, 15 of which are of type BINARY). i used
SimpleDbPersistenceManager on a local mysql 5.0 instance.

here are the results of the Workspace#importXML tests:

- ~180 seconds using jdk 1.4 (256mb heap)
- ~400 seconds using jdk 1.5 (256mb heap)

profiling revealed that about 85% of cpu time was spent in GC...
in order to further isolate the issue i measured the time it takes
to just parse the xml file by passing a DefaultHandler (consisting
of no-ops only) to the sax parser.
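
a parse-only measurement of this kind could look roughly like the sketch
below. it is not the code used for the tests above: the class name
NoopParseTimer and the built-in sample document are made up for
illustration, and it uses the jaxp api as found in current jdks.

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// hypothetical helper: times a sax parse whose callbacks do no work
final class NoopParseTimer {

    // parses the stream with a handler consisting of no-ops only and
    // returns the elapsed wall-clock time in milliseconds
    static long parseMillis(InputStream in) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            long start = System.nanoTime();
            // DefaultHandler implements every callback as a no-op
            parser.parse(in, new DefaultHandler());
            return (System.nanoTime() - start) / 1_000_000L;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // pass the path of an exported xml file, or run without
        // arguments to parse a tiny built-in sample instead
        try (InputStream in = args.length > 0
                ? new FileInputStream(args[0])
                : new ByteArrayInputStream(
                        "<root a=\"b\"><child>text</child></root>"
                                .getBytes(StandardCharsets.UTF_8))) {
            System.out.println("parse took " + parseMillis(in) + " ms");
        }
    }
}
```

running it with the same heap settings as the import (e.g. -Xmx256m)
keeps the two numbers comparable.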

the results were somewhat baffling ;-) just parsing took ~250 seconds
using jdk 1.4, i.e. considerably longer than when performing a full import!?!

seems like a very busy GC causes these strange results...

further guessing and analysis showed that the binary properties in
document view serialization are responsible for the busy GC and bad
performance. the sax parser seems to be very inefficient in handling
large attribute values.

the good news is that a system view import of the same data
is a *lot* faster since property values are represented as element content
rather than attribute values:

~20 seconds (system view) vs ~180 seconds (document view).
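
to try the attribute vs element content difference outside of jackrabbit,
a small stand-alone comparison along the following lines may help. it is
only a sketch under assumptions: the class ValueCarrierDemo, the element
names and the synthetic payload are invented here and are much simpler
than real document view or system view exports. one plausible factor is
that sax hands an attribute value to the handler as a single String,
while element content may be delivered through several characters()
calls; timings will vary with the jdk and parser in use.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// hypothetical demo class, not part of jackrabbit
final class ValueCarrierDemo {

    // counts how many characters arrive via attributes vs element content
    static final class CountingHandler extends DefaultHandler {
        long attributeChars;
        long contentChars;
        int charactersCalls;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            for (int i = 0; i < atts.getLength(); i++) {
                attributeChars += atts.getValue(i).length();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            contentChars += length;
            charactersCalls++;
        }
    }

    // synthetic stand-in for a base64-encoded binary value
    static String payload(int length) {
        char[] chars = new char[length];
        Arrays.fill(chars, 'A');
        return new String(chars);
    }

    // document view style: the value is carried as an attribute
    static String asAttribute(String value) {
        return "<node data=\"" + value + "\"/>";
    }

    // system view style: the value is carried as element content
    static String asElementContent(String value) {
        return "<node><value>" + value + "</value></node>";
    }

    static CountingHandler parse(String xml) {
        try {
            CountingHandler handler = new CountingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String value = payload(1 << 20);
        for (String xml : new String[] {asAttribute(value), asElementContent(value)}) {
            long start = System.nanoTime();
            CountingHandler h = parse(xml);
            long ms = (System.nanoTime() - start) / 1_000_000L;
            System.out.println(ms + " ms, attribute chars=" + h.attributeChars
                    + ", content chars=" + h.contentChars
                    + ", characters() calls=" + h.charactersCalls);
        }
    }
}
```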

for your specific use case (importing large binary properties) i'd recommend
using system view format.


On 4/11/07, Alan R <arof+nabble@messagio.com> wrote:
> Hi.  I'm using Jackrabbit 1.2.2 (JNDIPersistenceManager on MySQL, external
> blobs), and I'm finding the importXML() call very, very slow.  I've tried
> calling it both on the session and on the workspace and don't notice much
> difference.  Where the export takes seconds, the import takes minutes.  My
> average file size will be just under 20 MB (consisting of on the order of 20
> blobs and 60 nodes per file), and there could be tens or even hundreds of
> these to import in the case of a system restoration from backup.  I can
> afford 20 minutes to restore the system, but not 20 hours.
> Currently I've tried it with a single 17MB file, and workspace.importXML()
> takes 8 and a half minutes.  It was the same for session.importXML().
> Are there any performance enhancements underway?  This seems like a really
> important feature to speed up, because any time I need to migrate data,
> recover from a disk failure or change fundamental jackrabbit configuration,
> I will need to import exported data, including blobs.
> Is there something flawed in my backup/restoration strategy?
> Thanks.
> -Alan
> quipere wrote:
> >
> > Saving it on the workspace would take about half of the time, if I am
> > right. But I will then be stuck with the risk of having to write my own
> > rollback functions, because I am persisting more actions than just the
> > xml import in one session.save().
> > Does node.remove also load all the child nodes into memory? When I
> > remove the main node of the imported xml, it consumes about the same
> > amount of memory as the import function.
> >
> >
> > Jukka Zitting-3 wrote:
> >>
> >> On 10/24/06, quipere <jquipere@hotmail.com> wrote:
> >>> Does everybody have the same performance results while importing this
> >>> XML?
> >>
> >> Yes. The imported content is stored in the transient state of the
> >> session, which is kept fully in memory. Additionally, the Jackrabbit
> >> ItemState objects used to represent nodes and properties in memory are
> >> heavier than the DOM equivalents, so large XML files will use lots of
> >> memory when imported.
> >>
> >
> --
> View this message in context: http://www.nabble.com/importing-XML-file-performance-tf2493911.html#a9936830
> Sent from the Jackrabbit - Users mailing list archive at Nabble.com.
