jackrabbit-users mailing list archives

From Steven Singer <Steven.Sin...@radintl.com>
Subject Re: importxml memory
Date Wed, 22 Aug 2007 21:56:25 GMT

Stefan,

Thanks for taking a look at it.

I'm aware that every node is versionable and we've noticed some of the 
issues with that. We had a perceived requirement of 'being able to 
easily revert any set of changes', which led us down that path. We will 
have to re-think our use of mix:versionable and look at other ways of 
accomplishing our application goals, but we haven't yet had a chance to 
do so (along with other modelling changes).

I'm actually surprised that our data structure is so deep; I hadn't 
intended it to be this way and suspect we have (or had) an application 
bug that is causing this.

The _delete_me nodes exist because we were unable to delete corrupt nodes. 
What I think happened is that some edits were made to the repository when 
it was brought up pointing at a different version store (either a 
different repository.xml, or someone copied/deleted a workspace data dir 
without the associated version store, or something else happened to 
corrupt the nodes). We kept getting InvalidItemStateExceptions, and the 
only way we could find to get rid of those nodes was to rename them and 
then strip them out during an import.

What I ended up doing was writing a program that connected to the source 
repository and my new destination repository.

It then walked the node tree, performing a non-recursive exportXML and 
importXML one node at a time (stripping out the _delete_me nodes and the 
versionHistory properties), and saving after each node import.

Thanks for your help.



> hi steve
>
> On 7/30/07, Steven Singer <Steven.Singer@radintl.com> wrote:
>>
>> How are people using importxml to restore or import anything but small
>> amounts of data into the repository? I have a 22meg xml file that I'm
>> unable to import because I keep running out of memory.
>
> i analyzed the xml file that you sent me offline (thanks!).
> i noticed the following:
>
> 1) system view xml export
> 2) file size: 22mb without whitespace,
>    => 650mb with simple 2-space indentation (!)
> 3) 23k nodes and 202k properties
> 4) virtually every node is versionable
> 5) *very* deep structure: max depth is 2340... (!)
> 6) lots of junk data (e.g. thousands of _delete_me1234567890 nodes,
>    btw hundreds/thousands of levels deep and all versionable)
>
> i'd say that the content model has lots of room for improvement ;)
>
> mainly 5) accounts for the excessive memory consumption during
> import. while this could certainly be improved in jackrabbit i can't think of a
> really good use case for creating >2k level deep hierarchies.
>
> i would also suggest reviewing the use of mix:versionable. versionability
> doesn't come for free since it implies a certain overhead. making 1 node
> mix:versionable creates approx. 7 nodes and 13 properties in the version store
> (version history, root version etc etc). mix:versionable should therefore only
> be used where needed.
>
> btw: by using a decorated content handler which performed a save every
> 200 nodes i was able to import the data with 512mb heap. it took about
> 30 minutes on a macbook pro (2ghz).
>
> cheers
> stefan
>
>>
>> The importxml in JCR commands works fine, but when I go to save the data
>> the jvm memory usage goes up to 1GB and eventually runs out of memory.
>> This was sort of discussed
>> http://mail-archives.apache.org/mod_mbox/jackrabbit-users/200610.mbox/browser
>> but I didn't see any solutions proposed.
>>
>> Does the backup tool suffer from the same problem (being unable to restore
>> content above a certain size)?  How have other people handled migrating
>> data between different persistence managers or changing a node-type
>> definition that seems to require a re-import?
>>
>>
>>
>>
>> Steven Singer
>> RAD International Ltd.
>>
>>
>
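
For the record, the "decorated content handler" stefan mentions above is 
presumably something along the lines of the sketch below. This is my own 
reconstruction under assumptions, not his code: it wraps the handler 
returned by Session.getImportContentHandler() and calls save() after 
every batchSize completed sv:node elements; the class and field names 
are made up for illustration.

import javax.jcr.ImportUUIDBehavior;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

public class BatchSavingImportHandler extends XMLFilterImpl {

    private static final String SV_URI = "http://www.jcp.org/jcr/sv/1.0";

    private final Session session;
    private final int batchSize;
    private int nodeCount = 0;

    public BatchSavingImportHandler(Session session, String parentPath,
                                    int batchSize) throws RepositoryException {
        this.session = session;
        this.batchSize = batchSize;
        // Forward all SAX events to the repository's own import handler.
        setContentHandler(session.getImportContentHandler(
                parentPath, ImportUUIDBehavior.IMPORT_UUID_CREATE_NEW));
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        super.endElement(uri, localName, qName);
        // Every completed sv:node element corresponds to one imported node.
        if (SV_URI.equals(uri) && "node".equals(localName)
                && ++nodeCount % batchSize == 0) {
            try {
                // Persist the batch so the transient space stays bounded.
                session.save();
            } catch (RepositoryException e) {
                throw new SAXException(e);
            }
        }
    }
}

The system-view XML would then be streamed through a namespace-aware SAX 
parser with this filter set as its content handler, followed by one final 
save() for the last partial batch.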

Steven Singer
RAD International Ltd.

