jackrabbit-users mailing list archives

From chewy_fruit_loop <chewy_fruit_l...@yahoo.com>
Subject Re: persistance
Date Fri, 14 Sep 2007 08:49:40 GMT

Hmmm... that's what I'm doing.

Here's the code that kicks off the import:

  ContentHandler cHandler =
  HandlerWrapper handler = new HandlerWrapper(cHandler,session);

  XMLReader reader = XMLReaderFactory.createXMLReader();
  WorkspaceImpl wsp = (WorkspaceImpl) session.getWorkspace();
  session.getRepository().login(login,  ROOT_NODE);
  reader.parse(new org.xml.sax.InputSource(getXMLStream()));  

HandlerWrapper is my extension of ContentHandler in a very similar vein to 
this post 
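
For readers without the linked post, here is a minimal sketch of the kind of wrapper described: a SAX handler that forwards every event to a delegate and runs a callback after every N start elements. It is not the actual HandlerWrapper. The names BatchingHandler, FlushAction and BatchingHandlerDemo are hypothetical, and the callback stands in for whatever save call the real code makes. The sketch only shows the counting and delegation, and says nothing about when Jackrabbit actually writes to the persistent store.

```java
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical callback; in a JCR import it might wrap a save call. */
interface FlushAction {
    void flush() throws Exception;
}

/** Forwards all SAX events to a delegate and flushes every batchSize start elements. */
class BatchingHandler extends XMLFilterImpl {
    private final int batchSize;
    private final FlushAction action;
    private int pending = 0;

    BatchingHandler(ContentHandler delegate, int batchSize, FlushAction action) {
        // XMLFilterImpl forwards every ContentHandler event to this delegate.
        setContentHandler(delegate);
        this.batchSize = batchSize;
        this.action = action;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, localName, qName, atts);
        if (++pending >= batchSize) {
            runFlush();
        }
    }

    @Override
    public void endDocument() throws SAXException {
        super.endDocument();
        // Flush whatever is left over from the last partial batch.
        if (pending > 0) {
            runFlush();
        }
    }

    private void runFlush() throws SAXException {
        try {
            action.flush();
        } catch (Exception e) {
            throw new SAXException(e);
        }
        pending = 0;
    }
}

class BatchingHandlerDemo {
    /** Parses the XML with a no-op delegate and returns how many times the callback ran. */
    static int countFlushes(String xml, int batchSize) throws Exception {
        AtomicInteger flushes = new AtomicInteger();
        BatchingHandler handler =
                new BatchingHandler(new DefaultHandler(), batchSize, flushes::incrementAndGet);
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return flushes.get();
    }

    public static void main(String[] args) throws Exception {
        // Five start elements with a batch size of 2: two full batches plus one leftover.
        System.out.println(countFlushes("<root><a/><b/><c/><d/></root>", 2)); // prints 3
    }
}
```

In the import described here, the delegate would be the repository's import ContentHandler and the batch size would be 500.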

I know that in theory this should work...
Once the repository has been created on disk (about 6.5 MB), its size
does not increase until the endDocument method has been called. With a
typical XML document I doubt you'd even notice this, but the documents
that will be going into the importer range from hundreds of kilobytes
to 70+ MB in size.

I had the notion that this could be Derby holding on to the transactions
until a commit was issued to it, so I switched to the Oracle persistence
manager and got exactly the same result, only significantly slower.

Is there a way to set the persistence manager to write as it goes?

I've had to give the JVM a maximum heap of 1.5 GB just so I can get to
the end of the document, and I've also had to turn off Lucene as that was
making the heap overrun and kill the program (which incidentally means I
now have to work out how to generate an index for the repository after the
import has finished, as there's not enough memory available on a win32 system
to allocate 2 GB to the JVM, but that's another story).

I really want this to be a d'oh moment, but I'm getting an uneasy feeling that
it's not...

Florent Guillaume wrote:
> If you import that big a file, you should import directly into the 
> workspace and not in the session, without going through the transient 
> space and using lots of memory.
> So use Workspace.getImportContentHandler or Workspace.importXML, not the 
> Session methods. Read the JSR-170 for the benefits.
> Florent
> chewy_fruit_loop wrote:
>> I'm currently trying to import an XML file into a bog standard empty
>> repository.
>> The problem is the file is 72.5 MB containing around 200,000 elements (yes
>> they are all required).  This is currently taking about 90 mins (give or
>> take) to get into Derby, and that's with indexing off.
>> The time wouldn't be such an issue if it didn't use 1.7 GB of RAM.
>> I've decorated a ContentHandler so it calls :
>> root.update(<workspace name>)
>> root.save()
>> where root is the root node from the tree.
>> This is being called after every 500 start elements.  The save just
>> doesn't seem to flush the contents that have been parsed to the
>> persistent store.  This is the same if I use Derby or Oracle as storage.
>> The only time things seem to start to be persisted is when the
>> endDocument is hit.
>> Have I missed something blindingly obvious here?  I really don't mind
>> everyone having a bit of a chuckle at me, I just want to get this sorted
>> out.
>> thanks
> -- 
> Florent Guillaume, Director of R&D, Nuxeo
> Open Source Enterprise Content Management (ECM)
> http://www.nuxeo.com   http://www.nuxeo.org   +33 1 40 33 79 87

View this message in context: http://www.nabble.com/persistance-tf4430069.html#a12671085
Sent from the Jackrabbit - Users mailing list archive at Nabble.com.
