lucene-java-user mailing list archives

From Tatu Saloranta <t...@hypermall.net>
Subject Re: OutOfMemoryException while Indexing an XML file
Date Sat, 15 Feb 2003 01:18:13 GMT
On Friday 14 February 2003 07:27, Aaron Galea wrote:
> I had this problem when using xerces to parse xml documents. The problem I
> think lies in the Java garbage collector. The way I solved it was to create

It's unlikely that the GC is the culprit. Current collectors are good at
purging unreachable objects, and the JVM only throws OutOfMemoryError when it
really has no other choice.
Usually it's the application that holds dangling references to objects that
are no longer useful, which prevents the GC from collecting them.

However, it's worth noting that Xerces (and DOM parsers in general) generally
use more memory than the input XML files they process. This is because they
usually have to keep the whole document structure in memory, and there is
overhead on top of the text segments. So memory use is likely to be at least
2 * input file size (files usually use UTF-8, which most of the time uses 1
byte per char; in memory, 16-bit Unicode (UCS-2) chars are used for
performance), plus some additional overhead for storing element structure
information and so on.

And since the default maximum Java heap size is 64 MB, big XML files can
cause problems.
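
To check whether a given file is likely to fit, something along these lines
could be used. It is only a sketch: the class and method names are made up
for illustration, the factor of 2 is the rough lower bound from above and
ignores per-node overhead, and Runtime.maxMemory() simply reports whatever
limit the JVM was started with (it can be raised with the -Xmx option).

```java
import java.io.File;

// Hypothetical helper, not part of Lucene or Xerces.
final class HeapCheck {
    // Lower bound only: assumes 1-byte UTF-8 chars become 2-byte chars in
    // memory. Element and attribute node overhead comes on top of this.
    static long minDomBytes(long fileBytes) {
        return 2L * fileBytes;
    }

    // True if the lower bound is below the given heap limit. A "true"
    // result is no guarantee; a "false" result means DOM will not fit.
    static boolean mightFit(long fileBytes, long maxHeapBytes) {
        return minDomBytes(fileBytes) < maxHeapBytes;
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        for (String name : args) {
            long size = new File(name).length();
            System.out.println(name + ": needs at least " + minDomBytes(size)
                + " bytes, max heap is " + maxHeap
                + (mightFit(size, maxHeap) ? "" : "  <-- too big for DOM"));
        }
    }
}
```

With a 64 MB heap, a 40 MB file already fails this check, since its text
alone would need about 80 MB.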

More likely, however, references to already-processed DOM trees are not being
nulled in a loop, or something like that. That's especially likely if running
one JVM process per item solves the problem.
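
For illustration, a loop that avoids this could look roughly like the sketch
below. I don't know what the actual indexing code looks like, so everything
here is an assumption: IndexLoop, indexAll() and extractText() are made-up
names, and the documents are passed in as strings only to keep the example
self-contained. The point is just that each Document is held in a variable
local to one iteration and is never stored anywhere longer-lived.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical indexing loop; extractText() stands in for whatever
// pulls the content out of the tree before it goes to the indexer.
final class IndexLoop {
    static String extractText(Document doc) {
        return doc.getDocumentElement().getTextContent();
    }

    // Returns the number of documents that had non-empty text.
    static int indexAll(List<String> xmlDocs) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            int indexed = 0;
            for (String xml : xmlDocs) {
                // 'doc' is local to this iteration, so the whole tree is
                // unreachable (and collectable) once the iteration ends.
                Document doc = builder.parse(new ByteArrayInputStream(
                    xml.getBytes(StandardCharsets.UTF_8)));
                String text = extractText(doc);
                // Pass 'text' to the indexer here. Do NOT add 'doc' or
                // its nodes to a long-lived list, map, or field.
                if (!text.isEmpty()) {
                    indexed++;
                }
            }
            return indexed;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(indexAll(Arrays.asList("<a>one</a>", "<b>two</b>")));
    }
}
```

If the real loop instead collects every parsed Document (or nodes taken from
it) in a list or a field, all the trees stay reachable and memory grows with
each file, which would match the symptoms described.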

> a shell script that invokes a java program for each xml file that adds it
> to the index.

-+ Tatu +-



