lucene-java-user mailing list archives

From "Rob Outar" <>
Subject RE: OutOfMemoryException while Indexing an XML file
Date Tue, 18 Feb 2003 12:48:59 GMT
We are aware of DOM limitations/memory problems, but I am using SAX to parse
the file and index elements and attributes in my content handler.
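For illustration, a minimal sketch of a SAX content handler that streams elements and attributes to an indexer without building a tree. The class name `StreamingIndexer`, the `index` helper, and the `sink` callback are hypothetical stand-ins (the real handler would add fields to a Lucene document instead); only the standard JAXP/SAX calls are real API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingIndexer extends DefaultHandler {
    // Hypothetical stand-in for the index writer: receives one entry at a time.
    private final Consumer<String> sink;
    // Buffer for the text of the element currently being read; reused, never grows per document.
    private final StringBuilder text = new StringBuilder();

    StreamingIndexer(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0);
        // Hand each attribute to the sink immediately instead of storing it.
        for (int i = 0; i < atts.getLength(); i++) {
            sink.accept(qName + "@" + atts.getQName(i) + "=" + atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver text in several chunks, so accumulate until endElement.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            sink.accept(qName + "=" + value);
        }
        text.setLength(0);
    }

    // Parses the XML string and pushes every element/attribute entry to the sink.
    public static void index(String xml, Consumer<String> sink) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new StreamingIndexer(sink));
    }
}
```

As long as the sink does not hold on to what it receives, memory use in a handler like this stays roughly independent of the file size.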



-----Original Message-----
From: Tatu Saloranta []
Sent: Friday, February 14, 2003 8:18 PM
To: Lucene Users List
Subject: Re: OutOfMemoryException while Indexing an XML file

On Friday 14 February 2003 07:27, Aaron Galea wrote:
> I had this problem when using xerces to parse xml documents. The problem I
> think lies in the Java garbage collector. The way I solved it was to

It's unlikely that GC is the culprit. Current ones are good at purging objects
that are unreachable, and only throw an OutOfMemory error when they really have
no other choice.
Usually it's the app that has some dangling references to objects that prevent
GC from collecting objects that are not useful any more.

However, it's good to note that Xerces (and DOM parsers in general)
use more memory than the input XML files they process; this is because they
usually have to keep the whole document structure in memory, and there is
overhead on top of the text segments. So it's likely to be at least 2 * input
file size (files usually use UTF-8, which most of the time uses 1 byte per
char; in memory, 16-bit UCS-2 chars are used for performance), plus some
additional overhead for storing element structure information and all that.
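The 2x figure can be checked with a small sketch that compares the UTF-8 size of some text with its size as 16-bit chars. The class and method names are made up for illustration, and the char figure deliberately ignores object overhead (and any compact-string optimization a newer JVM may apply), so treat it as a rough estimate only.

```java
import java.nio.charset.StandardCharsets;

public class TextFootprint {
    // Bytes the text occupies when encoded as UTF-8 (roughly its size on disk).
    public static long utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes the same text occupies as 16-bit chars: 2 bytes per char, no object overhead counted.
    public static long charBytes(String text) {
        return 2L * text.length();
    }

    public static void main(String[] args) {
        String sample = "<title>Lucene in memory</title>";
        System.out.println(utf8Bytes(sample) + " bytes as UTF-8, "
                + charBytes(sample) + " bytes as 16-bit chars");
    }
}
```

For mostly-ASCII XML the second number is twice the first, before any DOM node overhead is added.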

And since the default max. Java heap size is 64 megs, big XML files can cause
out-of-memory errors.
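To see which limit a given JVM is actually running with, something like the following sketch can help. `HeapLimit` is a made-up class name; `Runtime.maxMemory()` is standard API.

```java
public class HeapLimit {
    // Maximum amount of memory the JVM will attempt to use for the heap, in megabytes.
    public static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Starting it with a larger limit, for example `java -Xmx256m HeapLimit`, should report a value close to 256; the exact number can be slightly lower depending on the JVM.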

More likely, however, is that references to already processed DOM trees are
not nulled in a loop, or something like that? Especially if running one JVM
process per item solves the problem.
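A minimal sketch of the pattern that avoids this, assuming a DOM parser is used: each tree lives only in a loop-local variable, so it becomes unreachable as soon as the iteration ends. The class and method names are hypothetical, and counting elements merely stands in for whatever indexing work is done per document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class PerDocumentScope {
    // Parses each document in turn and keeps only a small summary (an element count).
    public static int countElements(List<String> xmlDocs) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        int total = 0;
        for (String xml : xmlDocs) {
            // 'doc' is local to this iteration and is never added to a field or collection.
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // "*" matches every element in the document, including the root.
            total += doc.getElementsByTagName("*").getLength();
            // After this point nothing references the tree, so GC is free to reclaim it.
        }
        return total;
    }
}
```

If the trees (or nodes from them) were instead stored in a list or a long-lived field, every processed document would stay reachable and the heap would fill up no matter how good the collector is.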

> a shell script that invokes a java program for each xml file that adds it
> to the index.

-+ Tatu +-

To unsubscribe, e-mail:
For additional commands, e-mail:
