lucene-java-user mailing list archives

From "Che Dong" <ched...@hotmail.com>
Subject Re: OutOfMemoryException while Indexing an XML file
Date Mon, 17 Feb 2003 00:34:05 GMT
Maybe you can use a SAX parser and index the XML source as a stream, instead of loading the whole document first.

Here is my demo:
http://www.chedong.com/tech/lucene_ext.tar.gz
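
The streaming approach suggested above could look roughly like the sketch below. It is not the code from the linked demo: the class name `SaxStreamingSketch`, the method `streamRecords`, and the `record` element name are all hypothetical, and it uses current Java syntax. The handler keeps only the text of the element being read and hands it to a callback, which is where a Lucene document would be built and added to the index.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxStreamingSketch {

    /** Streams the XML and passes the text of each recordTag element to the sink. */
    public static void streamRecords(InputStream in, String recordTag, Consumer<String> sink)
            throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside a record element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals(recordTag)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver text in several chunks, so append each one.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals(recordTag) && text != null) {
                    sink.accept(text.toString()); // an indexing call would go here
                    text = null;                  // drop the buffer before the next record
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<docs><record>first</record><record>second</record></docs>";
        List<String> seen = new ArrayList<>();
        streamRecords(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                "record", seen::add);
        System.out.println(seen); // prints [first, second]
    }
}
```

Memory use then depends on the size of the largest single record rather than on the size of the whole file.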

Che, Dong
----- Original Message ----- 
From: "Tatu Saloranta" <tatu@hypermall.net>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Saturday, February 15, 2003 9:18 AM
Subject: Re: OutOfMemoryException while Indexing an XML file


> On Friday 14 February 2003 07:27, Aaron Galea wrote:
> > I had this problem when using xerces to parse xml documents. The problem I
> > think lies in the Java garbage collector. The way I solved it was to create
> 
> It's unlikely that GC is the culprit. Current collectors are good at purging objects 
> that are unreachable, and only throw OutOfMemoryError when they really have 
> no other choice.
> Usually it's the app that has some dangling references to objects, which prevent 
> GC from collecting objects that are no longer useful.
> 
> However, it's good to note that Xerces (and DOM parsers in general) generally 
> use more memory than the input XML files they process; this is because they 
> usually have to keep the whole document structure in memory, and there is 
> overhead on top of the text segments. So it's likely to be at least 2 * input 
> file size (files usually use UTF-8, which most of the time uses 1 byte per 
> char; in memory, 16-bit UTF-16 chars are used for performance), plus some 
> additional overhead for storing element structure information and all that.
> 
> And since default max. java heap size is 64 megs, big XML files can cause 
> problems.
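
As a rough worked example of the estimate above: a 40 MB file that is mostly single-byte UTF-8 needs at least 80 MB as a DOM tree, which already exceeds a 64 MB heap. The sketch below just encodes that arithmetic. The class and method names are hypothetical, and the factor of 2 is only the lower bound given in the message, not a measured figure.

```java
class HeapEstimateSketch {

    /** Lower bound from the message: about 2 bytes in memory per byte of input. */
    public static long roughDomLowerBoundBytes(long fileBytes) {
        return 2 * fileBytes;
    }

    /** True if even the lower bound would not fit in the given heap. */
    public static boolean likelyExceedsHeap(long fileBytes, long maxHeapBytes) {
        return roughDomLowerBoundBytes(fileBytes) > maxHeapBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // maxMemory() reports the heap limit of the running JVM (affected by -Xmx).
        System.out.println("max heap (MB): " + Runtime.getRuntime().maxMemory() / mb);
        System.out.println(likelyExceedsHeap(40 * mb, 64 * mb)); // prints true
    }
}
```

The heap limit can be raised with the standard `-Xmx` launcher option, for example `java -Xmx256m ...`, although that only postpones the problem for larger files.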
> 
> More likely, however, references to already processed DOM trees are not 
> nulled in a loop, or something like that. This is especially likely if running 
> one JVM process per item solves the problem.
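
The point about dangling references can be illustrated with a minimal sketch, assuming the input files are processed in a loop. The names `DomLoopSketch` and `extractAll` are hypothetical, and strings stand in for files. Each `Document` is local to one iteration and only the extracted text is kept, so every tree becomes unreachable as soon as its iteration ends. Storing the `Document` objects themselves in a long-lived collection or field would keep all the trees alive at once.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DomLoopSketch {

    /** Parses each XML source in turn and keeps only its text, not the DOM tree. */
    public static List<String> extractAll(List<String> xmlSources) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        List<String> texts = new ArrayList<>();
        for (String xml : xmlSources) {
            // doc goes out of scope at the end of each iteration, so the
            // garbage collector is free to reclaim the whole tree.
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Copy out only what is needed (here, the concatenated text content).
            texts.add(doc.getDocumentElement().getTextContent());
        }
        return texts;
    }
}
```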
> 
> > a shell script that invokes a java program for each xml file that adds it
> > to the index.
> 
> -+ Tatu +-
> 
> 