lucene-java-user mailing list archives

From Ben Litchfield <>
Subject RE: OutOfMemoryException while Indexing an XML file/PdfParser
Date Wed, 19 Feb 2003 01:08:37 GMT

I am aware of the issues with parsing certain PDF documents.  I am
currently working on refactoring PDFBox to handle large documents; you
will see this in the next release.  I would like to thank everyone for
their feedback and for sending problem documents.

Ben Litchfield

On Tue, 18 Feb 2003, Pinky Iyer wrote:

> I am having a similar problem, but indexing PDF documents using the PDFBox
> parser (available at ). I get an exception saying "Exception in thread
> "main" java.lang.OutOfMemoryError". Has anybody implemented the above code?
> Any help appreciated!
> Thanks!
> PI
>  Rob Outar <> wrote: We are aware of DOM limitations/memory problems, but I
> am using SAX to parse the file and index elements and attributes in my
> content handler.
> Thanks,
> Rob
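Rob's stream-based approach can be sketched roughly as below. This is a minimal, hypothetical illustration using the JDK's built-in SAX support: the class and method names are mine, and the actual Lucene indexing call is omitted (collected text stands in for it), since the exact Lucene API in use here is not shown in the thread. The point is that a SAX handler sees the document as a stream of events and never holds the whole tree in memory.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxIndexSketch {
    // Stream-based handler: only the accumulated text is retained,
    // never the document tree itself.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // streamed text; nothing else kept
        }

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            // a real handler would add qName/attribute values to a
            // Lucene Document here (call omitted in this sketch)
        }
    }

    public static String extractText(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(
                       xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // prints "helloworld"
        System.out.println(
            extractText("<doc><title>hello</title><body>world</body></doc>"));
    }
}
```

Because memory use stays proportional to the handler's state rather than the document size, this is the usual fix when DOM parsing runs out of heap.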
> -----Original Message-----
> From: Tatu Saloranta []
> Sent: Friday, February 14, 2003 8:18 PM
> To: Lucene Users List
> Subject: Re: OutOfMemoryException while Indexing an XML file
> On Friday 14 February 2003 07:27, Aaron Galea wrote:
> > I had this problem when using Xerces to parse XML documents. The problem,
> > I think, lies in the Java garbage collector. The way I solved it was to create
> It's unlikely that GC is the culprit. Current collectors are good at purging
> objects that are unreachable, and only throw an OutOfMemoryError when they
> really have no other choice.
> Usually it's the app that has some dangling references to objects, which
> prevent GC from collecting objects that are no longer useful.
> However, it's good to note that Xerces (and DOM parsers in general) use more
> memory than the input XML files they process, because they usually have to
> keep the whole document structure in memory, with overhead on top of the
> text segments. So memory use is likely to be at least 2 * input file size
> (files usually use UTF-8, which most of the time takes 1 byte per char; in
> memory, 16-bit UTF-16 chars are used for performance), plus some additional
> overhead for storing element structure information and all that. And since
> the default max Java heap size is 64 MB, big XML files can cause problems.
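Tatu's 2x estimate can be checked with quick arithmetic. The sketch below is illustrative only (class and method names are mine, and the figures are a lower bound, not a measurement): an ASCII-heavy UTF-8 file of N bytes becomes at least 2*N bytes of character data on the heap, before any per-node DOM overhead is counted.

```java
public class DomMemoryEstimate {
    // Lower-bound heap needed just for the character data of an
    // ASCII-heavy XML file: 1 byte/char on disk -> 2 bytes/char in memory,
    // since Java chars are 16-bit. DOM node overhead comes on top of this.
    public static long minCharBytes(long utf8FileBytes) {
        return 2 * utf8FileBytes;
    }

    public static void main(String[] args) {
        long fileBytes = 50L * 1024 * 1024;        // a 50 MB XML file
        long heapBytes = minCharBytes(fileBytes);  // at least 100 MB of chars
        System.out.println(heapBytes / (1024 * 1024) + " MB"); // prints "100 MB"
    }
}
```

Against the 64 MB default heap Tatu mentions, a 50 MB input already needs over 100 MB for text alone, which is why large files fail even before structural overhead is added.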
> More likely, however, is that references to already-processed DOM trees are
> not being nulled out in a loop, or something like that; that would
> especially fit if running one JVM process per item solves the problem.
> > a shell script that invokes a java program for each xml file that adds it
> > to the index.
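The failure mode Tatu suspects can be sketched as below. This is a hypothetical illustration (names are mine, and a char array stands in for an expensive parsed DOM tree): a loop that accumulates every tree in a long-lived collection keeps them all reachable, so GC cannot reclaim any of them, while scoping each tree to one iteration lets it be collected before the next file is parsed. That would also explain Aaron's workaround, since a fresh JVM per file discards all references by construction.

```java
import java.util.ArrayList;
import java.util.List;

public class LoopReferenceSketch {
    // Anti-pattern: every parsed tree stays reachable via this list,
    // so heap use grows with each file until OutOfMemoryError.
    static List<Object> allTrees = new ArrayList<>();

    // Stand-in for an expensive parse producing a large DOM tree.
    static Object parse(String file) {
        return new char[1_000_000];
    }

    public static int indexAllLeaky(String[] files) {
        for (String f : files) {
            allTrees.add(parse(f)); // dangling references accumulate
        }
        return allTrees.size();
    }

    public static int indexAll(String[] files) {
        int indexed = 0;
        for (String f : files) {
            Object tree = parse(f); // local: unreachable after this iteration
            // ... index `tree` here, then let it go out of scope ...
            indexed++;
        }                           // GC may reclaim each tree from here on
        return indexed;
    }

    public static void main(String[] args) {
        // prints "3"
        System.out.println(indexAll(new String[] {"a.xml", "b.xml", "c.xml"}));
    }
}
```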
> -+ Tatu +-
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
> ---------------------------------------------------------------------


