lucene-java-user mailing list archives

From Pinky Iyer <pinkyi...@yahoo.com>
Subject RE: OutOfMemoryException while Indexing an XML file/PdfParser
Date Wed, 19 Feb 2003 17:11:12 GMT

Thanks Matt, I am working on using xpdf as you suggested. I get an error at the following
statement.
Could you elaborate on the statement
String[] cmd = new String[] { 
        PATH_TO_XPDF, 
        "-enc", "UTF-8", "-q", filename, "-"}; 
I defined PATH_TO_XPDF as "c:/xpdf/pdftotext.exe", with the rest remaining the same. I get an error
about incompatible types, file and string, which I could not understand!
Thanks again!
Pinky
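
One possible cause, assuming `filename` is declared as a `java.io.File` (the post does not show its declaration): every element of a `String[]` initializer must itself be a `String`, so a `File` has to be converted explicitly. The class and method names below are hypothetical and only illustrate the conversion.

```java
import java.io.File;

public class XpdfCommand {
    // Path taken from the message above; adjust to the local install.
    static final String PATH_TO_XPDF = "c:/xpdf/pdftotext.exe";

    // Builds the pdftotext argument array from a File (hypothetical helper).
    static String[] buildCommand(File pdf) {
        return new String[] {
            PATH_TO_XPDF,
            "-enc", "UTF-8", "-q",
            pdf.getAbsolutePath(), // a File is not a String; convert it explicitly
            "-"                    // "-" tells pdftotext to write the text to stdout
        };
    }

    public static void main(String[] args) {
        String[] cmd = buildCommand(new File("sample.pdf"));
        System.out.println(String.join(" ", cmd));
    }
}
```

If `filename` is already a `String`, the original statement compiles as written and the error is probably on a different line.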
Matt Tucker <matt@jivesoftware.com> wrote:

Rob,

We ran into this problem too, and our solution was to use a native PDF
text extractor (PDFBox just can't seem to handle large PDFs well).
Basically, we try to parse with the native app first, and if that fails,
we parse with PDFBox. We used:

http://www.foolabs.com/xpdf/

A code snippet for using this is:

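import java.io.BufferedInputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;

String[] cmd = new String[] {
    PATH_TO_XPDF,
    "-enc", "UTF-8", "-q", filename, "-"};
// Launch pdftotext; the trailing "-" makes it write the text to stdout.
Process p = Runtime.getRuntime().exec(cmd);
BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
StringWriter out = new StringWriter();
char[] buf = new char[512];
int len;
// Copy the process output into the writer until end of stream.
while ((len = reader.read(buf)) >= 0) {
    out.write(buf, 0, len);
}
reader.close();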
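
The "native tool first, PDFBox as a fallback" approach described above could be sketched roughly as follows. `FallbackExtractor` and `TextExtractor` are hypothetical names, not part of xpdf, PDFBox, or Lucene, and the individual extractors are left to the caller because the message does not show the PDFBox call.

```java
import java.util.Arrays;
import java.util.List;

public class FallbackExtractor {
    /** Hypothetical abstraction over one way of extracting text from a PDF. */
    public interface TextExtractor {
        String extract(String filename) throws Exception;
    }

    private final List<TextExtractor> extractors;

    public FallbackExtractor(TextExtractor... extractors) {
        this.extractors = Arrays.asList(extractors);
    }

    /** Tries each extractor in order and returns the first successful result. */
    public String extract(String filename) {
        Exception last = null;
        for (TextExtractor extractor : extractors) {
            try {
                return extractor.extract(filename);
            } catch (Exception ex) {
                last = ex; // remember the failure and try the next extractor
            }
        }
        throw new IllegalStateException("all extractors failed for " + filename, last);
    }

    public static void main(String[] args) {
        FallbackExtractor fx = new FallbackExtractor(
            f -> { throw new java.io.IOException("native tool failed"); },
            f -> "text of " + f); // stand-in for the second parser
        System.out.println(fx.extract("a.pdf"));
    }
}
```

In real use the first extractor would wrap the `Runtime.exec` snippet and the second would wrap the PDFBox parser; catching only `Exception` means an `OutOfMemoryError` from the second parser is still not handled.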

Regards,
Matt

> -----Original Message-----
> From: Pinky Iyer [mailto:pinkyiyer@yahoo.com] 
> Sent: Tuesday, February 18, 2003 5:23 PM
> To: Lucene Users List
> Subject: RE: OutOfMemoryException while Indexing an XML file/PdfParser
> 
> 
> 
> I am having a similar problem, but indexing PDF documents using the
> PDFBox parser (available at www.pdfbox.com). I get an
> exception saying "Exception in thread "main"
> java.lang.OutOfMemoryError". Has anybody implemented the
> above code? Any help appreciated. Thanks!
> PI
>
> Rob Outar wrote:
> We are aware of DOM limitations/memory problems, but I am using SAX
> to parse the file and index elements and attributes in my content handler.
> 
> Thanks,
> 
> Rob
> 
> -----Original Message-----
> From: Tatu Saloranta [mailto:tatu@hypermall.net]
> Sent: Friday, February 14, 2003 8:18 PM
> To: Lucene Users List
> Subject: Re: OutOfMemoryException while Indexing an XML file
> 
> 
> On Friday 14 February 2003 07:27, Aaron Galea wrote:
> > I had this problem when using xerces to parse xml documents. The
> > problem I think lies in the Java garbage collector. The way I solved
> > it was to create
> 
> It's unlikely that GC is the culprit. Current ones are good 
> at purging objects that are unreachable, and only throw 
> OutOfMem exception when they really have no other choice. 
> Usually it's the app that has some dangling references to 
> objects that prevent GC from collecting objects not useful any more.
> 
> However, it's good to note that Xerces (and DOM parsers in
> general) generally use more memory than the input XML files
> they process; this is because they usually have to keep the
> whole document structure in memory, and there is overhead on top
> of text segments. So it's likely to be at least 2 * input
> file size (files usually use UTF-8, which most of the time
> uses 1 byte per char; in memory, 16-bit Unicode chars are
> used for performance), plus some additional overhead for
> storing element structure information and all that.
> 
> And since default max. java heap size is 64 megs, big XML 
> files can cause problems.
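
The estimate above can be turned into a rough back-of-the-envelope check. This is only a sketch of the arithmetic: it assumes mostly-ASCII UTF-8 input (1 byte per char on disk, 2 bytes per Java `char` in memory) and ignores the per-node overhead entirely, so the real requirement is higher. `DomHeapEstimate` is a hypothetical name.

```java
public class DomHeapEstimate {
    // Lower bound for the text alone: each 1-byte UTF-8 char becomes a 2-byte Java char.
    static long minTextBytesInMemory(long utf8FileBytes) {
        return 2L * utf8FileBytes;
    }

    public static void main(String[] args) {
        long fileBytes = 40L * 1024 * 1024;           // example: a 40 MB XML file
        long needed = minTextBytesInMemory(fileBytes);
        long maxHeap = Runtime.getRuntime().maxMemory(); // heap limit of the running JVM

        System.out.println("Text alone needs at least " + needed / (1024 * 1024) + " MB");
        System.out.println("JVM max heap is " + maxHeap / (1024 * 1024) + " MB");
        if (needed > maxHeap) {
            System.out.println("A DOM of this file cannot fit; raise -Xmx or stream the file.");
        }
    }
}
```

With a 64 MB limit, any file much over 32 MB fails this check before structural overhead is even counted; the limit can be raised with the `-Xmx` JVM option.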
> 
> More likely, however, is that references to already processed
> DOM trees are not nulled in a loop, or something like that.
> Especially if running one JVM process per item solves the problem.
> 
> > a shell script that invokes a java program for each xml file that adds
> > it to the index.
> 
> -+ Tatu +-
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 

