pdfbox-users mailing list archives

From Raymi <ra...@gmx.ch>
Subject Extract Text from complex PDF
Date Mon, 07 Dec 2009 13:48:21 GMT
Hi there,

I'm using PDFTextStripper to extract text from PDFs. Among these PDFs 
there are some documents that represent maps. Each such document is 
about 90MB. These maps have very little text but a great many small 
graphic objects (I don't know how to verify this, but that is how it 
looks when I open a document in Adobe Reader). This causes the PDFParser 
to create millions of COSFloat objects, and the JVM eventually crashes 
with an OutOfMemoryError.

While I understand that it is not possible to extract text without prior 
parsing (as noted in the FAQ), I wonder whether it would be possible to 
simply skip objects that contain no textual content? The PDF tree would 
be incomplete, but I only want to extract the text.
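
A rough sketch of the kind of skipping meant here, assuming the page 
content stream is already available as decoded text. It keeps only the 
tokens between the BT and ET operators (the text objects) and drops all 
path and graphics operators in between. TextBlockFilter and 
keepTextBlocks are hypothetical names, not part of PDFBox, and the 
tokenizer is deliberately naive: it handles literal (...) strings but 
ignores hex strings, comments and inline images.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: illustrates the idea only.
public class TextBlockFilter {

    /** Returns the BT ... ET text objects of a decoded content stream. */
    public static String keepTextBlocks(String contentStream) {
        List<String> kept = new ArrayList<>();
        StringBuilder token = new StringBuilder();
        boolean inText = false; // true between BT and ET
        int parenDepth = 0;     // > 0 while inside a (literal string)
        int n = contentStream.length();

        // Iterate one position past the end so a trailing token is flushed.
        for (int i = 0; i <= n; i++) {
            char c = i < n ? contentStream.charAt(i) : ' ';

            if (parenDepth > 0) {
                // Inside a string: whitespace and "ET" are not significant.
                token.append(c);
                if (c == '\\' && i + 1 < n) {
                    token.append(contentStream.charAt(++i)); // escaped char
                } else if (c == '(') {
                    parenDepth++;
                } else if (c == ')') {
                    parenDepth--;
                }
            } else if (c == '(') {
                parenDepth = 1;
                token.append(c);
            } else if (Character.isWhitespace(c)) {
                if (token.length() > 0) {
                    String t = token.toString();
                    if (t.equals("BT")) {
                        inText = true;
                        kept.add(t);
                    } else if (inText) {
                        kept.add(t);
                        if (t.equals("ET")) {
                            inText = false;
                        }
                    }
                    token.setLength(0);
                }
            } else {
                token.append(c);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String stream = "0 0 m 10 10 l S BT /F1 12 Tf 72 700 Td (Main St) Tj ET 5 5 m 6 6 l S";
        System.out.println(keepTextBlocks(stream));
    }
}
```

Note that this discards graphics state set outside the text objects 
(for example cm transforms), so text positions would no longer be 
reliable. It only shows that the text operators could in principle be 
separated from the drawing operators without keeping an object for 
every number in the stream.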

Thanks in advance,

Dominik
P.S.: Unfortunately I cannot provide an example of such a document 
because they contain confidential content.
