pdfbox-dev mailing list archives

From "Timo Boehme (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-1226) Counting pages of a PDF gives OutOfMemoryError
Date Fri, 10 Feb 2012 18:16:59 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205594#comment-13205594 ]

Timo Boehme commented on PDFBOX-1226:
-------------------------------------

The file is quite special in that it contains nearly 2 million objects and 925354 pages. With
the current parsers the whole file is read before one can use the API to get the page count.
This is also true for PDFBOX-1199 because, for compatibility with the existing code base,
it has to parse all objects (however, there is a 'parseMinimalCatalog' mode which can be used
in this case to parse only a smaller number of objects). So far PDFBOX-1199 has not been landed
because encryption is currently not supported. You would have to build your own library using
SVN and the files in PDFBOX-1199.

For the time being, in order to parse the sample file you will need approx. 2 GB of heap space
(tested on my machine; it took 153 seconds to parse and return the page count). With less
memory, GC will take most of the processing time.
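
A minimal sketch of checking the JVM heap limit before attempting such a load, assuming the
roughly 2 GB figure above. The class name HeapCheck, the constant, and the helper method are
hypothetical and not part of PDFBox; only the standard Runtime.maxMemory() call is used.

```java
public class HeapCheck {
    // Assumption: about 2 GB, taken from the figure reported for the sample file.
    static final long REQUIRED_HEAP_BYTES = 2L * 1024 * 1024 * 1024;

    // Hypothetical helper: true if the heap limit meets the required size.
    static boolean hasEnoughHeap(long maxHeapBytes, long requiredBytes) {
        return maxHeapBytes >= requiredBytes;
    }

    public static void main(String[] args) {
        // maxMemory() is the most heap the JVM will try to use; on some JVMs
        // it can report a value somewhat below the -Xmx setting.
        long maxHeap = Runtime.getRuntime().maxMemory();
        long maxHeapMb = maxHeap >> 20;
        if (hasEnoughHeap(maxHeap, REQUIRED_HEAP_BYTES)) {
            System.out.println("Heap limit looks sufficient: " + maxHeapMb + " MB");
        } else {
            System.out.println("Heap limit is only " + maxHeapMb
                    + " MB; consider a larger -Xmx value before loading the file");
        }
    }
}
```

The limit itself is set when the JVM starts, for example with the standard -Xmx option
(java -Xmx2g or higher), so a check like this can only report the problem, not fix it.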
                
> Counting  pages of a PDF gives OutOfMemoryError
> -----------------------------------------------
>
>                 Key: PDFBOX-1226
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1226
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDFReader
>    Affects Versions: 1.6.0
>         Environment: Windows 7 / Windows XP
>            Reporter: Anca Zapuc
>         Attachments: Big_no_pages.7z
>
>
> I have a PDF (397 MB) and I am trying to count the pages.
> I am able to open the PDF with AdobeReader 9, but not with FoxitReader.
> Code:
>     PDDocument doc = null;
>     File temp = null;
>     RandomAccessFile rand = null;
>     int nr = 0;
>     try {
>         //create a temporary file needed by the PDFBox when dealing with PDFs really really large
>         temp = new File("e:/temp.tmp");
>         //using random access file needed for PDF really large
>         rand = new RandomAccessFile(temp, "rw");
>         doc = PDDocument.load(file, rand);
>         nr = doc.getNumberOfPages();
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
> Got following exception:
> org.apache.pdfbox.exceptions.WrappedIOException
> 	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240)
> 	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1069)
> 	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1022)
> 	at PDFBoxExample.getHugeNrOfFiles(PDFBoxExample.java:36)
> 	at PDFBoxExample.main(PDFBoxExample.java:258)
> Caused by: java.lang.OutOfMemoryError: Java heap space
> 	at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
> 	at java.lang.StringBuffer.<init>(StringBuffer.java:79)
> 	at org.apache.pdfbox.pdfparser.BaseParser.readString(BaseParser.java:1121)
> 	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:402)
> 	at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
> 	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
> 	... 4 more
> I attached the PDF.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
