pdfbox-dev mailing list archives

From "Dave Engberg (JIRA)" <j...@apache.org>
Subject [jira] Created: (PDFBOX-506) PDFBox can't parse PDF documents from jstor.org
Date Wed, 19 Aug 2009 19:30:15 GMT
PDFBox can't parse PDF documents from jstor.org
-----------------------------------------------

                 Key: PDFBOX-506
                 URL: https://issues.apache.org/jira/browse/PDFBOX-506
             Project: PDFBox
          Issue Type: Bug
            Reporter: Dave Engberg
         Attachments: siegel.pdf

The academic repository JStor makes papers available in PDF format.  The PDFs report this origin information:
  Content creator:  JstorPdfGenerator v1.0
  PDF producer:  iText 2.0.6 (by lowagie.com)

These PDFs open fine in Acrobat, Preview, FoxIt, etc., but they throw an exception in PDFBox:

Exception in thread "main" java.io.IOException: Error: Expected to read '%%EOF' instead started reading '1'
	at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1005)
	at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:456)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:739)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:706)
	at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:393)
	at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:369)
	at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:355)


I traced through the code, and it appears that PDFBox rejects these files because they contain a
'startxref' that is not followed by a '%%EOF' marker two lines later:

...
startxref
613364
1 0 obj
...
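
To check other files for the same layout, something like the following rough sketch could help. It is not PDFBox code: the class and method names are made up for illustration, and it assumes the relevant part of the file can be treated as plain ASCII text. For every 'startxref' keyword it reports whether the token after the offset is '%%EOF'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class StartXrefScan {

    // For each "startxref" keyword, records whether the token that follows
    // the byte offset begins with "%%EOF".
    static List<Boolean> eofFollowsStartXref(String pdfText) {
        List<Boolean> results = new ArrayList<>();
        String keyword = "startxref";
        int pos = pdfText.indexOf(keyword);
        while (pos >= 0) {
            // Text after the keyword: expected to be "<offset> <next token> ..."
            String rest = pdfText.substring(pos + keyword.length()).trim();
            String[] tokens = rest.split("\\s+", 3);
            results.add(tokens.length >= 2 && tokens[1].startsWith("%%EOF"));
            pos = pdfText.indexOf(keyword, pos + 1);
        }
        return results;
    }

    public static void main(String[] args) {
        // Layout seen in the JStor files: an object follows the offset directly.
        System.out.println(eofFollowsStartXref("startxref\n613364\n1 0 obj\n"));
        // Conventional layout: the offset is followed by the %%EOF marker.
        System.out.println(eofFollowsStartXref("startxref\n613364\n%%EOF\n"));
    }
}
```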


Here's a small patch that will accept files that are missing the '%%EOF' marker after the startxref:


Index: src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java
===================================================================
--- src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java	(revision 802578)
+++ src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java	(working copy)
@@ -453,11 +453,9 @@
             {  
                 parseStartXref();
                 //verify that EOF exists 
-                String eof = readExpectedString( "%%EOF" );
-                if( eof.indexOf( "%%EOF" )== -1 && !pdfSource.isEOF() )
-                {
-                    throw new IOException( "expected='%%EOF' actual='" + eof + "' next=" + readString() +
-                            " next=" +readString() );
+                int c = pdfSource.peek();
+                if (c == '%') {
+                    readExpectedString("%%EOF");
                 }
                 isEndOfFile = true; 
             }
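
For anyone who wants to try the peek-then-read idea outside the parser, here is a standalone sketch of the same logic. It is only an approximation under a few assumptions: java.io.PushbackInputStream stands in for pdfSource, the class and method names are hypothetical, and the whitespace skipping before the peek is my own addition (I have not confirmed where parseStartXref() leaves the stream).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the patched check, not PDFBox code.
final class LenientEofCheck {

    // Consumes "%%EOF" only if the next non-whitespace byte is '%'.
    // Returns true if the marker was read, false if it is absent.
    static boolean consumeOptionalEof(PushbackInputStream in) throws IOException {
        int c = in.read();
        while (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
            c = in.read(); // assumption: whitespace may precede the marker
        }
        if (c == -1) {
            return false; // end of input, nothing to verify
        }
        in.unread(c); // emulate peek(): put the byte back
        if (c != '%') {
            return false; // e.g. "1 0 obj" follows the offset; tolerate it
        }
        for (char expected : "%%EOF".toCharArray()) {
            if (in.read() != expected) {
                throw new IOException("expected '%%EOF' after startxref offset");
            }
        }
        return true;
    }

    // Convenience wrapper for trying the check on a string.
    static boolean markerPresent(String tail) {
        byte[] bytes = tail.getBytes(StandardCharsets.US_ASCII);
        try {
            return consumeOptionalEof(new PushbackInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(markerPresent("\n%%EOF\n"));   // true
        System.out.println(markerPresent("\n1 0 obj\n")); // false
    }
}
```

As in the patch, a '%' that does not start '%%EOF' is still treated as an error; only a completely absent marker is tolerated.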


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

