pdfbox-dev mailing list archives

From "Chris Bowditch (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PDFBOX-504) Can't Parse any PDF using IBM JDK
Date Thu, 13 Aug 2009 11:54:14 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Bowditch updated PDFBOX-504:
----------------------------------

    Attachment: ibm-parse-bug.patch

The attached patch resolves the problem. However, it feels like a bit of a bodge. I tried
to work out why certain characters were being dropped from the String to Bytes conversion
on the IBM JDK, but have so far failed. I experimented with different encodings but couldn't
find one that gave the same results as the Sun JDK. Perhaps someone knows a better way to fix
this. For now I will go with this hack, as this issue is critical for my customer.

> Can't Parse any PDF using IBM JDK
> ---------------------------------
>
>                 Key: PDFBOX-504
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-504
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: RedHat Linux IBM JDK
>            Reporter: Chris Bowditch
>            Priority: Critical
>         Attachments: ibm-parse-bug.patch, readable.pdf
>
>
> All PDFs (that I have tried) fail to parse using IBM JDK 1.5 on RedHat Linux. The error
> you receive is:
> Exception in thread "main" java.io.IOException: Error: Expected an integer type, actual='ãÃÃ'
>         at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1220)
>         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:493)
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:736)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:704)
>         at org.apache.pdfbox.PDFReader.parseDocument(PDFReader.java:323)
>         at org.apache.pdfbox.PDFReader.openPDFFile(PDFReader.java:286)
>         at org.apache.pdfbox.PDFReader.main(PDFReader.java:271)
> However, debugging shows that the actual (hidden) error is:
> java.io.IOException: Error: Expected an integer type, actual='ãÏÓ'
>         at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1220)
>         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:483)
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:736)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:704)
>         at org.apache.pdfbox.PDFReader.parseDocument(PDFReader.java:323)
>         at org.apache.pdfbox.PDFReader.openPDFFile(PDFReader.java:286)
>         at org.apache.pdfbox.PDFReader.main(PDFReader.java:271)
> The characters shown in the hidden message occur at the start of most PDF files that
> I have checked:
> %PDF-1.4
> %âãÏÓ
> 6 0 obj
> <</Filter /FlateDecode
> /Length 489
> >>
> stream
> Tracing the code, I can see the problem is down to the skipToNextObject() method in the
> PDFParser class. This method is new since v0.7.4.
> The code converts the array of 16 bytes to a String. The bytes for the characters âãÏÓ
> are read as negative numbers on both the Sun and IBM JDKs, but whereas on Sun the String
> created from the byte array contains the characters, on the IBM JDK these characters are
> missing from the String. So when you read back 16 characters, the stream offset is incorrect.
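
The byte-to-String behaviour described above can be illustrated with a small, self-contained sketch. It is not the attached patch and does not claim to explain the IBM JDK behaviour, which this thread leaves undiagnosed. It only shows one way to make a byte array and the String built from it stay the same length: decode with an explicit single-byte charset (ISO-8859-1) instead of relying on the platform default. The class and method names (Latin1RoundTrip, decode, encode) are hypothetical and do not come from PDFBox.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox.
class Latin1RoundTrip {

    // ISO-8859-1 maps every byte value 0x00-0xFF to exactly one char,
    // so the resulting String always has raw.length characters.
    // (StandardCharsets needs Java 7+; on JDK 1.5 the charset name
    // "ISO-8859-1" would have to be passed as a String instead.)
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Inverse of decode for any String produced by decode.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The bytes of "%âãÏÓ", a newline, then "6 0 obj".
        // Bytes above 0x7F are negative when stored in a Java byte.
        byte[] raw = {0x25, (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3,
                      0x0A, '6', ' ', '0', ' ', 'o', 'b', 'j'};

        String s = decode(raw);

        // One char per byte, so offsets computed from the String
        // match offsets in the underlying stream.
        System.out.println("same length: " + (s.length() == raw.length));
        System.out.println("round trip:  " + Arrays.equals(raw, encode(s)));
    }
}
```

With a charset chosen explicitly like this, the result no longer depends on the JVM's default encoding, which is one of the things that can differ between JDK vendors and locale settings.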

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

