pdfbox-dev mailing list archives

From "Peter Johnson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-4367) Error expected floating point number actual='18-5'
Date Wed, 07 Nov 2018 19:45:00 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678690#comment-16678690 ]

Peter Johnson commented on PDFBOX-4367:

We are running Tika 1.19.1. catchIntermediateIOExceptions does seem to be doing its job:
if I catch exceptions on the parse() method and then look at my content handler, the text I
am looking for is there, including text after the error page. I think this pretty much
solves this for me. Thanks!
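
For readers who want the shape of that workaround, here is a minimal stdlib-only sketch of the "keep going past the bad page, report the error at the end" pattern. It is not Tika's or PDFBox's actual implementation: PageExtractor, extractAll and the StringBuilder sink are hypothetical stand-ins for the parser and the content handler.

```java
import java.io.IOException;

public class PartialExtractionSketch {

    /** Hypothetical stand-in for a per-page text extractor. */
    interface PageExtractor {
        String extract(int pageIndex) throws IOException;
    }

    /**
     * Appends the text of every readable page to the caller-owned sink.
     * A failing page is skipped; the first failure is rethrown only after
     * all pages have been attempted, so the sink already holds the rest.
     */
    static void extractAll(int pageCount, PageExtractor extractor, StringBuilder sink)
            throws IOException {
        IOException firstError = null;
        for (int i = 0; i < pageCount; i++) {
            try {
                sink.append(extractor.extract(i)).append('\n');
            } catch (IOException e) {
                if (firstError == null) {
                    firstError = e; // remember it, but keep processing later pages
                }
            }
        }
        if (firstError != null) {
            throw firstError;
        }
    }

    public static void main(String[] args) {
        StringBuilder sink = new StringBuilder();
        try {
            // Page 1 simulates the corrupt page from the report.
            extractAll(3, i -> {
                if (i == 1) {
                    throw new IOException("Error expected floating point number");
                }
                return "text of page " + i;
            }, sink);
        } catch (IOException e) {
            System.err.println("parse reported: " + e.getMessage());
        }
        // The sink still contains the pages before and after the bad one.
        System.out.print(sink);
    }
}
```

The caller catches the exception and then reads the sink, which mirrors catching on parse() and then inspecting the content handler.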

> Error expected floating point number actual='18-5'
> --------------------------------------------------
>                 Key: PDFBOX-4367
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4367
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.12
>         Environment: Mac OS X Sierra
>            Reporter: Peter Johnson
>            Priority: Minor
> Able to repeat with command line.  Unfortunately, the only files that repeat this are
> from a customer, and contain sensitive information.  The file opens without error in Acrobat
> Reader and Mac Preview.  The desired result is that any corrupt portions of the PDF are skipped,
> so that we can use what text is extractable.
> Unfortunately, I still get an error when using the -force option.
> We get the following stack trace:
> {code:java}
> C02V390UHTD6:Downloads pjohnson$ java -jar pdfbox-app-2.0.12.jar ExtractText 16cccd9af5032a303774f7b87fb95076.pdf
> Nov 02, 2018 10:04:54 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
> WARNING: Corrupt object reference at offset 19727
> Exception in thread "main" java.io.IOException: Error expected floating point number
> at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:78)
> at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:110)
> at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:947)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:631)
> at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:174)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:510)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:237)
> at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:82)
> at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> Caused by: java.lang.NumberFormatException
> at java.math.BigDecimal.<init>(BigDecimal.java:494)
> at java.math.BigDecimal.<init>(BigDecimal.java:383)
> at java.math.BigDecimal.<init>(BigDecimal.java:806)
> at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:59)
> ... 14 more
> {code}
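
The "Caused by" frames above show java.math.BigDecimal rejecting the token, which the issue title gives as '18-5'. A minimal stdlib-only sketch reproducing that rejection follows. FloatTokenDemo and isParsable are hypothetical names for illustration; this does not reproduce the PDFBox code path.

```java
import java.math.BigDecimal;

public class FloatTokenDemo {

    /** Returns true when BigDecimal accepts the token as a number. */
    static boolean isParsable(String token) {
        try {
            new BigDecimal(token);
            return true;
        } catch (NumberFormatException e) {
            // A sign character in the middle of the digits is not valid syntax.
            return false;
        }
    }

    public static void main(String[] args) {
        // Token taken from the issue title, plus two well-formed tokens for contrast.
        for (String token : new String[] {"18-5", "18.5", "-18.5"}) {
            System.out.println(token + " parsable: " + isParsable(token));
        }
    }
}
```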

