pdfbox-dev mailing list archives

From "Tilman Hausherr (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-4311) Unable to parse some pdf's using pdfbox.
Date Tue, 04 Sep 2018 18:57:00 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603449#comment-16603449 ]

Tilman Hausherr commented on PDFBOX-4311:

The "text" you see is an image, so there is no text to extract. You can verify this by trying
to select, copy, and paste it in Adobe Reader.


So, sadly, there is nothing we can do this time. Btw the current version is 2.0.11 (but that
won't help either). Sorry for not having better news.
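
For callers who want to detect this case up front instead of silently getting an empty body, here is a minimal sketch of the check. It is an illustration under assumptions, not part of PDFBox: the class ScannedPageCheck and the method looksImageOnly are hypothetical names, and the sketch assumes the caller has already obtained the page's extracted text (for example from PDFTextStripper) and a count of the images on the page.

```java
public class ScannedPageCheck {

    /**
     * Returns true when a page yielded no extractable text but contains at
     * least one image, which is the typical pattern of a scanned document.
     * Both inputs are assumed to be gathered by the caller from the PDF
     * library; this helper (a hypothetical name) only makes the decision.
     */
    public static boolean looksImageOnly(String extractedText, int imageCount) {
        boolean noText = extractedText == null || extractedText.trim().isEmpty();
        return noText && imageCount > 0;
    }

    public static void main(String[] args) {
        // Hypothetical per-page results: whitespace-only text plus one image.
        System.out.println(looksImageOnly("  \n", 1));
        // A normal text page that also carries a logo image.
        System.out.println(looksImageOnly("Claim 123", 1));
        // An empty page with neither text nor images.
        System.out.println(looksImageOnly("", 0));
    }
}
```

A file flagged this way needs OCR before any text can be recovered; text extraction alone cannot produce it.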

> Unable to parse some pdf's using pdfbox.
> ----------------------------------------
>                 Key: PDFBOX-4311
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4311
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.9
>         Environment: Pdfbox - 2.0.9
> Pdfbox-tools - 2.0.9
> Java - 1.7
> Scala - 2.10.6
>            Reporter: Krishna Dheeraj
>            Priority: Major
>         Attachments: upload_user4024353_claimnr283909709_healthpartners_2018-06-17.pdf
> When I tried to convert the PDF file into HTML for parsing, the content in the body is
empty and no errors or exceptions are thrown. This happens for only a few files; the others
work as expected. I am attaching a file that we are unable to parse. Let us
know if any resolution is available.

