pdfbox-dev mailing list archives

From "Aleksandar Putnik (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-4210) Unable to extract the text from a PDF ("No Unicode mapping.." warnings)
Date Tue, 08 May 2018 11:24:00 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16467253#comment-16467253 ]

Aleksandar Putnik commented on PDFBOX-4210:

I select the text with the selection tool and then copy and paste it into a text editor.

For "Save As Other -> Text ..." I also get an empty text file.

"Find" also works (except for the text in quotes).

The version of Adobe Acrobat Reader DC is 2018.011.20038.

> Unable to extract the text from a PDF ("No Unicode mapping.." warnings)
> -----------------------------------------------------------------------
>                 Key: PDFBOX-4210
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4210
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.9
>            Reporter: Aleksandar Putnik
>            Priority: Major
>         Attachments: Testdokument.pdf
> I'm using Tika (v1.18, which means pdfbox 2.0.9) to extract the text from a PDF.
> I have a document from which Acrobat Reader (Adobe Acrobat Reader DC) can extract the text (although not with 100% precision).
> Besides the warnings "WARNING: No Unicode mapping for ... in font ArialMT", pdfbox 2.0.9 doesn't return anything.
> As you can see from the warning, the font in question is ArialMT. It uses a custom encoding and the PDF doesn't include a ToUnicode mapping. The font type is CID TrueType (this info is provided by "pdffonts").
> "pdftotext" also can't extract anything and only shows the error `Syntax Error: Unknown character collection 'Adobe-ArialMT'`.
> The PDF producer (used by the customer) is VintaSoft PDF .NET Plug-in v5.5.
> I would like to determine whether there is a bug in pdfbox or whether the PDF producer has to adjust and improve the "readability" of the PDF.
