pdfbox-dev mailing list archives

From "Aleksandar Putnik (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-3438) only garbage extracted, lots of warnings "No Unicode mapping..."
Date Mon, 07 May 2018 08:22:00 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465619#comment-16465619 ]

Aleksandar Putnik commented on PDFBOX-3438:


I see that the ticket is closed, but its title fits my problem well.

Unlike the original case, here I have a document from which Acrobat Reader (Adobe Acrobat
Reader DC) can extract the text (although not with 100% precision).

PDFBox 1.8.6 returns question marks, while PDFBox 2.0.9 returns nothing (apart from the warnings
about missing Unicode mappings).

The font in question is ArialMT with a custom encoding, and the PDF doesn't include a ToUnicode mapping.
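
For context, here is a minimal sketch of why a custom encoding without a ToUnicode map can leave an extractor with nothing to output. It models the lookup order the PDF specification describes for simple fonts: the ToUnicode map first, then the glyph name resolved through the Adobe Glyph List. The class `SimpleFontSketch` and all of its members are hypothetical and are not PDFBox API, the glyph names `g01` and `g02` are invented, and silently skipping unmapped codes is an assumption chosen to mirror the empty output described in this comment, not a statement about PDFBox internals. Whether the font here is a simple or a composite font is not stated; for a composite font the second step would depend on a known character collection instead of glyph names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified model of a simple (single-byte) PDF font.
// None of these names come from PDFBox; they exist only for this sketch.
final class SimpleFontSketch {
    // Contents of the font's ToUnicode map: character code -> Unicode text.
    // Left empty when the PDF has no ToUnicode entry.
    final Map<Integer, String> toUnicode = new HashMap<>();

    // The font's encoding: character code -> glyph name.
    final Map<Integer, String> encoding = new HashMap<>();

    // Tiny stand-in for the Adobe Glyph List: glyph name -> Unicode text.
    static final Map<String, String> GLYPH_LIST = new HashMap<>();
    static {
        GLYPH_LIST.put("A", "A");
        GLYPH_LIST.put("space", " ");
        GLYPH_LIST.put("adieresis", "\u00E4");
    }

    // Resolve one character code to Unicode, or empty if nothing applies.
    Optional<String> unicodeFor(int code) {
        // Step 1: an explicit ToUnicode entry wins.
        String direct = toUnicode.get(code);
        if (direct != null) {
            return Optional.of(direct);
        }
        // Step 2: fall back to the glyph name, if it is a known standard name.
        String glyphName = encoding.get(code);
        if (glyphName != null) {
            String viaName = GLYPH_LIST.get(glyphName);
            if (viaName != null) {
                return Optional.of(viaName);
            }
        }
        // Step 3: no mapping can be determined for this code.
        return Optional.empty();
    }

    // Concatenate the Unicode text for each code, skipping unmapped codes
    // (an assumption of this sketch, see the note above).
    String extract(int... codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            unicodeFor(code).ifPresent(sb::append);
        }
        return sb.toString();
    }
}
```

With a standard glyph name such as `A` in the encoding, the lookup succeeds through the glyph list. With invented names such as `g01` and no ToUnicode entries, every code falls through to step 3 and `extract` returns an empty string.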

`pdftotext` can't extract anything either; it only shows the error `Syntax Error: Unknown character
collection 'Adobe-ArialMT'`.

The PDF producer used by the customer (VintaSoft PDF .NET Plug-in v5.5) may also be the culprit,
but I'm really not sure about that.

What should be the next step here?



> only garbage extracted, lots of warnings "No Unicode mapping..."
> ----------------------------------------------------------------
>                 Key: PDFBOX-3438
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3438
>             Project: PDFBox
>          Issue Type: Wish
>          Components: Text extraction
>    Affects Versions: 2.0.2
>            Reporter: Oliver Steinau
>            Priority: Major
>         Attachments: PDFBOX-3438.diff, PDFBOX-3438.txt, test.pdf
> When I try to extract text from this PDF, I get lots of warnings "No Unicode mapping for ...", and as output I only get garbage.
> PDF file displays fine in Acrobat Reader, and pdftotext.exe will extract the text just fine.
> PDF file seems to have a Type-1 font embedded with a custom encoding.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org
