pdfbox-dev mailing list archives

From "Tilman Hausherr (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-4250) PDF File with embedded fonts: text extraction fails or returns junk characters
Date Sun, 24 Jun 2018 20:12:00 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16521599#comment-16521599 ]

Tilman Hausherr commented on PDFBOX-4250:
-----------------------------------------

I've seen this question before… could it be you asked it elsewhere?

> PDF File with embedded fonts: text extraction fails or returns junk characters
> ------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4250
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4250
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.9
>            Reporter: Bob Swanson
>            Priority: Major
>
> One of the people that I support created a PDF file from a LibreOffice document, and
> then misplaced the original document. I believed that I could use PDFBox to extract the text
> from the PDF, and at least provide that information to the user.
>
> When I ran the text extractor from the "app" jar on their PDF file, I got the following
> types of messages (many):
>
> ...
> Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for 7 (7) in font EXIRGE+Ubuntu
> Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for 8 (8) in font EXIRGE+Ubuntu
> Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for 1 (1) in font JTPICY+AndaleMono
> Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> ...
> 
> The resulting "txt" file is just binary numbers, unless the font is one of the "standard"
> fonts. I ran the debugger on the PDF file and saw that several fonts were embedded, and thus
> used low numbers for encoding (1, 2, 3, etc.).
>
> When viewed, the PDF file looks good, but nothing can be copied or pasted from the
> display (again, the standard fonts seem OK).
>
> The original file was of a sensitive nature, so I re-created the problem
> with a simpler file.
>
> Running on Ubuntu 16.04
> LibreOffice was used to "print" on the cups-pdf "printer" (which may be part of the
> problem).
>
> Text extract was attempted with pdfbox-app-2.0.9.jar
>
> PDF file is at:
>
> http://swansongrp.com/misc/mytest3.pdf
> 
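The symptom in the quoted report (embedded fonts that use low character codes and have no Unicode mapping) can be illustrated with a small Java sketch. This is a simplified model and not PDFBox's implementation: the class name ToUnicodeSketch, the decode helper, the code-to-text table, and the "emit the raw code" fallback are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ToUnicodeSketch {
    // Hypothetical helper: turn character codes into text using a
    // code -> Unicode table. When a code has no entry, fall back to the
    // raw code value (an assumed fallback, chosen to mimic "junk" output).
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String mapped = toUnicode.get(code);
            sb.append(mapped != null ? mapped : String.valueOf((char) code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Low codes, as described in the report (1, 2, 3, ...).
        int[] codes = {1, 2, 3};

        // With a mapping table, the codes decode to readable text.
        Map<Integer, String> table = new HashMap<>();
        table.put(1, "P");
        table.put(2, "D");
        table.put(3, "F");
        System.out.println(decode(codes, table)); // prints PDF

        // Without a table, the output is control characters, not text.
        String junk = decode(codes, new HashMap<>());
        junk.chars().forEach(c -> System.out.print(c + " ")); // raw codes 1 2 3
        System.out.println();
    }
}
```

In this model the page can still render correctly, because the glyphs are selected by code, while extraction has nothing to translate those codes into.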




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

