pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: "No Unicode mapping for" when extracting text from a PDF
Date Thu, 04 Jan 2018 19:28:10 GMT
On 04.01.2018 at 20:20, Luca Loiodice wrote:
> I am trying to migrate a project from a commercial Windows PDF library 
> to PDFBox, but I see reduced accuracy when I extract text from 
> arbitrary files.
>
> For example, I have a PDF (enclosed) that does not have Unicode 
> mappings for certain glyphs ... and so when I try to extract the text 
> using PDFBox I get the following:

Attachments are stripped by the mailing list; you'd need to upload the file to a file-sharing site.

>
> WARNING: No Unicode mapping for G70 (112) in font HAGLDF+MSTT31c5ed
> Jan 04, 2018 10:24:02 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
>
> The Windows library returns the correct text for the glyph with 
> the missing character mapping.
> Is there a way for me to add some code to make PDFBox or my program 
> figure out what the text is in this case?

Yes, but you'd need to build PDFBox from source, because G70 is not a 
standard glyph name. The change is described at the bottom of
https://issues.apache.org/jira/browse/PDFBOX-3962
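
To illustrate the kind of fallback involved: in the warning quoted above, the glyph name G70 and the character code 112 agree if the digits after "G" are read as hexadecimal (0x70 = 112). Assuming this font names its glyphs that way, and assuming the codes happen to follow ASCII (neither is guaranteed for a subsetted font), a guess at the text could be sketched as below. GlyphNameFallback and guessUnicode are hypothetical names, not part of the PDFBox API, and this is not necessarily the change described in PDFBOX-3962.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of PDFBox. */
public final class GlyphNameFallback {

    // Assumption: non-standard names look like "G" followed by the
    // two-digit hexadecimal character code, e.g. "G70" for code 112.
    private static final Pattern G_HEX = Pattern.compile("G([0-9A-Fa-f]{2})");

    private GlyphNameFallback() {
    }

    /**
     * Guesses a Unicode string for a glyph name of the form "Gxx".
     * Returns an empty Optional for any other name or for control codes.
     */
    public static Optional<String> guessUnicode(String glyphName) {
        if (glyphName == null) {
            return Optional.empty();
        }
        Matcher m = G_HEX.matcher(glyphName);
        if (!m.matches()) {
            return Optional.empty();
        }
        int code = Integer.parseInt(m.group(1), 16);
        // Skip control characters; they are unlikely to be real text.
        if (code < 0x20 || code == 0x7F) {
            return Optional.empty();
        }
        // Assumption: the code can be coerced directly to a Unicode
        // code point (true for ASCII, only a guess above 0x7F).
        return Optional.of(String.valueOf((char) code));
    }

    public static void main(String[] args) {
        System.out.println(guessUnicode("G70").orElse("?"));
    }
}
```

Whether the guessed character is the right one depends entirely on how the font was built, so the result should be checked against the rendered page before relying on it.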

Tilman

>
> Thanks for any help,
> Luca


