pdfbox-users mailing list archives

From Luca Loiodice <loiod...@csdisco.com>
Subject Re: "No Unicode mapping for" when extracting text from a PDF
Date Thu, 04 Jan 2018 19:57:09 GMT
Thanks.

Any chance I can add the conversion as a post-processing step and avoid
having to build from source?
I get the character code back as part of the extracted text, so I was
wondering whether I can load the font from the PDF
and use its code -> glyph name mapping to replace the code with the
character.

In that case I am not sure how to load the data from the font, but I
see that the debugger is able to do it.
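
A minimal sketch of the post-processing idea described above, in plain Java with no PDFBox dependency. It assumes that an unmapped glyph shows up in the extracted text as a char whose value equals the character code (for example code 112), and that the text being remapped was drawn entirely in the affected font, since a plain string no longer records which font each character came from. The class GlyphCodeRemapper and both of its methods are hypothetical names. In PDFBox 2.x the code -> glyph name map could probably be read from the simple font's Encoding (for instance via getCodeToNameMap()), but that part is left out here; the glyph name -> text table has to be written by hand, because a non-standard name such as G70 has no agreed meaning.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of PDFBox. */
final class GlyphCodeRemapper {

    private GlyphCodeRemapper() {
    }

    /**
     * Joins code -> glyph name (as a font encoding would provide it) with a
     * hand-made glyph name -> text table, giving a code -> text table.
     * Codes whose glyph name is missing from the table are left out.
     */
    static Map<Integer, String> buildCodeToText(Map<Integer, String> codeToGlyphName,
                                                Map<String, String> glyphNameToText) {
        Map<Integer, String> codeToText = new HashMap<>();
        for (Map.Entry<Integer, String> entry : codeToGlyphName.entrySet()) {
            String text = glyphNameToText.get(entry.getValue());
            if (text != null) {
                codeToText.put(entry.getKey(), text);
            }
        }
        return codeToText;
    }

    /**
     * Replaces every char whose numeric value is a key of codeToText.
     * All other chars pass through unchanged.
     * Assumption: the input was drawn entirely in the affected font.
     */
    static String remap(String extracted, Map<Integer, String> codeToText) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            String replacement = codeToText.get((int) c);
            out.append(replacement != null ? replacement : String.valueOf(c));
        }
        return out.toString();
    }
}
```

To use it, fill one map with the font's code -> glyph name pairs (such as 112 -> "G70"), fill a second map with the characters you have determined for those glyph names, call buildCodeToText once per font, and pass each run of text from that font through remap. If the text comes from a mix of fonts, the remapping would have to happen while font information is still available, not on the final string.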

Luca Loiodice | Software Architect


On Thu, Jan 4, 2018 at 7:28 PM, Tilman Hausherr <THausherr@t-online.de>
wrote:

> On 04.01.2018 at 20:20, Luca Loiodice wrote:
>
>> I am trying to migrate a project from a commercial Windows PDF library to
>> PDFBox, but I see reduced accuracy when I extract text from arbitrary files.
>>
>> For example, I have a PDF (enclosed) that does not have Unicode mappings
>> for certain glyphs, so when I try to extract the text using PDFBox
>> I get the following:
>>
>
> Attachments are stripped by the list; you'd need to upload the file to a
> file-sharing host.
>
>
>> WARNING: No Unicode mapping for G70 (112) in font HAGLDF+MSTT31c5ed
>> Jan 04, 2018 10:24:02 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
>>
>> The Windows library returns the correct text for the glyph that lacks a
>> character mapping.
>> Is there a way for me to add some code so that PDFBox or my program can
>> figure out what the text is in this case?
>>
>
> Yes, but you'd need to build from source because G70 is a non-standard
> glyph name. The change is described at the bottom of
> https://issues.apache.org/jira/browse/PDFBOX-3962
>
> Tilman
>
>
>> Thanks for any help,
>> Luca
