pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: "No Unicode mapping for" when extracting text from a PDF
Date Thu, 04 Jan 2018 20:19:56 GMT
On 04.01.2018 at 20:57, Luca Loiodice wrote:
> Thanks.
>
> Any chance I can add the conversion as a post processing step and avoid
> having to build from source?

No... ExtractText returns nothing for these glyphs.

It's different if you work with TextPosition... then in theory, you 
could try to replicate all the steps from the source code.
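
A minimal sketch of one such step, in plain Java with no PDFBox dependency. It assumes, based only on the warning quoted below where G70 appears with code 112 (0x70), that this font names its glyphs "G" plus the character code in hex. That is a guess about this particular file, not a general rule, and not necessarily the change described in PDFBOX-3962. GlyphNameFallback and guessUnicode are hypothetical names, not PDFBox API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class GlyphNameFallback {

    // Assumption: non-standard names such as "G70" are "G" followed by
    // the character code written as two hex digits.
    private static final Pattern G_HEX = Pattern.compile("G([0-9A-Fa-f]{2})");

    private GlyphNameFallback() {
    }

    /**
     * Returns a guessed one-character string for the glyph, or null when
     * the name does not fit the assumed pattern or disagrees with the code.
     */
    static String guessUnicode(String glyphName, int code) {
        if (glyphName == null) {
            return null;
        }
        Matcher m = G_HEX.matcher(glyphName);
        if (!m.matches()) {
            return null;
        }
        int fromName = Integer.parseInt(m.group(1), 16);
        // Only trust the guess when the hex value in the name equals the
        // character code reported for the glyph (112 for G70 in the log).
        if (fromName != code) {
            return null;
        }
        return String.valueOf((char) fromName);
    }
}
```

In a PDFTextStripper subclass, the glyph name and code would have to come from each TextPosition (in PDFBox 2.0, to my knowledge, via getFont() and getCharacterCodes(), and for simple fonts via getEncoding().getName(code)). That wiring is left out here.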

> Because I get the code back as part of the extracted text ... so I was
> wondering if I can load the font from the PDF
> and use the code -> glyph name matrix to replace the code with the
> character.
>
> In that case I am not sure how I can load the data from the font ... but I
> see the debugger is able to do it.

That is also available in source.

But if I were you, I wouldn't bother with such files. There are many 
files where no Unicode mapping is available. You'll have to use OCR for those.

Tilman

> On Thu, Jan 4, 2018 at 7:28 PM, Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
>> On 04.01.2018 at 20:20, Luca Loiodice wrote:
>>
>>> I am trying to migrate a project from a commercial Windows PDF library to
>>> PDFBox, but I see reduced accuracy when I extract text from arbitrary files.
>>>
>>> For example, I have a PDF (enclosed) that does not have Unicode mappings
>>> for certain glyphs, so when I try to extract the text using PDFBox
>>> I get the following:
>>>
>> Attachments are swallowed, you'd need to upload to a sharehoster.
>>
>>
>>> WARNING: No Unicode mapping for G70 (112) in font HAGLDF+MSTT31c5ed
>>> Jan 04, 2018 10:24:02 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
>>>
>>> The Windows library returns the correct text for the glyph with the missing
>>> character mapping.
>>> Is there a way for me to add some code to make PDFBox or my program
>>> figure out what the text is in this case?
>>>
>> Yes, but you'd need to build from source because G70 is a non-standard
>> glyph name; the change is described at the bottom of
>> https://issues.apache.org/jira/browse/PDFBOX-3962
>>
>> Tilman
>>
>>
>>> Thanks for any help,
>>> Luca
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>
>>
>>



