pdfbox-users mailing list archives

From Maryam Z <maryamuom...@gmail.com>
Subject Re: [Identity-H][Text Extraction] Overwrite toUnicode Mapping
Date Mon, 21 Nov 2016 11:27:19 GMT
Hi All,

Apologies for the late reply. Thank you for all the ideas and resources.

I found a workaround based on the character codes: creating a custom
code-to-Unicode mapping for each font gave results that were good enough
for the task at hand.
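
A minimal sketch of what such a per-font mapping could look like, assuming the font name and the raw character code are available for each glyph during extraction. The class name `FontCodeMapper`, the font name and the code values used with it are hypothetical, and the wiring into PDFBox's text extraction is not shown because it depends on the PDFBox version.

```java
import java.util.HashMap;
import java.util.Map;

/** Per-font override table: (font name, character code) -> Unicode text. */
final class FontCodeMapper {
    private final Map<String, Map<Integer, String>> tables = new HashMap<>();

    /** Registers the text that a character code should produce for one font. */
    void put(String fontName, int code, String unicode) {
        tables.computeIfAbsent(fontName, k -> new HashMap<>()).put(code, unicode);
    }

    /** Returns the override if one exists, otherwise the text the extractor produced. */
    String decode(String fontName, int code, String extracted) {
        Map<Integer, String> table = tables.get(fontName);
        if (table == null) {
            return extracted; // no custom table for this font
        }
        return table.getOrDefault(code, extracted);
    }
}
```

Falling back to the extracted text keeps fonts with a usable ToUnicode map unaffected, so only the problematic fonts need a hand-built table.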

@John - Can't ෙ, ේ and ෝ be treated as three separate glyphs, since each
has its own Unicode value (\u0DD9, \u0DDA and \u0DDD)? I know very little
about complex scripts and will look into this in detail as time permits.

Thank you once again.

On Tue, Nov 1, 2016 at 3:09 AM, Tilman Hausherr <THausherr@t-online.de>
wrote:

> On 30.10.2016 at 07:46, Maryam Z wrote:
>
>> I came across the solution to the Identity-H problem for CJK fonts using
>> CMap files, but the CMap files for these two fonts are not available.
>>
>> I would be grateful if you could let me know if there is any way to
>> overwrite the toUnicode map and use a custom map in extraction, which
>> correctly maps glyphs to values, or if there is any other effective
>> solution for this problem.
>>
>
>
> Not sure if it helps (possibly not, as complex scripts were mentioned
> later), but this answer
>
> https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0
>
> shows how to create a ToUnicode CMap for a specific case.
>
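
For reference, a ToUnicode CMap of the kind mentioned in the quoted reply is plain text with a fixed structure. The sketch below generates one from a code-to-text map, assuming 2-byte character codes (as with Identity-H) and following the general ToUnicode CMap syntax from the PDF specification. It is not the code from the linked answer; the class name `ToUnicodeCMapSketch` is hypothetical, and attaching the result to a font in PDFBox is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ToUnicodeCMapSketch {
    /** Builds ToUnicode CMap text mapping 2-byte codes to UTF-16BE hex strings. */
    static String build(Map<Integer, String> codeToText) {
        StringBuilder sb = new StringBuilder();
        sb.append("/CIDInit /ProcSet findresource begin\n")
          .append("12 dict begin\n")
          .append("begincmap\n")
          .append("/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n")
          .append("/CMapName /Adobe-Identity-UCS def\n")
          .append("/CMapType 2 def\n")
          .append("1 begincodespacerange\n")
          .append("<0000> <FFFF>\n")
          .append("endcodespacerange\n");

        // Sort by code and emit bfchar blocks of at most 100 entries each.
        List<Map.Entry<Integer, String>> entries =
                new ArrayList<>(new TreeMap<>(codeToText).entrySet());
        for (int start = 0; start < entries.size(); start += 100) {
            int end = Math.min(start + 100, entries.size());
            sb.append(end - start).append(" beginbfchar\n");
            for (Map.Entry<Integer, String> e : entries.subList(start, end)) {
                sb.append(String.format("<%04X> <", e.getKey()));
                for (char c : e.getValue().toCharArray()) {
                    sb.append(String.format("%04X", (int) c)); // one UTF-16 code unit
                }
                sb.append(">\n");
            }
            sb.append("endbfchar\n");
        }

        sb.append("endcmap\n")
          .append("CMapName currentdict /CMap defineresource pop\n")
          .append("end\n")
          .append("end\n");
        return sb.toString();
    }
}
```

Because each target is written as a sequence of UTF-16 code units, a single character code can map to more than one Unicode character, which is useful when one glyph stands for a combination of characters.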
