pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Help needed to resolve issue with converting Arabic characters to presentation forms
Date Thu, 16 Feb 2012 07:02:29 GMT

On 16.02.2012 05:40, Hesham G. wrote:
> Hamed,
> Nice effort. Thanks for sharing the information.
> I hope you will be able to overcome this and share your solution.
I have to agree, thanks for the details. I also dug deeper into
that part of the code more than once. The issue is the CID-coded
glyph/character mapping. Maybe I'm able to crack that nut with your

> Best regards,
> Hesham
> ---------------------------------------------
> Included message :
>> Hi,
>> I'm trying to resolve PDFBOX-1216 that I reported a while ago by
>> debugging the PDFBox source code, and I need some advice on what to
>> do. In brief, the issue is that PDFBox doesn't use presentation forms
>> when rendering Arabic / Persian text from a PDF to an image, so the
>> characters are shown disconnected. I'm not sure yet, but I guess this
>> is called "ligature"?
>> Anyway, here's what I concluded so far, and if anyone could guide me,
>> I may be able to fix this and provide a patch.
>> * In the PDF file, different codes are used for the different
>> presentation forms of a single Unicode character (in the content
>> stream of the PDF file, under the "TJ" operator, which is "show text,
>> allowing individual glyph positioning").
>> * In the "ToUnicode" table of the PDF file (which is read into the
>> "cmap" variable of the PDFont class), all the presentation forms are
>> mapped to the same Unicode character (which is not in the
>> presentation range).
>> * When PDFBox draws text on the graphics canvas, it puts that Unicode
>> value in a string and calls the "PDSimpleFont.drawString" method.
>> * Since the single character is isolated, it is either not found in
>> the font, or the isolated form (if present) is rendered.
>> Example:
>> You can check characters in the following address:
>> http://en.wikipedia.org/wiki/Arabic_characters_in_Unicode
>> When there is a U+0647 character in the file ( ه ) and it should be
>> connected to the character before it, it should appear as
>> U+FEEA ( ﻪ ).
>> In the attached PDF file, this character appears in two different
>> fonts. The internal PDF codes for this character in the two fonts
>> are "00C4" and "03EA".
>> When I set a breakpoint in the "PDSimpleFont.drawString" method and
>> manually replace the string content with the appropriate presentation
>> form (like "\ufeea" for the above character), everything else works
>> fine and the output image is correct (the presentation form is found
>> in the font, whereas the original character, "\u0647", is not
>> embedded in it).
>> PDF viewers have some way of figuring out the presentation forms,
>> because the PDF is displayed correctly in all viewers.
>> But I could not find out how I can determine which character code
>> should be mapped to which presentation form. I'm not very familiar
>> with the internals of PDF files. If any of the developers can guide
>> me on where to look next, I'd hopefully be able to figure out a way
>> to fix this.
>> Thanks in advance
>> Hamed
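
To make the mapping problem in the quoted message concrete, here is a minimal sketch in plain Java. It does not use PDFBox, and the class name, the Position enum and both helper methods are hypothetical names invented for this illustration. It only shows that the four presentation forms of U+0647 all collapse to the same base letter under Unicode compatibility normalization (NFKC), which mirrors how the ToUnicode table maps several glyph codes to one character. Going in the other direction needs extra information (the joining position) that the base character alone does not carry.

```java
import java.text.Normalizer;

public class PresentationFormDemo {

    // Hypothetical joining positions for illustration; not a PDFBox type.
    enum Position { ISOLATED, FINAL, INITIAL, MEDIAL }

    // Collapse a presentation form to its base letter via compatibility
    // normalization. This is the "many codes -> one character" direction.
    static String toBaseLetter(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Presentation forms of ARABIC LETTER HEH (U+0647) in the
    // Arabic Presentation Forms-B block. Covers this one letter only.
    static char hehForm(Position p) {
        switch (p) {
            case FINAL:   return '\uFEEA';
            case INITIAL: return '\uFEEB';
            case MEDIAL:  return '\uFEEC';
            default:      return '\uFEE9'; // ISOLATED
        }
    }

    public static void main(String[] args) {
        for (Position p : Position.values()) {
            char form = hehForm(p);
            String base = toBaseLetter(String.valueOf(form));
            // Print each presentation form next to the base letter it maps to.
            System.out.printf("%-8s U+%04X -> U+%04X%n",
                    p, (int) form, (int) base.charAt(0));
        }
    }
}
```

The sketch suggests why the ToUnicode value is not enough for rendering: once the glyph code has been mapped to U+0647, the information about which of the four forms was meant is gone, so it would have to come from the glyph code itself or from the neighbouring characters.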

Andreas Lehmkühler
