pdfbox-users mailing list archives

From Hamed Iravanchi <ha...@iravanchi.com>
Subject Re: Help needed to resolve issue with converting Arabic characters to presentation forms
Date Sat, 18 Feb 2012 13:40:33 GMT
Hi again,

Regarding the CID-coded glyph/character mapping, I have some more
findings that I want to share. Maybe one of you can point out
something that helps me get there faster.

Using Adobe Acrobat, I was able to dig deep in the PDF file structure,
and see how the data is being read by PDFBox.

There are two utilities in the "Options" menu of the Adobe Acrobat "Preflight" tool:
* "Browse Internal PDF Structure"
* "Browse Internal Structure of All Document Fonts"

In the first one, I could find the "ToUnicode" mapping that I talked
about before, in the font resources. The font is a Type 0 font, which
has a "CIDFontType2" descendant font. The "awtFont" used to draw
characters on the graphics object is read from the "FontFile2" stream
inside this object in the PDF.

There are no CID mappings in this font; CIDToGIDMap is "Identity".
I'll include a screenshot of this in the email.

On the other hand, the second option ("Browse Internal Structure of
All Document Fonts") contains glyph details, and ALSO correct CID
mappings. It's in the following path:
Font > Internal Structure > Data Tables > Character to Glyph Mapping ('cmap')

For each character, the data contains both the correct Unicode value
(either the original character or its presentation form) and the
correct glyph code.

In PDFBox, if I map the CID to the correct Unicode value from this
table, it should work fine. But I could not find anywhere in the
PDFBox code where such mappings are read from the PDF file, and I have
no idea where in the PDF file this information is stored.
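
To make the idea concrete, here is a minimal sketch of the lookup I
have in mind. It is not PDFBox code. It assumes the 'cmap' table has
already been read into a plain map (Unicode code point -> glyph ID)
and that CIDToGIDMap is "Identity", so a CID can be used directly as a
glyph ID. The class name, method names and table entries are
hypothetical and only for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class CidToUnicodeSketch {

    /** Inverts a 'cmap'-style table (Unicode code point -> glyph ID). */
    static Map<Integer, Integer> invert(Map<Integer, Integer> unicodeToGid) {
        Map<Integer, Integer> gidToUnicode = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : unicodeToGid.entrySet()) {
            // If several code points share one glyph, keep the smallest one.
            gidToUnicode.merge(e.getValue(), e.getKey(), (a, b) -> Math.min(a, b));
        }
        return gidToUnicode;
    }

    /** Returns the code point for a CID, or -1 if the glyph is not in the table. */
    static int cidToUnicode(int cid, Map<Integer, Integer> gidToUnicode) {
        int gid = cid; // assumption: CIDToGIDMap is "Identity"
        Integer codePoint = gidToUnicode.get(gid);
        return codePoint == null ? -1 : codePoint;
    }

    public static void main(String[] args) {
        // Hypothetical table entries, NOT taken from the real font.
        Map<Integer, Integer> cmap = new HashMap<>();
        cmap.put(0xFEE9, 0x03E9);
        cmap.put(0xFEEA, 0x03EA);

        Map<Integer, Integer> reverse = invert(cmap);
        System.out.println(String.format("U+%04X", cidToUnicode(0x03EA, reverse)));
    }
}
```

The open question stays the same: where the real table comes from, and
how to get at it from inside PDFBox.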

If anyone has an idea, please let me know.

Thanks a lot,
Hamed

-- Original Message:

Hi,

Am 16.02.2012 05:40, schrieb Hesham G.:

Hamed ,

Nice effort .. Thanks for sharing the nice information. I hope you
will be able to overcome this, and share your solution.

I have to agree, thanks for the details. I also dug deeper into that
part of the code more than once. The issue is the CID-coded
glyph/character mapping. Maybe I'm able to crack that nut with your
information.

Best regards , Hesham

--------------------------------------------- Included message :

Hi,

I'm trying to resolve PDFBOX-1216 that I reported a while ago by
debugging the PDFBox source code, and I need some advice on what to
do. In brief, the issue is that PDFBox doesn't use presentation forms
when creating PDF images for Arabic / Persian text in PDF, thus the
characters are shown disconnected. I'm not sure yet, but I guess this
is called "ligature"?

Anyway, here's what I concluded so far, and if anyone could guide me,
I may be able to fix this and provide a patch.

* In PDF file, different codes are used for different presentation
forms of a single unicode character (under Content stream of PDF file,
under "TJ" command which is "show text, allowing individual glyph
positioning")

* In the "ToUnicode" table of PDF file (which is read into the "cmap"
variable of PDFont class), all the presentation forms are mapped to
the same unicode character (which is not in the presentation range)

* When PDFBox is drawing text on the graphics canvas, it puts the
unicode value in a string and calls the "PDSimpleFont.drawString"
method.

* Since the single character is isolated, it is either not found in
the Font, or the isolated form (if present) is rendered.

Example:

You can check characters in the following address:
http://en.wikipedia.org/wiki/Arabic_characters_in_Unicode

When there is a U+0647 character in the file ( ه ), and should be
connected to the character before it, it should appear as U+FEEA ( ﻪ
). In the attached PDF file, this character appears in two different
fonts. The internal PDF codes for this character in the two fonts are
"00C4" and "03EA".
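
To illustrate why the base character alone is not enough, here is a
small sketch using only the standard java.text.Normalizer class (this
is not what PDFBox does, just a way to show the relationship). Unicode
compatibility normalization (NFKC) folds all four presentation forms
of this letter back to U+0647, so the mapping is many-to-one and cannot
be reversed without extra information. The class and method names are
hypothetical:

```java
import java.text.Normalizer;

public class PresentationFormSketch {

    /** Folds Arabic presentation forms back to their base characters. */
    static String toBaseForm(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Isolated, final, initial and medial forms of the same letter.
        String[] forms = { "\uFEE9", "\uFEEA", "\uFEEB", "\uFEEC" };
        for (String form : forms) {
            String base = toBaseForm(form);
            System.out.println(String.format("U+%04X -> U+%04X",
                    (int) form.charAt(0), (int) base.charAt(0)));
        }
    }
}
```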

When I set a breakpoint in the "PDSimpleFont.drawString" method and
manually replace the string content with the appropriate presentation
form (like "\ufeea" for the above character), everything else works
fine and the output image is correct (the presentation form is found
in the font, whereas the original character, "\u0647", is not
embedded in the font).

PDF viewers have some way of figuring out the presentation forms,
because the PDF is displayed correctly in all viewers.

But I could not find out how to determine which character code
should be mapped to which presentation form. I'm not very familiar
with the internals of the PDF format. If any of the developers can
guide me on where to look next, I'd hopefully be able to figure out a
way to fix this.

Thanks in advance,
Hamed

BR Andreas Lehmkühler
