pdfbox-users mailing list archives

From Hamed Iravanchi <ha...@iravanchi.com>
Subject Re: Help needed to resolve issue with converting Arabic characters to presentation forms
Date Sat, 18 Feb 2012 17:52:35 GMT
Hi again.

Thanks for your attention to the issue.
I checked and saw that the font itself (the TTF stream) contains the
correct cmap. If we can draw the text using glyph IDs instead of
characters, the font knows the right glyphs to draw.

I checked the Font class instance in the debugger; it contains a cmap
that is exactly right. At first I looked for ways to take the mapping
from the font, but it is a private member, specific to the Sun impl.

Then I realized we could ask the font to draw glyphs instead of characters.
But I still couldn't find the right way to draw a glyph on a Graphics object.
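
For reference, the standard java.awt API does offer a route for drawing by glyph ID: Font.createGlyphVector(FontRenderContext, int[]) builds a GlyphVector from glyph codes, and Graphics2D.drawGlyphVector draws it. The following is only a minimal sketch, not PDFBox code; the class name, the logical font and the glyph IDs are placeholders chosen for illustration, and in PDFBox the font would presumably be the one loaded from the embedded stream.

```java
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.font.FontRenderContext;
import java.awt.font.GlyphVector;
import java.awt.image.BufferedImage;

// Hypothetical class name, for illustration only.
public class GlyphDrawSketch {

    /** Draws the given glyph IDs at (x, y) and returns the GlyphVector that was used. */
    static GlyphVector drawGlyphs(Graphics2D g, Font font, int[] glyphIds, float x, float y) {
        FontRenderContext frc = g.getFontRenderContext();
        // This overload takes glyph codes, not characters, so no
        // character-to-glyph lookup through the font's cmap is involved.
        GlyphVector gv = font.createGlyphVector(frc, glyphIds);
        g.drawGlyphVector(gv, x, y);
        return gv;
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(200, 50, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = image.createGraphics();
        // Assumption: a logical font and arbitrary glyph IDs stand in for the
        // embedded font and the glyph IDs taken from the PDF.
        Font font = new Font(Font.SANS_SERIF, Font.PLAIN, 24);
        GlyphVector gv = drawGlyphs(g, font, new int[] {36, 37, 38}, 10f, 35f);
        System.out.println("glyphs drawn: " + gv.getNumGlyphs());
        g.dispose();
    }
}
```

Which glyphs actually appear depends entirely on the font, since glyph IDs are font-specific.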

BTW, I can also do the implementation and send you a patch once I figure out
what to do. Thanks for your encouragement :-)

- Hamed
 On Feb 18, 2012 7:05 PM, "Andreas Lehmkuehler" <andreas@lehmi.de> wrote:

> Hi
>
> Am 18.02.2012 14:40, schrieb Hamed Iravanchi:
>
>> Hi again,
>>
>> Regarding the CID-coded glyph/character mapping, I have some
>> more findings that I want to share; maybe one of you can point
>> out something that helps me get there faster.
>>
>> Using Adobe Acrobat, I was able to dig deep in the PDF file structure,
>> and see how the data is being read by PDFBox.
>>
>> There are two utilities in the "options" menu of Adobe Acrobat
>> "Preflight" tool:
>> * "Browse Internal PDF Structure"
>>
> PDFBox also provides a tool (PDFDebugger) to browse the internal structure
> of a pdf.
>
>> * "Browse Internal Structure of All Document Fonts"
>>
>> In the first one, I could find the "ToUnicode" mapping that I talked
>> about before in the font resources. The font is a Type 0 font, which
>> has a "CIDFontType2" descendant font. The "awtFont" used to draw
>> characters on the graphics object is read from the "FontFile2" stream
>> inside this object in the PDF.
>>
>> There are no CID mappings in this font. CIDToGIDMap is "Identity". I'll
>> include a screenshot of this in the email.
>>
>> On the other hand, the second option ("Browse Internal Structure of
>> All Document Fonts") contains glyph details, and ALSO correct CID
>> mappings. It's in the following path:
>> Font > Internal Structure > Data Tables > Character to Glyph Mapping
>> ('cmap')
>>
>> For each character, the data contains both the correct Unicode value
>> (either the original or the presentation form) and the correct glyph code.
>>
>> In PDFBox, if I map the CID to the correct Unicode value from this
>> table, it should work fine. But I could not find anywhere in the
>> PDFBox code where such mappings are read from the PDF file, and I have
>> no idea where in the PDF file such information is stored.
>>
>> If anyone has an idea, please let me know.
>>
> I guess I've cracked the nut. :-)
>
> - PDFBox renders text from strings, the same ones that are used for text
> extraction
> - in case of CID-encoded fonts the ToUnicode mapping is used to get
> readable strings, but these strings can't be used to draw the text
> - in case of CID-encoded fonts we have to use the font-internal ID to
> address the glyphs
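
To illustrate the points above: with CIDToGIDMap set to "Identity", as reported earlier in the thread, the CID read from the content stream is itself the glyph ID, so no Unicode string is needed for drawing. The sketch below is not the actual PDFBox change. It assumes 2-byte big-endian character codes (as with an Identity-H encoding, which the thread does not confirm), and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical class name, for illustration only.
public class IdentityCidSketch {

    /** Splits raw string bytes into 2-byte big-endian codes (assumed Identity-H style). */
    static int[] decodeCids(byte[] raw) {
        if (raw.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of bytes");
        }
        int[] cids = new int[raw.length / 2];
        for (int i = 0; i < cids.length; i++) {
            // Mask with 0xFF because Java bytes are signed.
            cids[i] = ((raw[2 * i] & 0xFF) << 8) | (raw[2 * i + 1] & 0xFF);
        }
        return cids;
    }

    /** With CIDToGIDMap "Identity", the glyph ID is the CID itself. */
    static int[] cidsToGids(int[] cids) {
        return cids.clone();
    }

    public static void main(String[] args) {
        byte[] raw = {0x00, 0x24, 0x03, (byte) 0xA9};
        int[] gids = cidsToGids(decodeCids(raw));
        System.out.println(Arrays.toString(gids)); // prints [36, 937]
    }
}
```

The ToUnicode mapping would still be used for text extraction; only the drawing path would switch to glyph IDs.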
>
> I have to clean up the code and run some tests before checking in the code.
>
>> Thanks a lot,
>> Hamed
>>
> We have to thank you; your detailed analysis helped me find out which
> piece of code is still missing.
>
>> -- Original Message:
>>
>> Hi,
>>
>> Am 16.02.2012 05:40, schrieb Hesham G.:
>>
>> Hamed,
>>
>> Nice effort. Thanks for sharing the nice information. I hope you
>> will be able to overcome this and share your solution.
>>
>> I have to agree, thanks for the details. I also dug deeper into that
>> part of the code more than once. The issue is the CID-coded
>> glyph/character mapping. Maybe I'm able to crack that nut with your
>> information.
>>
>> Best regards, Hesham
>>
>> --------------------------------------------- Included message:
>>
> <SNIP>
>
> BR
> Andreas Lehmkühler
>
