pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Help needed to resolve issue with converting Arabic characters to presentation forms
Date Sat, 18 Feb 2012 15:35:18 GMT
Hi

On 18.02.2012 14:40, Hamed Iravanchi wrote:
> Hi again,
>
> Regarding the CID-coded glyph/character mapping, I have some more
> findings that I want to share. Maybe one of you can point out
> something that helps me get there faster.
>
> Using Adobe Acrobat, I was able to dig deep in the PDF file structure,
> and see how the data is being read by PDFBox.
>
> There are two utilities in the "Options" menu of the Adobe Acrobat "Preflight" tool:
> * "Browse Internal PDF Structure"
PDFBox also provides a tool (PDFDebugger) to browse the internal structure of a
PDF.

> * "Browse Internal Structure of All Document Fonts"
>
> In the first one, I could find the "ToUnicode" mapping that I talked
> about before in the font resources. The font is a Type 0 font, which
> has a "CIDFontType2" descendant font. The "awtFont" used to draw
> characters on the graphics object is read from the "FontFile2" stream
> inside this object in the PDF.
>
> There are no CID mappings in this font. CIDToGIDMap is "Identity". I'll
> include a screenshot of this in the email.
>
> On the other hand, the second option ("Browse Internal Structure of
> All Document Fonts") contains glyph details, and ALSO correct CID
> mappings. It's in the following path:
> Font>  Internal Structure>  Data Tables>  Character to Glyph Mapping ('cmap')
>
> For each character, the data contains both the correct Unicode value
> (either the original or the presentation form) and the correct glyph code.
>
> In PDFBox, if I map the CID to the correct Unicode value from this
> table, it should work fine. But I could not find anywhere in the
> PDFBox code where such mappings are read from the PDF file, and I have
> no idea where in the PDF file such information is stored.
>
> If anyone has an idea, please let me know.
I guess I've cracked the nut. :-)

- PDFBox renders the same strings that it uses for text extraction.
- For CID-encoded fonts, the ToUnicode mapping is used to get readable
  strings, but these strings can't be used to draw the text.
- For CID-encoded fonts, we have to use the font's internal ID to address the
  glyphs.
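
To illustrate the distinction in the list above, here is a minimal, self-contained sketch in plain Java. It does not use the PDFBox API and is not the fix that was checked in. The class name, the method names, the 2-byte code layout and the sample ToUnicode entries are all assumptions made up for illustration. It only shows that the same character codes feed two separate paths: ToUnicode for extraction, and the CID itself (with CIDToGIDMap set to Identity) for selecting glyphs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in PDFBox.
class CidGlyphSketch {

    // Assumption: the string uses 2-byte codes, as with an Identity-H encoding,
    // so each pair of bytes is one CID.
    static List<Integer> decodeCids(byte[] data) {
        List<Integer> cids = new ArrayList<>();
        for (int i = 0; i + 1 < data.length; i += 2) {
            cids.add(((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF));
        }
        return cids;
    }

    // Text extraction path: CID -> readable Unicode via the ToUnicode mapping.
    static String toText(List<Integer> cids, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int cid : cids) {
            String s = toUnicode.get(cid);
            sb.append(s != null ? s : "\uFFFD"); // unmapped CID -> replacement char
        }
        return sb.toString();
    }

    // Rendering path: with CIDToGIDMap = Identity, the glyph ID equals the CID.
    // The Unicode string from toText() is not consulted here at all.
    static int[] toGlyphIds(List<Integer> cids) {
        int[] gids = new int[cids.size()];
        for (int i = 0; i < gids.length; i++) {
            gids[i] = cids.get(i);
        }
        return gids;
    }

    public static void main(String[] args) {
        // Made-up sample data: two CIDs and an invented ToUnicode table.
        byte[] data = {0x00, 0x03, 0x00, 0x2A};
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x0003, "\u0628"); // ARABIC LETTER BEH
        toUnicode.put(0x002A, "\uFE91"); // BEH, initial presentation form

        List<Integer> cids = decodeCids(data);
        System.out.println("text length: " + toText(cids, toUnicode).length());
        for (int gid : toGlyphIds(cids)) {
            System.out.println("glyph id: " + gid);
        }
    }
}
```

The point of keeping the two methods separate is that drawing never goes through the Unicode string, so a ToUnicode table that maps to original or presentation forms cannot select the wrong glyph.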

I have to clean up the code and run some tests before checking in the code.

> Thanks a lot,
> Hamed
We have to thank you. Your detailed analysis helped me find out which piece of
code is still missing.

> -- Original Message:
>
> Hi,
>
> On 16.02.2012 05:40, Hesham G. wrote:
>
> Hamed,
>
> Nice effort. Thanks for sharing this information. I hope you
> will be able to overcome this and share your solution.
>
> I have to agree, thanks for the details. I also dug deeper into that
> part of the code more than once. The issue is the CID-coded
> glyph/character mapping. Maybe I'm able to crack that nut with your
> information.
>
> Best regards , Hesham
>
> --------------------------------------------- Included message :
<SNIP>

BR
Andreas Lehmkühler
