pdfbox-dev mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Questions about toUnicode Cmap
Date Fri, 09 Mar 2012 06:30:57 GMT

On 08.03.2012 09:52, Leleu Eric wrote:
> Hi,
> 2012/3/8 Andreas Lehmkuehler <andreas@lehmi.de>
>> Hi,
>> On 07.03.2012 09:15, Leleu Eric wrote:
>>> Hi all,
>>> I'm currently working on the preflight issue PDFBOX-1236 [1].
>>> The error seems to come from the handling of the "toUnicode" CMap in a
>>> Type0 font.
>>> The "toUnicode" CMap overrides the "Encoding" CMap of the font. Because
>>> of this behaviour, the preflight validator receives the Unicode value for
>>> each character code present in a Text operator instead of the CID value
>>> defined in the Encoding CMap.
>> Can you give me a pointer to where exactly in the preflight code that happens?
> You can find the Text validation in the
> "org.apache.padaf.preflight.contentstream.ContentStreamWrapper" class.
> The method is validText(byte[] string).
> We pass the character to the font.encode method to find out how many bytes
> are used to describe the CID.
> Once we have the CID, checkCID on
> "org.apache.padaf.preflight.font.CFFType2FontContainer" is called, and an
> exception occurs when we look up the glyph ID for this CID.
> If I comment out the initialization of the toUnicode map, I find the right
> glyphs.
> The first one is the 'W', glyph 58, linked to CID 1. (If I extract the
> font and open it with fontforge, glyph 58 is the 'W' too.)
I'll have a look at the weekend.
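
A minimal sketch of the lookup problem described above, for illustration only. It assumes fixed 2-byte character codes and uses small hypothetical tables; the class, method, and map names are invented for this example and are not PDFBox or preflight API. Only the values "CID 1" and "glyph 58 ('W')" come from the report; the character code 0x0001 is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; none of these names are PDFBox or preflight API. */
class CidLookupSketch {

    // Assumed Encoding CMap: 2-byte character code -> CID.
    static final Map<Integer, Integer> ENCODING_CMAP = new HashMap<>();
    // Assumed ToUnicode CMap: character code -> Unicode code point.
    static final Map<Integer, Integer> TO_UNICODE_CMAP = new HashMap<>();
    // Assumed CID -> glyph ID table of the embedded font.
    static final Map<Integer, Integer> CID_TO_GID = new HashMap<>();

    static {
        ENCODING_CMAP.put(0x0001, 1);            // code 0x0001 -> CID 1 (assumed code)
        TO_UNICODE_CMAP.put(0x0001, (int) 'W');  // code 0x0001 -> U+0057
        CID_TO_GID.put(1, 58);                   // CID 1 -> glyph 58 ('W'), as reported
    }

    /** Splits a text-operator string into character codes, assuming 2 bytes per code. */
    static List<Integer> splitCodes(byte[] string) {
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i + 1 < string.length; i += 2) {
            codes.add(((string[i] & 0xFF) << 8) | (string[i + 1] & 0xFF));
        }
        return codes;
    }

    /** Expected path: code -> CID -> glyph ID; returns null if no glyph is found. */
    static Integer glyphViaEncoding(int code) {
        Integer cid = ENCODING_CMAP.get(code);
        return cid == null ? null : CID_TO_GID.get(cid);
    }

    /** Faulty path: the Unicode value is mistaken for a CID, so the lookup misses. */
    static Integer glyphViaToUnicode(int code) {
        Integer unicode = TO_UNICODE_CMAP.get(code);
        return unicode == null ? null : CID_TO_GID.get(unicode);
    }

    public static void main(String[] args) {
        byte[] operand = {0x00, 0x01};
        for (int code : splitCodes(operand)) {
            System.out.println("via Encoding:  " + glyphViaEncoding(code));
            System.out.println("via ToUnicode: " + glyphViaToUnicode(code));
        }
    }
}
```

With these assumed tables, the Encoding path resolves glyph 58, while the ToUnicode path looks for CID 87 (U+0057) and finds nothing, which mirrors the failed glyph lookup in the report.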

>>> So I have two questions:
>>> - Is the "Encoding overriding" the right thing to do?
>>> - Why is the "toUnicode" CMap used to display text? According to my
>>> understanding of the PDF Reference v1.7, the toUnicode CMap is used to
>>> extract text from a PDF file and to create a text file with Unicode
>>> characters. To display the text in a PDF reader, the font content and the
>>> Encoding CMap seem to be enough.
>> PDFBox uses Graphics2D#drawString and, more recently,
>> java.awt.Font#createGlyphVector to render the text. The text has to be
>> provided as a Unicode string when calling those methods.
>> IMO we have to change that in the long run. It would be better to create
>> the glyphs using the font directly instead of converting it to an AWT font.
> I don't need to render the text in the preflight component; I only check
> that the glyph is present and that the width is consistent.
> Bypassing the AWT font would be great, but it is a huge amount of work.
Yes, but we need to do that, because some of the needed fonts aren't supported 
or the support is buggy, see PDFBOX-490.

>>> What is your point of view on these two points?
>> Probably we can find a workaround for your issue, but I need some more
>> details on how the preflight code works (see above).
>>> BR,
>>> Eric
>>> [1] https://issues.apache.org/jira/browse/PDFBOX-1236
>> BR
>> Andreas Lehmkühler
> BR
> Eric

Andreas Lehmkühler
