pdfbox-dev mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Questions about toUnicode Cmap
Date Tue, 13 Mar 2012 18:10:19 GMT
Hi

On 09.03.2012 07:30, Andreas Lehmkuehler wrote:
> Hi,
>
> On 08.03.2012 09:52, Leleu Eric wrote:
>> Hi,
>>
>> 2012/3/8 Andreas Lehmkuehler<andreas@lehmi.de>
>>
>>> Hi,
>>>
>>> On 07.03.2012 09:15, Leleu Eric wrote:
>>>
>>> Hi all,
>>>>
>>>>
>>>> I'm currently working on the preflight issue PDFBOX-1236 [1]
>>>>
>>>> The error seems to come from the management of the "toUnicode" CMap in a
>>>> Type0 font.
>>>>
>>>> The "toUnicode" CMap overrides the "Encoding" CMap of the font. Due to this
>>>> behaviour, the preflight validator receives the Unicode value for each
>>>> character code present in a Text operator instead of the CID value present
>>>> in the Encoding CMap.
>>>>
>>> Can you give me a pointer to where exactly in the preflight code that happens?
>>
>> You can find the text validation in the
>> "org.apache.padaf.preflight.contentstream.ContentStreamWrapper" class.
>> The method is validText(byte[] string).
>>
>> We ask the font.encode method for the character, to know how many bytes are
>> used to describe the CID.
>> Once we have the CID, checkCID on the
>> "org.apache.padaf.preflight.font.CFFType2FontContainer" is called, and an
>> exception occurs when we look up the glyph ID for this CID.
>>
>> If I comment out the initialization of the toUnicode map, I find the right
>> glyphs.
>> The first one is the 'W', glyph 58, linked to CID 1. (If I extract the
>> font and open it with fontforge, glyph 58 is the 'W' too.)
> I'll have a look at the weekend.
>
>>>> So I have two questions:
>>>> - Is the "Encoding overriding" the right thing to do?
>>>> - Why is the "toUnicode" CMap used to display text? According to my
>>>> understanding of the PDF Reference v1.7, the toUnicode CMap is used to
>>>> extract text from a PDF file and to create a text file with Unicode
>>>> characters. To display the text in a PDF reader, the font content and the
>>>> Encoding CMap seem to be enough.
>>>>
>>> PDFBox uses Graphics2D#drawString and, more recently,
>>> java.awt.Font#createGlyphVector to render the text. The text has to be
>>> provided as a Unicode string when calling those methods.
>>> IMO we have to change that in the long run. It would be better to create
>>> the glyphs using the font directly instead of converting it to an AWT font.
>>>
>>
>> I don't need to render the text in the preflight component; I only check
>> that the glyph is present and I check the consistency of the width.
>>
>> Bypassing the AWT font would be great, but it is a huge amount of work.
> Yes, but we need to do that, because some of the needed fonts aren't supported
> or the support is buggy, see PDFBOX-490.
>
>>> What is your point of view about these two points?
>>>>
>>> Probably we can find a workaround for your issue, but I need some more
>>> details on how the preflight code works (see above).
I had a look and I guess there is no workaround.

I don't know the original purpose of PDFont#encode, but nowadays it tries to
provide a readable version of the encoded text. AFAIK it's used in 3 different
cases:

- text extraction: works fine as long as PDFBox knows how to encode the text
- rendering: the rendering uses Graphics2D#drawString and therefore also
needs the readable text. BUT this doesn't work in many cases (CID fonts,
substituted fonts etc.). In the long run we have to use the CID too, to support
every kind of font
- preflight: ContentStreamWrapper#validText expects to get the CID when calling
PDFont#encode, but that only works if the CID and the returned string happen
to be identical (cid == string)

To make it more complicated, the encoding cmap is overwritten if a ToUnicode
cmap is used at the same time.
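
To illustrate the overwriting, here is a minimal sketch. It is not PDFBox
code: the class, its methods and the assumption that character code 1 maps to
CID 1 are made up for the example. It only shows why a lookup that should
yield the CID yields the Unicode value once a ToUnicode cmap is loaded into
the same slot.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a font object; this is not PDFBox code.
class SharedCMapDemo {
    // A single slot holds whichever CMap was loaded last.
    private Map<Integer, Integer> cmap = new HashMap<>();

    void loadEncoding(Map<Integer, Integer> codeToCid) {
        cmap = new HashMap<>(codeToCid);
    }

    void loadToUnicode(Map<Integer, Integer> codeToUnicode) {
        // Replaces the encoding mapping instead of living next to it.
        cmap = new HashMap<>(codeToUnicode);
    }

    Integer lookup(int code) {
        return cmap.get(code);
    }

    static int demo() {
        Map<Integer, Integer> encoding = new HashMap<>();
        encoding.put(1, 1);            // assumed: character code 1 maps to CID 1
        Map<Integer, Integer> toUnicode = new HashMap<>();
        toUnicode.put(1, (int) 'W');   // the same code maps to U+0057
        SharedCMapDemo font = new SharedCMapDemo();
        font.loadEncoding(encoding);
        font.loadToUnicode(toUnicode);
        return font.lookup(1);         // the Unicode value, not the CID
    }
}
```

A caller that treats the result as a CID would then search for a glyph under
0x57 instead of 1, which matches the failure Eric describes.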

TODO:

- separate the ToUnicode cmap from the encoding cmap
- split PDFont#encode to get one method providing the string and one providing
the CID.
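
A rough sketch of what the two TODO items could look like together. Everything
here is hypothetical (class name, method names, the -1 and null conventions)
and not existing PDFBox API; it is only meant to show the two mappings kept
apart, with one accessor per use case.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed split; none of these names are PDFBox API.
class SplitFontSketch {
    // Encoding cmap: character code -> CID, used for glyph lookup.
    private final Map<Integer, Integer> encodingCMap = new HashMap<>();
    // ToUnicode cmap: character code -> readable text, used for extraction.
    private final Map<Integer, String> toUnicodeCMap = new HashMap<>();

    void addEncoding(int code, int cid) {
        encodingCMap.put(code, cid);
    }

    void addToUnicode(int code, String text) {
        toUnicodeCMap.put(code, text);
    }

    /** Returns the CID for a character code, or -1 if the code is unmapped. */
    int codeToCid(int code) {
        Integer cid = encodingCMap.get(code);
        return cid == null ? -1 : cid;
    }

    /** Returns the readable text for a character code, or null if unmapped. */
    String codeToUnicode(int code) {
        return toUnicodeCMap.get(code);
    }
}
```

With such a split, preflight would call the CID method and text extraction the
string method, so loading a ToUnicode cmap no longer affects the CID lookup.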


>>>> BR,
>>>> Eric
>>>>
>>>> [1]
>>>> https://issues.apache.org/jira/browse/PDFBOX-1236
>>>>
>>>>
>>>
>>> BR
>>> Andreas Lehmkühler
>>>
>>
>> BR
>> Eric
>
>

BR
Andreas Lehmkühler

