pdfbox-dev mailing list archives

From Andreas Lehmkühler (JIRA) <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-1824) [PATCH] CFF fonts render wrong glyphs
Date Fri, 03 Jan 2014 19:03:55 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861784#comment-13861784 ]

Andreas Lehmkühler commented on PDFBOX-1824:
--------------------------------------------

[~jahewson] Thanks for the fast patch. I've one too, but I'm still testing to avoid side effects.

The remaining issue may be related to PDFBOX-1691.

> [PATCH] CFF fonts render wrong glyphs
> -------------------------------------
>
>                 Key: PDFBOX-1824
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1824
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: Andreas Lehmkühler
>              Labels: patch
>             Fix For: 2.0.0
>
>         Attachments: 1.patch, 2.patch, 3.patch, Bimbo_Historia_20070409_Esp.pdf-2-rev-1554775.png,
>                      Bimbo_Historia_20070409_Esp.pdf-2-rev-current.png, all.patch, bimbo_historia-patched.jpg,
>                      bimbo_historia.patch, calluna-11.pdf, patched.jpg, trunk.jpg
>
>
> I've found three very closely related CFF encoding issues in v2.0.0 when using PDFToImage.
> Problem 1
> ---------
> Look at line 7 of the poem: it should read "And the mouldering dust that years have made"
> but instead reads "Afld the fioulderiflg dust that years have fiade".
> The CFF font is assumed to use CIDs, but it does not if it's not a ROS font.
> Therefore we add a check for CFF ROS class.
> Patch 1 fixes this.
> Problem 2
> ---------
> Look at line 3: "of right shoice" should be "of right choice".
> Likewise, on line 2 of the 2nd paragraph, "And a staunsh" should be "And a staunch";
> the st and ch ligatures are incorrect.
> This is because the font is a CFF ROS CID font and the glyphs for the st and ch ligatures
> both have no name. The CFF format achieves this by using SIDs beyond the size of the string
> index, which map to .notdef. So there is a unique SID for each glyph, but not a unique name.
> Unfortunately, PDFBox assumes that Type 1 fonts have glyphs with unique names, and this
> assumption appears throughout the codebase. Because a glyph name and a SID perform essentially
> the same role, I recommend a simple solution to the problem: when a SID beyond the size of
> the string index is encountered, instead of mapping it to .notdef it should be mapped to
> a new name with the prefix "SID", for example mapping SID 409 to the name "SID409". That way
> each glyph will have a unique name, which is what PDFBox assumes.
> Patch 2 fixes this.
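
The renaming rule proposed for Problem 2 can be sketched roughly as follows. This is a
minimal Java illustration, not the code in patch 2: the class SidNameSketch, the method
glyphName, and its array parameters are hypothetical, and the standard-string table is
filled with placeholders. It assumes the CFF convention that SIDs 0 to 390 are predefined
standard strings, that higher SIDs index into the font's own string index, and that the
SID passed in is non-negative.

```java
import java.util.Arrays;

class SidNameSketch {
    // CFF predefines 391 standard strings (SIDs 0..390); custom strings follow.
    static final int STANDARD_STRING_COUNT = 391;

    // Hypothetical helper: resolve a SID to a glyph name that is always unique.
    static String glyphName(int sid, String[] standardStrings, String[] stringIndex) {
        if (sid < standardStrings.length) {
            return standardStrings[sid];          // predefined standard string
        }
        int offset = sid - standardStrings.length;
        if (offset < stringIndex.length) {
            return stringIndex[offset];           // string stored in the font
        }
        // Beyond the string index: synthesize a name instead of returning ".notdef".
        return "SID" + sid;
    }

    public static void main(String[] args) {
        String[] standard = new String[STANDARD_STRING_COUNT];
        Arrays.fill(standard, "placeholder");     // stand-in for the real table
        standard[0] = ".notdef";
        String[] custom = { "c_h", "s_t" };       // hypothetical names for SIDs 391 and 392

        System.out.println(glyphName(391, standard, custom)); // c_h
        System.out.println(glyphName(409, standard, custom)); // SID409
    }
}
```

With this rule two unnamed glyphs such as SID 409 and SID 410 no longer collapse onto the
same ".notdef" key, which is the property the rest of the code is described as relying on.
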
> Problem 3
> ---------
> Look at line 2, "That creepeth oÉer ruins old!": the word "o'er" is incorrectly rendered
> as "oÉer". This is because the Encoding entry in the PDF maps code 201 from "Eacute" in the
> base encoding to "quoteright", but this is being ignored by PDFBox.
> In the CFFGlyph2D constructor PDFBox examines the font's built-in charset. When the name
> "quoteright" is encountered it is looked up in the PDF Encoding (i.e. nameToCode) where
> it is changed to code 201. Thus code 201 is associated with the "quoteright" glyph in the
> codeToGlyph map. This is correct.
> However, later when the "Eacute" glyph is encountered, its built-in charset code is also
> 201 (which is standard) and so the codeToGlyph map entry is overwritten, resulting in
> code 201 being associated with the "Eacute" glyph. 
> The solution is to build the codeToGlyph map in a strict order: first populate it with the
> font's built-in charset, then the PDF Encoding overwrites any entries which it defines.
> Patch 3 fixes this (and also replaces patch 2).
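
The two-pass ordering described for Problem 3 can be illustrated with a small Java sketch.
It is not the code from patch 3: the class CodeToGlyphSketch and the method build are
hypothetical, and glyphs are represented simply by their names rather than by the glyph
objects CFFGlyph2D actually stores. The only point shown is the order of population:
built-in charset first, PDF Encoding second, so the Encoding always wins on conflicts.

```java
import java.util.HashMap;
import java.util.Map;

class CodeToGlyphSketch {
    // Hypothetical helper: merge the two sources in a fixed order.
    static Map<Integer, String> build(Map<Integer, String> builtInCharset,
                                      Map<Integer, String> pdfEncoding) {
        // Pass 1: start from the font's built-in code-to-glyph assignments.
        Map<Integer, String> codeToGlyph = new HashMap<>(builtInCharset);
        // Pass 2: entries defined by the PDF Encoding overwrite the built-in ones.
        codeToGlyph.putAll(pdfEncoding);
        return codeToGlyph;
    }

    public static void main(String[] args) {
        Map<Integer, String> builtIn = new HashMap<>();
        builtIn.put(201, "Eacute");       // standard built-in assignment
        builtIn.put(65, "A");

        Map<Integer, String> encoding = new HashMap<>();
        encoding.put(201, "quoteright");  // remapping from the PDF's Encoding entry

        Map<Integer, String> codeToGlyph = build(builtIn, encoding);
        System.out.println(codeToGlyph.get(201)); // quoteright
        System.out.println(codeToGlyph.get(65));  // A
    }
}
```

Because the second pass runs only after the first has finished, the result no longer
depends on the order in which glyphs appear in the font's charset.
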



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
