pdfbox-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "John Hewson (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PDFBOX-1824) [PATCH] CFF fonts render wrong glyphs
Date Thu, 02 Jan 2014 08:39:50 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

John Hewson updated PDFBOX-1824:
--------------------------------

    Description: 
I've found three very closely related CFF encoding issues in v2.0.0 when using PDFToImage.

Problem 1
---------

Look a line 7 of the poem, it should be "And the mouldering dust that years have made"
but instead says "Afld the fioulderiflg dust that years have fiade"

The CFF font is asseumed to use CIDs but it does not if its not a ROS font.
Therefore we add a check for CFF ROS class.

Patch 1 fixes this.

Problem 2
---------

Look at line 3 "of right shoice" should be "of right choice".
Likewise on line 2 of the 2nd paragraph "And a staunsh" should be "And a staunch",
the st and ch ligatures are incorrect.

This is because the font is an CFF ROS CID Font and the glyphs for the st and ch ligatures
both have no name. The CFF format achieves this by using SIDs beyond the size of the string
index, which map to .notdef. So there is a unique SID for each glyph, but not a unique name.

Unfortuntely, PDFBox assumes that Type 1 fonts have glyphs with unique names, and this
assumtion appears throughout the codebase. Because a glyph name and a SID perform essentially
the same role, I recommend a simple solution to the problem: when an SID beyond the size of
the string index is encounteted, instead of mapping it to .notdef it should be mapped to 
a new name with the prefix "SID" for example mapping SID 409 to the name "SID409". That way
each glyph will have a unique name, which is what PDFbox assumes.

Patch 2 fixes this.

Problem 3
---------

Look at line 2, "That creepeth oÉer ruins old!" the word "o'er" is incorrectly rendered
as "oÉer". This is because the Encoding entry in the PDF maps code 201 from "Eacute" in the
base encoding to "quoteright", but this is being ignored by PDFBox.

In the CFFGlyph2D constructor PDFBox examines the font's built-in charset. When the name
"quoteright" is encountered it is looked up in the PDF Encoding (i.e. nameToCode) where
it is changed to code 201. Thus code 201 is associated with the "quoteright" glyph in the
codeToGlyph map. This is correct. 

However, later when the "Eacute" glyph is encountered, its built-in charset code is also
201 (which is standard) and so the codeToGlyph map entry is overwritten, resulting in
code 201 being associated with the "Eacute" glyph. 

The solution is to build the codeToGlyph map in a strict order: first populate it with the
font's built-in charset, then the PDF Encoding overwrites any entries which it defines.

Patch 3 fixes this (and also replaces patch 2)


  was:
I've found three very closely related CFF encoding issues in v2.0.0

Problem 1
---------

Look a line 7 of the poem, it should be "And the mouldering dust that years have made"
but instead says "Afld the fioulderiflg dust that years have fiade"

The CFF font is asseumed to use CIDs but it does not if its not a ROS font.
Therefore we add a check for CFF ROS class.

Patch 1 fixes this.

Problem 2
---------

Look at line 3 "of right shoice" should be "of right choice".
Likewise on line 2 of the 2nd paragraph "And a staunsh" should be "And a staunch",
the st and ch ligatures are incorrect.

This is because the font is an CFF ROS CID Font and the glyphs for the st and ch ligatures
both have no name. The CFF format achieves this by using SIDs beyond the size of the string
index, which map to .notdef. So there is a unique SID for each glyph, but not a unique name.

Unfortuntely, PDFBox assumes that Type 1 fonts have glyphs with unique names, and this
assumtion appears throughout the codebase. Because a glyph name and a SID perform essentially
the same role, I recommend a simple solution to the problem: when an SID beyond the size of
the string index is encounteted, instead of mapping it to .notdef it should be mapped to 
a new name with the prefix "SID" for example mapping SID 409 to the name "SID409". That way
each glyph will have a unique name, which is what PDFbox assumes.

Patch 2 fixes this.

Problem 3
---------

Look at line 2, "That creepeth oÉer ruins old!" the word "o'er" is incorrectly rendered
as "oÉer". This is because the Encoding entry in the PDF maps code 201 from "Eacute" in the
base encoding to "quoteright", but this is being ignored by PDFBox.

In the CFFGlyph2D constructor PDFBox examines the font's built-in charset. When the name
"quoteright" is encountered it is looked up in the PDF Encoding (i.e. nameToCode) where
it is changed to code 201. Thus code 201 is associated with the "quoteright" glyph in the
codeToGlyph map. This is correct. 

However, later when the "Eacute" glyph is encountered, its built-in charset code is also
201 (which is standard) and so the codeToGlyph map entry is overwritten, resulting in
code 201 being associated with the "Eacute" glyph. 

The solution is to build the codeToGlyph map in a strict order: first populate it with the
font's built-in charset, then the PDF Encoding overwrites any entries which it defines.

Patch 3 fixes this (and also replaces patch 2)



> [PATCH] CFF fonts render wrong glyphs
> -------------------------------------
>
>                 Key: PDFBOX-1824
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1824
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>              Labels: patch
>         Attachments: 1.patch, 2.patch, 3.patch, all.patch, calluna-11.pdf, patched.jpg,
trunk.jpg
>
>
> I've found three very closely related CFF encoding issues in v2.0.0 when using PDFToImage.
> Problem 1
> ---------
> Look a line 7 of the poem, it should be "And the mouldering dust that years have made"
> but instead says "Afld the fioulderiflg dust that years have fiade"
> The CFF font is asseumed to use CIDs but it does not if its not a ROS font.
> Therefore we add a check for CFF ROS class.
> Patch 1 fixes this.
> Problem 2
> ---------
> Look at line 3 "of right shoice" should be "of right choice".
> Likewise on line 2 of the 2nd paragraph "And a staunsh" should be "And a staunch",
> the st and ch ligatures are incorrect.
> This is because the font is an CFF ROS CID Font and the glyphs for the st and ch ligatures
> both have no name. The CFF format achieves this by using SIDs beyond the size of the
string
> index, which map to .notdef. So there is a unique SID for each glyph, but not a unique
name.
> Unfortuntely, PDFBox assumes that Type 1 fonts have glyphs with unique names, and this
> assumtion appears throughout the codebase. Because a glyph name and a SID perform essentially
> the same role, I recommend a simple solution to the problem: when an SID beyond the size
of
> the string index is encounteted, instead of mapping it to .notdef it should be mapped
to 
> a new name with the prefix "SID" for example mapping SID 409 to the name "SID409". That
way
> each glyph will have a unique name, which is what PDFbox assumes.
> Patch 2 fixes this.
> Problem 3
> ---------
> Look at line 2, "That creepeth oÉer ruins old!" the word "o'er" is incorrectly rendered
> as "oÉer". This is because the Encoding entry in the PDF maps code 201 from "Eacute"
in the
> base encoding to "quoteright", but this is being ignored by PDFBox.
> In the CFFGlyph2D constructor PDFBox examines the font's built-in charset. When the name
> "quoteright" is encountered it is looked up in the PDF Encoding (i.e. nameToCode) where
> it is changed to code 201. Thus code 201 is associated with the "quoteright" glyph in
the
> codeToGlyph map. This is correct. 
> However, later when the "Eacute" glyph is encountered, its built-in charset code is also
> 201 (which is standard) and so the codeToGlyph map entry is overwritten, resulting in
> code 201 being associated with the "Eacute" glyph. 
> The solution is to build the codeToGlyph map in a strict order: first populate it with
the
> font's built-in charset, then the PDF Encoding overwrites any entries which it defines.
> Patch 3 fixes this (and also replaces patch 2)



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Mime
View raw message