Mailing-List: contact dev-help@pdfbox.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@pdfbox.apache.org
Date: Fri, 3 Jan 2014 19:00:04 +0000 (UTC)
From: "John Hewson (JIRA)" <jira@apache.org>
To: dev@pdfbox.apache.org
Message-ID: <JIRA.12686938.1388651381428.39521.1388775604513@arcas>
In-Reply-To: <JIRA.12686938.1388651381428@arcas>
References: <JIRA.12686938.1388651381428@arcas>
Subject: [jira] [Updated] (PDFBOX-1824) [PATCH] CFF fonts render wrong
 glyphs
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


     [ https://issues.apache.org/jira/browse/PDFBOX-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-1824:
--------------------------------

    Attachment: bimbo_historia-patched.jpg

I've attached "bimbo_historia-patched.jpg", which shows the result of the new
patch. Note that some of the glyphs are missing, which I can confirm is a
problem related to parsing Type1 CharStrings and is *not* an Encoding issue.
I will open a new bug for this with more details once the current issue is
closed.

> [PATCH] CFF fonts render wrong glyphs
> -------------------------------------
>
>                 Key: PDFBOX-1824
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1824
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: Andreas Lehmkühler
>              Labels: patch
>             Fix For: 2.0.0
>
>         Attachments: 1.patch, 2.patch, 3.patch, Bimbo_Historia_20070409_Esp.pdf-2-rev-1554775.png, Bimbo_Historia_20070409_Esp.pdf-2-rev-current.png, all.patch, bimbo_historia-patched.jpg, bimbo_historia.patch, calluna-11.pdf, patched.jpg, trunk.jpg
>
>
> I've found three very closely related CFF encoding issues in v2.0.0 when using PDFToImage.
> Problem 1
> ---------
> Look at line 7 of the poem: it should be "And the mouldering dust that years have made"
> but instead says "Afld the fioulderiflg dust that years have fiade".
> The CFF font is assumed to use CIDs, but it does not if it's not a ROS font.
> Therefore we add a check for CFF ROS class.
> Patch 1 fixes this.
> Problem 2
> ---------
> Look at line 3: "of right shoice" should be "of right choice".
> Likewise, on line 2 of the 2nd paragraph, "And a staunsh" should be "And a staunch";
> the st and ch ligatures are incorrect.
> This is because the font is a CFF ROS CID font and the glyphs for the st and ch ligatures
> both have no name. The CFF format achieves this by using SIDs beyond the size of the string
> index, which map to .notdef. So there is a unique SID for each glyph, but not a unique name.
> Unfortunately, PDFBox assumes that Type 1 fonts have glyphs with unique names, and this
> assumption appears throughout the codebase. Because a glyph name and a SID perform essentially
> the same role, I recommend a simple solution to the problem: when a SID beyond the size of
> the string index is encountered, instead of mapping it to .notdef it should be mapped to
> a new name with the prefix "SID", for example mapping SID 409 to the name "SID409". That way
> each glyph will have a unique name, which is what PDFBox assumes.
> Patch 2 fixes this.
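
A minimal sketch of the naming rule proposed for Problem 2, not the code in
the attached patch. The class and method names (SidNames, nameForSid) are
hypothetical, and the string table is assumed to be a flat list indexed
directly by SID, which simplifies the real CFF layout:

```java
import java.util.List;

// Hypothetical helper, not part of PDFBox or FontBox.
class SidNames {
    /**
     * Returns a glyph name for the given SID.
     * Assumption: `strings` is a flat lookup table indexed by SID.
     */
    static String nameForSid(int sid, List<String> strings) {
        if (sid < 0) {
            throw new IllegalArgumentException("SID must be non-negative: " + sid);
        }
        if (sid < strings.size()) {
            // SID is inside the string table: use the real name.
            return strings.get(sid);
        }
        // SID is beyond the string table: synthesize a unique name
        // instead of collapsing every such glyph to ".notdef".
        return "SID" + sid;
    }
}
```

With a three-entry table, nameForSid(409, table) yields "SID409" and
nameForSid(410, table) yields "SID410", so two unnamed ligature glyphs no
longer share a name.
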
> Problem 3
> ---------
> Look at line 2, "That creepeth oÉer ruins old!" the word "o'er" is incorrectly rendered
> as "oÉer". This is because the Encoding entry in the PDF maps code 201 from "Eacute" in the
> base encoding to "quoteright", but this is being ignored by PDFBox.
> In the CFFGlyph2D constructor PDFBox examines the font's built-in charset. When the name
> "quoteright" is encountered it is looked up in the PDF Encoding (i.e. nameToCode) where
> it is changed to code 201. Thus code 201 is associated with the "quoteright" glyph in the
> codeToGlyph map. This is correct.
> However, later when the "Eacute" glyph is encountered, its built-in charset code is also
> 201 (which is standard) and so the codeToGlyph map entry is overwritten, resulting in
> code 201 being associated with the "Eacute" glyph.
> The solution is to build the codeToGlyph map in a strict order: first populate it with the
> font's built-in charset, then the PDF Encoding overwrites any entries which it defines.
> Patch 3 fixes this (and also replaces patch 2).
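
A minimal sketch of the two-pass ordering described for Problem 3, not the
code in the attached patch. The class and method names (CodeToGlyphOrder,
build) are hypothetical, and glyphs are represented by their names as plain
strings, which is an assumption made to keep the example self-contained:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
class CodeToGlyphOrder {
    /**
     * Builds a code-to-glyph-name map in a strict order:
     * built-in charset first, then the PDF Encoding on top.
     */
    static Map<Integer, String> build(Map<Integer, String> builtIn,
                                      Map<Integer, String> pdfEncoding) {
        // Pass 1: start from the font's built-in code assignments.
        Map<Integer, String> codeToGlyph = new LinkedHashMap<>(builtIn);
        // Pass 2: entries defined by the PDF Encoding replace built-in ones.
        codeToGlyph.putAll(pdfEncoding);
        return codeToGlyph;
    }
}
```

If the built-in charset maps 201 to "Eacute" and the PDF Encoding maps 201
to "quoteright", the result maps 201 to "quoteright" regardless of the order
in which the font's glyphs are visited, because the override is always
applied last.
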


--
This message was sent by Atlassian JIRA
(v6.1.5#6160)