Date: Fri, 3 Jan 2014 19:00:04 +0000 (UTC)
From: "John Hewson (JIRA)"
To: dev@pdfbox.apache.org
Subject: [jira] [Updated] (PDFBOX-1824) [PATCH] CFF fonts render wrong glyphs

     [ https://issues.apache.org/jira/browse/PDFBOX-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-1824:
--------------------------------

    Attachment: bimbo_historia-patched.jpg

I've attached "bimbo_historia-patched.jpg" which shows the result of the new patch. Note that some of the glyphs are missing, which I can confirm is a problem related to parsing Type1 CharStrings and is *not* an Encoding issue. I will open a new bug for this with more details once the current issue is closed.

> [PATCH] CFF fonts render wrong glyphs
> -------------------------------------
>
>                 Key: PDFBOX-1824
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1824
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: Andreas Lehmkühler
>              Labels: patch
>             Fix For: 2.0.0
>
>         Attachments: 1.patch, 2.patch, 3.patch, Bimbo_Historia_20070409_Esp.pdf-2-rev-1554775.png, Bimbo_Historia_20070409_Esp.pdf-2-rev-current.png, all.patch, bimbo_historia-patched.jpg, bimbo_historia.patch, calluna-11.pdf, patched.jpg, trunk.jpg
>
>
> I've found three very closely related CFF encoding issues in v2.0.0 when using PDFToImage.
> Problem 1
> ---------
> Look at line 7 of the poem: it should be "And the mouldering dust that years have made"
> but instead says "Afld the fioulderiflg dust that years have fiade".
> The CFF font is assumed to use CIDs, but it does not if it's not a ROS font.
> Therefore we add a check for CFF ROS class.
> Patch 1 fixes this.
> Problem 2
> ---------
> Look at line 3: "of right shoice" should be "of right choice".
> Likewise on line 2 of the 2nd paragraph, "And a staunsh" should be "And a staunch";
> the st and ch ligatures are incorrect.
> This is because the font is a CFF ROS CID font and the glyphs for the st and ch ligatures
> both have no name. The CFF format achieves this by using SIDs beyond the size of the string
> index, which map to .notdef. So there is a unique SID for each glyph, but not a unique name.
> Unfortunately, PDFBox assumes that Type 1 fonts have glyphs with unique names, and this
> assumption appears throughout the codebase. Because a glyph name and a SID perform essentially
> the same role, I recommend a simple solution to the problem: when a SID beyond the size of
> the string index is encountered, instead of mapping it to .notdef it should be mapped to
> a new name with the prefix "SID", for example mapping SID 409 to the name "SID409". That way
> each glyph will have a unique name, which is what PDFBox assumes.
> Patch 2 fixes this.
> Problem 3
> ---------
> Look at line 2, "That creepeth oÉer ruins old!": the word "o'er" is incorrectly rendered
> as "oÉer". This is because the Encoding entry in the PDF maps code 201 from "Eacute" in the
> base encoding to "quoteright", but this is being ignored by PDFBox.
> In the CFFGlyph2D constructor PDFBox examines the font's built-in charset. When the name
> "quoteright" is encountered it is looked up in the PDF Encoding (i.e. nameToCode) where
> it is changed to code 201. Thus code 201 is associated with the "quoteright" glyph in the
> codeToGlyph map. This is correct.
> However, later when the "Eacute" glyph is encountered, its built-in charset code is also
> 201 (which is standard) and so the codeToGlyph map entry is overwritten, resulting in
> code 201 being associated with the "Eacute" glyph.
> The solution is to build the codeToGlyph map in a strict order: first populate it with the
> font's built-in charset, then the PDF Encoding overwrites any entries which it defines.
> Patch 3 fixes this (and also replaces patch 2).
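
For readers who do not have the patches to hand, the naming rule from Problem 2 and the ordering rule from Problem 3 could be sketched roughly as below. This is only an illustration and is not the code from patch 2 or patch 3: the class CodeToGlyphSketch, the methods glyphNameForSid and buildCodeToGlyph, and the plain array and Map inputs are all hypothetical stand-ins, not PDFBox API. It also simplifies the CFF layout by assuming a single table of names indexed directly by SID, and it uses glyph names where the real codeToGlyph map holds glyphs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not PDFBox code: all names and data shapes are assumptions.
class CodeToGlyphSketch
{
    /**
     * Problem 2: give every SID a unique glyph name.
     * Assumption: 'strings' holds every name addressable by SID, indexed by SID.
     * A SID outside that table gets a synthetic "SID<n>" name instead of ".notdef".
     */
    static String glyphNameForSid(int sid, String[] strings)
    {
        if (sid >= 0 && sid < strings.length)
        {
            return strings[sid];
        }
        return "SID" + sid;
    }

    /**
     * Problem 3: build the map in a strict order.
     * Step 1 copies the font's built-in code-to-name entries.
     * Step 2 applies the PDF Encoding, overwriting any code it defines.
     */
    static Map<Integer, String> buildCodeToGlyph(Map<Integer, String> builtIn,
                                                 Map<Integer, String> pdfEncoding)
    {
        Map<Integer, String> codeToGlyph = new LinkedHashMap<Integer, String>(builtIn);
        codeToGlyph.putAll(pdfEncoding);
        return codeToGlyph;
    }
}
```

With a built-in entry of 201 -> "Eacute" and a PDF Encoding entry of 201 -> "quoteright", buildCodeToGlyph leaves code 201 associated with "quoteright", because the PDF Encoding is applied last. Codes that the PDF Encoding does not mention keep their built-in names.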