Date: Fri, 13 Jun 2014 07:44:02 +0000 (UTC)
From: "Antti Lankila (JIRA)"
To: dev@pdfbox.apache.org
Subject: [jira] [Commented] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)

    [ https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030350#comment-14030350 ]

Antti Lankila commented on PDFBOX-922:
--------------------------------------

I do not really understand what makes you say that. Isn't a subsetted font basically just a wholly different font file, with a bunch of glyphs removed from the original one? For instance, assuming it is a TTF file, you drop a bunch of glyphs and then update the cmaps to reference the appropriate glyph indexes, and then you have a new TTF file.
If so, I can't see the problem, because you are providing all the same information as with the original font, only with fewer glyphs included.

On the other hand, I do understand that if you write the text stream using the encoding of one font, then change the definition of the TTF font without re-encoding the text, you definitely run into problems. But the only possible way to keep CIDs stable is to define a standard for them, such as that CIDs are UCS-2. This can be done, but as far as I can tell this limits code points to the range below 0x10000, because CID font writing writes 16-bit character indexes by definition, and there is no notion of the surrogate pairs of UTF-16. It might not be a real problem in practice, but it's nevertheless a limitation that the identity mapping for glyph indexes does not have. The only limitation of the latter approach is that a single font can't have more than 65536 glyphs.

BTW, I've been quiet on this front because I solved my immediate problem by switching to a PDF rendering library called jPod. It's not as advanced as pdfbox, and it didn't support Unicode text either, but it was possible to get CID-keyed fonts to work on it without touching the library itself, just by providing appropriate COS objects and setting up an encoding based on the font's Windows Unicode cmap. I even managed to set up working copy-paste by providing the ToUnicode PostScript program, so I got everything working nicely using that 2008-era library, but I had to write most of the PDF object factories myself.

> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> --------------------------------------------------------------------
>
> Key: PDFBOX-922
> URL: https://issues.apache.org/jira/browse/PDFBOX-922
> Project: PDFBox
> Issue Type: New Feature
> Components: Writing
> Affects Versions: 1.3.1
> Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
> Reporter: Thanos Agelatos
> Assignee: Andreas Lehmkühler
> Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff
>
>
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it creates, making it impossible to create PDFs in any language apart from English and ones supported in WinAnsiEncoding. This behaviour occurs because the method PDTrueTypeFont.loadTTF hardcodes WinAnsiEncoding, and there are no Identity-H or Identity-V Encoding classes provided (to set afterwards via PDFont.setFont()).
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese
> The PDF created contains garbled characters and/or squares.
> Simple test case:
> PDDocument doc = null;
> try {
>     doc = new PDDocument();
>     PDPage page = new PDPage();
>     doc.addPage(page);
>     // extract fonts for fields
>     byte[] arialNorm = extractFont("arial.ttf");
>     //byte[] arialBold = extractFont("arialbd.ttf");
>     //PDFont font = PDType1Font.HELVETICA;
>     PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
>
>     PDPageContentStream contentStream = new PDPageContentStream(doc, page);
>     contentStream.beginText();
>     contentStream.setFont(font, 12);
>     contentStream.moveTextPositionByAmount(100, 700);
>     contentStream.drawString("Hello world from PDFBox ελληνικά"); // text here may appear garbled; insert any text in Greek or Bulgarian or Maltese
>     contentStream.endText();
>     contentStream.close();
>     doc.save("pdfbox.pdf");
>     System.out.println(" created!");
> } catch (Exception ioe) {
>     ioe.printStackTrace();
> } finally {
>     if (doc != null) {
>         try { doc.close(); } catch (Exception e) {}
>     }
> }

--
This message was sent by Atlassian JIRA
(v6.2#6252)
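
The UCS-2 limitation mentioned in the comment above (CIDs written as 16-bit values, with no notion of surrogate pairs) can be illustrated with a small sketch. This is only an illustration under stated assumptions: the class `Ucs2CidSketch` and its methods are hypothetical names, not part of PDFBox or jPod, and the sketch assumes the "CID equals Unicode code point" convention discussed in the comment, with each code written as two big-endian bytes.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper for illustration only; not a PDFBox or jPod API. */
class Ucs2CidSketch {

    /**
     * Encodes text as 2-byte big-endian codes, one per Unicode code point,
     * assuming the convention "CID == code point". Code points above
     * U+FFFF cannot be represented and are rejected.
     */
    static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > 0xFFFF) {
                // In UTF-16 this would be a surrogate pair; a single
                // 16-bit CID has no room for it.
                throw new IllegalArgumentException(
                        String.format("U+%X does not fit in a 16-bit CID", cp));
            }
            out.write(cp >> 8);   // high byte
            out.write(cp & 0xFF); // low byte
            i += Character.charCount(cp);
        }
        return out.toByteArray();
    }

    /** Formats bytes as upper-case hex, as they would appear in a <...> PDF string. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Greek epsilon and lambda are below U+10000, so they encode fine.
        System.out.println(toHex(encode("\u03B5\u03BB")));
        try {
            // U+1D11E (musical G clef) needs a surrogate pair in UTF-16.
            encode("\uD834\uDD1E");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With identity mapping to glyph indexes, the same two bytes would instead carry the glyph index, which is why that approach is bounded by the number of glyphs in one font rather than by the Unicode range.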