Mailing-List: contact dev-help@pdfbox.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@pdfbox.apache.org
Date: Fri, 13 Jun 2014 07:44:02 +0000 (UTC)
From: "Antti Lankila (JIRA)" <jira@apache.org>
To: dev@pdfbox.apache.org
Message-ID: <JIRA.12493366.1292449328807.125174.1402645442860@arcas>
In-Reply-To: <JIRA.12493366.1292449328807@arcas>
References: <JIRA.12493366.1292449328807@arcas>
Subject: [jira] [Commented] (PDFBOX-922) True type PDFont subclass only
 supports WinAnsiEncoding (hardcoded!)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030350#comment-14030350 ]

Antti Lankila commented on PDFBOX-922:
--------------------------------------

I do not really understand what makes you say that. Isn't a subsetted font basically just a wholly different font file, with a bunch of glyphs removed from the original one? For instance, assuming it is a TTF file, you drop a bunch of glyphs, update the cmaps to reference the appropriate glyph indexes, and then you have a new TTF file. If so, I can't see the problem, because you are providing all the same information as with the original font, only with fewer glyphs included.
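
To illustrate the subsetting step described above, here is a minimal sketch of the cmap update only. It assumes the cmap is modelled as a plain code point to glyph ID map; the class and method names are hypothetical, and a real subsetter would also have to rewrite the glyf, loca, hmtx and other tables.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class SubsetCmapSketch {

    /**
     * Returns the cmap of a subsetted font: only the code points in
     * {@code keep} survive, and their glyph IDs are renumbered densely.
     * Glyph 0 (.notdef) is always kept as glyph 0.
     */
    static Map<Integer, Integer> subsetCmap(Map<Integer, Integer> cmap, Set<Integer> keep) {
        // Collect the old glyph IDs that survive, in ascending order.
        TreeSet<Integer> oldGids = new TreeSet<>();
        oldGids.add(0);
        for (int cp : keep) {
            Integer gid = cmap.get(cp);
            if (gid != null) {
                oldGids.add(gid);
            }
        }

        // Assign new, consecutive glyph IDs: old ID -> new ID.
        Map<Integer, Integer> renumber = new HashMap<>();
        for (int oldGid : oldGids) {
            renumber.put(oldGid, renumber.size());
        }

        // Rebuild the cmap so that it references the new glyph IDs.
        Map<Integer, Integer> subset = new TreeMap<>();
        for (int cp : keep) {
            Integer gid = cmap.get(cp);
            if (gid != null) {
                subset.put(cp, renumber.get(gid));
            }
        }
        return subset;
    }

    public static void main(String[] args) {
        // Made-up glyph IDs for illustration only.
        Map<Integer, Integer> cmap = new HashMap<>();
        cmap.put(0x41, 36);    // A
        cmap.put(0x42, 37);    // B
        cmap.put(0x3B1, 200);  // Greek small alpha
        cmap.put(0x3B2, 201);  // Greek small beta

        Set<Integer> keep = new TreeSet<>(Arrays.asList(0x41, 0x3B1));
        System.out.println(subsetCmap(cmap, keep));
    }
}
```

The point of the sketch is that the code point to glyph lookup keeps working after subsetting, while the glyph IDs themselves change, which is why text already written with the old glyph IDs would need re-encoding.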

On the other hand, I do understand that if you write the text stream using the encoding of one font, then change the definition of the TTF font without re-encoding the text, you definitely run into problems. But the only possible way to keep CIDs stable is to define a standard for them, such as that CIDs are UCS-2. This can be done, but as far as I can tell it limits code points to the range below 0x10000, because CID font writing writes 16-bit character indexes by definition, and there is no notion of the surrogate pairs of UTF-16. It might not be a real problem in practice, but it's nevertheless a limitation that the identity mapping for glyph indexes does not have. The only limitation of the latter approach is that a single font can't have more than 65536 glyphs.
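
The difference between the two approaches can be sketched as follows. This is an illustration under stated assumptions, not PDFBox code: the class and method names are hypothetical, every character index is written as two big-endian bytes, and the cmap is modelled as a plain code point to glyph ID map with made-up glyph IDs.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;

public class CidEncodingSketch {

    /** CID = UCS-2 code unit: code points at or above 0x10000 cannot be written. */
    static byte[] encodeAsUcs2Cids(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        text.codePoints().forEach(cp -> {
            if (cp > 0xFFFF) {
                throw new IllegalArgumentException(
                        String.format("U+%X does not fit in a 16-bit CID", cp));
            }
            out.write(cp >> 8);    // high byte
            out.write(cp & 0xFF);  // low byte
        });
        return out.toByteArray();
    }

    /** CID = glyph ID (identity mapping): any code point works if the font has a glyph for it. */
    static byte[] encodeAsGlyphIds(String text, Map<Integer, Integer> cmap) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        text.codePoints().forEach(cp -> {
            int gid = cmap.getOrDefault(cp, 0);  // 0 is .notdef
            if (gid > 0xFFFF) {
                throw new IllegalArgumentException("glyph ID " + gid + " exceeds 16 bits");
            }
            out.write(gid >> 8);
            out.write(gid & 0xFF);
        });
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E";  // U+1D11E, outside the BMP

        // Works because the lookup goes through the (made-up) cmap entry.
        byte[] viaGlyphs = encodeAsGlyphIds(clef, Map.of(0x1D11E, 5));
        System.out.println(viaGlyphs.length + " bytes via glyph IDs");

        try {
            encodeAsUcs2Cids(clef);
        } catch (IllegalArgumentException e) {
            System.out.println("UCS-2 CIDs: " + e.getMessage());
        }
    }
}
```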

BTW, I've been quiet on this front because I solved my immediate problem by switching to a PDF rendering library called jPod. It's not as advanced as PDFBox, and it didn't support Unicode text either, but it was possible to get CID-keyed fonts to work in it without touching the library itself, just by providing appropriate COS objects and setting up an encoding based on the font's Windows Unicode cmap. I even managed to get copy-paste working by providing the ToUnicode PostScript program, so I got everything working nicely using that 2008-era library, but I had to write most of the PDF object factories myself.
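
For readers unfamiliar with the ToUnicode program mentioned above, the following is a rough sketch of how such a CMap could be generated as text. It is not the code used with jPod and calls no jPod or PDFBox API; the class and method names are hypothetical, and the input is assumed to be a CID to Unicode code point map, for example the inverse of the font's Unicode cmap when CIDs are glyph IDs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ToUnicodeSketch {

    /** Builds the text of a ToUnicode CMap for 2-byte CIDs. */
    static String buildToUnicode(Map<Integer, Integer> cidToCodePoint) {
        StringBuilder sb = new StringBuilder();
        sb.append("/CIDInit /ProcSet findresource begin\n");
        sb.append("12 dict begin\n");
        sb.append("begincmap\n");
        sb.append("/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n");
        sb.append("/CMapName /Adobe-Identity-UCS def\n");
        sb.append("/CMapType 2 def\n");
        sb.append("1 begincodespacerange\n<0000> <FFFF>\nendcodespacerange\n");

        // Sort by CID and emit bfchar blocks of at most 100 entries each.
        List<Map.Entry<Integer, Integer>> entries =
                new ArrayList<>(new TreeMap<>(cidToCodePoint).entrySet());
        for (int i = 0; i < entries.size(); i += 100) {
            int n = Math.min(100, entries.size() - i);
            sb.append(n).append(" beginbfchar\n");
            for (Map.Entry<Integer, Integer> e : entries.subList(i, i + n)) {
                sb.append(String.format("<%04X> <%s>\n", e.getKey(), utf16Hex(e.getValue())));
            }
            sb.append("endbfchar\n");
        }

        sb.append("endcmap\n");
        sb.append("CMapName currentdict /CMap defineresource pop\n");
        sb.append("end\nend\n");
        return sb.toString();
    }

    /** Hex of the UTF-16BE form of a code point (a surrogate pair above U+FFFF). */
    static String utf16Hex(int codePoint) {
        StringBuilder hex = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            hex.append(String.format("%04X", (int) c));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Made-up CIDs for illustration only.
        Map<Integer, Integer> map = new TreeMap<>();
        map.put(3, 0x20);    // space
        map.put(36, 0x41);   // A
        map.put(200, 0x3B1); // Greek small alpha
        System.out.print(buildToUnicode(map));
    }
}
```

The resulting text would be stored as a stream and referenced from the font dictionary's ToUnicode entry. Note that the destination values are UTF-16BE, so surrogate pairs are possible on this side of the mapping even though the CIDs themselves stay 16-bit.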

> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-922
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-922
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Writing
>    Affects Versions: 1.3.1
>         Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
>            Reporter: Thanos Agelatos
>            Assignee: Andreas Lehmkühler
>         Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff
>
>
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it creates, making it impossible to create PDFs in any language apart from English and the ones supported by WinAnsiEncoding. This behaviour occurs because the method PDTrueTypeFont.loadTTF has WinAnsiEncoding hardcoded, and no Identity-H or Identity-V Encoding classes are provided (to set afterwards via PDFont.setFont() ).
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese
> The PDF created contains garbled characters and/or squares.
> Simple test case:
>         PDDocument doc = null;
>         try {
>             doc = new PDDocument();
>             PDPage page = new PDPage();
>             doc.addPage(page);
>             // extract fonts for fields
>             byte[] arialNorm = extractFont("arial.ttf");
>             //byte[] arialBold = extractFont("arialbd.ttf");
>             //PDFont font = PDType1Font.HELVETICA;
>             PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
>
>             PDPageContentStream contentStream = new PDPageContentStream(doc, page);
>             contentStream.beginText();
>             contentStream.setFont(font, 12);
>             contentStream.moveTextPositionByAmount(100, 700);
>             contentStream.drawString("Hello world from PDFBox ελληνικά"); // text here may appear garbled; insert any text in Greek or Bulgarian or Maltese
>             contentStream.endText();
>             contentStream.close();
>             doc.save("pdfbox.pdf");
>             System.out.println(" created!");
>         } catch (Exception ioe) {
>             ioe.printStackTrace();
>         } finally {
>             if (doc != null) {
>                 try { doc.close(); } catch (Exception e) {}
>             }
>         }


--
This message was sent by Atlassian JIRA
(v6.2#6252)