pdfbox-dev mailing list archives

From "John Hewson (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
Date Sat, 14 Jun 2014 20:35:03 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031691#comment-14031691
] 

John Hewson edited comment on PDFBOX-922 at 6/14/14 8:34 PM:
-------------------------------------------------------------

{quote}
I do not really understand what makes you say that. Isn't a subsetted font basically just a
wholly different font file, just having a bunch of glyphs removed from the original one? For
instance, assuming it is a TTF file, you drop a bunch of glyphs and then update the cmaps to
reference the appropriate glyph indexes, and then you have a new TTF file. If so, I can't
see the problem because you are providing all the same information as with the original font,
only with fewer glyphs included.
{quote}

You said that you were using "Identity-H for charcode -> CID, and Identity for CID ->
GID", which doesn't involve updating any cmaps. If you remove glyphs from a font then the
GIDs _will_ change, and if you're using an Identity cmap then your CIDs will by definition
change too. But now you mention "update the cmaps", which means it won't be an Identity cmap
any more... so you don't actually want to use an Identity cmap.
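
To illustrate the point about GIDs changing, here is a minimal sketch. It assumes a subsetter that packs the retained glyphs contiguously in their original order; the class and method names are hypothetical and not part of PDFBox.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not a PDFBox class.
final class SubsetRenumbering {
    /** Maps each retained original GID to its GID in the subset font. */
    static Map<Integer, Integer> renumber(SortedSet<Integer> keptGids) {
        Map<Integer, Integer> oldToNew = new LinkedHashMap<>();
        int next = 0;
        for (int oldGid : keptGids) {
            oldToNew.put(oldGid, next++); // glyphs are packed with no gaps
        }
        return oldToNew;
    }

    public static void main(String[] args) {
        SortedSet<Integer> kept = new TreeSet<>();
        kept.add(0);   // .notdef stays at GID 0
        kept.add(36);
        kept.add(72);
        kept.add(300);
        System.out.println(renumber(kept)); // prints {0=0, 36=1, 72=2, 300=3}
    }
}
```

With Identity mappings all the way down, text that was written with the code 300 would now have to be written with the code 3, which is why already-written content streams break.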

{quote}
On the other hand, I do understand that if you write the text stream using the encoding of one
font, then change the definition of the TTF font without re-encoding the text, then you definitely
run into problems. But the only possible way to keep CIDs stable is to define a standard for
them, such as that CIDs are UCS-2.
{quote}

Not necessarily, you could use a CIDToGIDMap which initially is an identity mapping but which
is updated to reflect the new GIDs once the font is subset - that's a pretty good approach.
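
A rough sketch of that approach follows. It assumes the CIDs keep the values of the original GIDs, so the content stream is untouched, and that only the CIDToGIDMap stream data is rebuilt after subsetting. The layout (two bytes per CID, high-order byte first, indexed by CID) follows the PDF specification; the class name and the sample old-to-new GID values are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not a PDFBox class.
final class CidToGidMapSketch {
    /** Builds CIDToGIDMap stream data: bytes 2*cid and 2*cid+1 hold the GID, big-endian. */
    static byte[] build(Map<Integer, Integer> cidToNewGid, int cidCount) {
        byte[] data = new byte[cidCount * 2]; // unmapped CIDs stay 0, i.e. .notdef
        for (Map.Entry<Integer, Integer> e : cidToNewGid.entrySet()) {
            int cid = e.getKey();
            int gid = e.getValue();
            data[cid * 2] = (byte) (gid >> 8); // high-order byte
            data[cid * 2 + 1] = (byte) gid;    // low-order byte
        }
        return data;
    }

    public static void main(String[] args) {
        // CIDs are the original GIDs; values are the GIDs after subsetting.
        Map<Integer, Integer> cidToNewGid = new TreeMap<>();
        cidToNewGid.put(0, 0);
        cidToNewGid.put(36, 1);
        cidToNewGid.put(72, 2);
        cidToNewGid.put(300, 3);
        byte[] data = build(cidToNewGid, 301);
        System.out.println(data.length); // prints 602
    }
}
```

The text keeps using the code 300, and the map redirects it to the glyph's new position in the subset font.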

{quote}
This can be done, but as far as I can tell this limits code points to the less than 0x10000
range because CID font writing writes 16 bit character indexes by definition, and there is
no notion of the surrogate pairs of UTF-16. It might not be a real problem in practice, but
it's nevertheless a limitation that the identity mapping for glyph indexes does not have.
The only limitation of the latter approach is that a single font can't have more than 65536
glyphs.
{quote}

You had said that you wanted to use "identity CID -> GID", but then you're going to need a font
with tens of thousands of empty glyphs in order for that CID to also be a valid Unicode code point...
not what you want.
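
To put rough numbers on that, here is a small sketch. It assumes CID == Unicode code point together with an identity CID -> GID mapping, so every glyph has to sit at the GID equal to its code point. The class and method names are hypothetical.

```java
// Hypothetical helper, not a PDFBox class.
final class UnicodeAsGidSketch {
    /** Glyph slots the font needs when GID == CID == Unicode code point. */
    static int requiredGlyphSlots(String text) {
        return text.codePoints().max().orElse(-1) + 1;
    }

    /** Number of distinct glyphs the text actually uses. */
    static long distinctGlyphs(String text) {
        return text.codePoints().distinct().count();
    }

    /** True if every code point fits in a single 16-bit CID. */
    static boolean fitsIn16Bits(String text) {
        return text.codePoints().allMatch(cp -> cp <= 0xFFFF);
    }

    public static void main(String[] args) {
        // The Greek word from the test case in the issue, written with escapes.
        String greek = "\u03b5\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac";
        System.out.println(requiredGlyphSlots(greek)); // prints 958
        System.out.println(distinctGlyphs(greek));     // prints 7
    }
}
```

Seven glyphs are used but 958 slots are needed, and a single CJK character such as U+4E2D pushes that past 20000.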


> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-922
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-922
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Writing
>    Affects Versions: 1.3.1
>         Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
>            Reporter: Thanos Agelatos
>            Assignee: Andreas Lehmkühler
>         Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff
>
>
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it creates, making
> it impossible to create PDFs in any language apart from English and ones supported in WinAnsiEncoding.
> This behaviour is caused because method PDTrueTypeFont.loadTTF has hardcoded WinAnsiEncoding
> inside, and there is no Identity-H or Identity-V Encoding classes provided (to set afterwards
> via PDFont.setFont() )
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese
> The PDF created contains garbled characters and/or squares.
> Simple test case:
>                 PDDocument doc = null;
> 		try {
> 			doc = new PDDocument();
> 			PDPage page = new PDPage();
> 			doc.addPage(page);
> 			// extract fonts for fields
> 			byte[] arialNorm = extractFont("arial.ttf");
> 			//byte[] arialBold = extractFont("arialbd.ttf"); 
> 			//PDFont font = PDType1Font.HELVETICA;
> 			PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
> 			
> 			PDPageContentStream contentStream = new PDPageContentStream(doc, page);
> 			contentStream.beginText();
> 			contentStream.setFont(font, 12);
> 			contentStream.moveTextPositionByAmount(100, 700);
> 			contentStream.drawString("Hello world from PDFBox ελληνικά"); // text here may appear garbled; insert any text in Greek, Bulgarian or Maltese
> 			contentStream.endText();
> 			contentStream.close();
> 			doc.save("pdfbox.pdf");
> 			System.out.println(" created!");
> 		} catch (Exception ioe) {
> 			ioe.printStackTrace();
> 		} finally {
> 			if (doc != null) {
> 				try { doc.close(); } catch (Exception e) {}
> 			}
> 		}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
