pdfbox-dev mailing list archives

From "Antti Lankila (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
Date Mon, 16 Jun 2014 11:13:02 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032334#comment-14032334 ]

Antti Lankila commented on PDFBOX-922:

Ah... there are multiple ways to understand what "identity mapping" means. I've been using
it in the sense that the PDF standard uses: Identity means f(x) = x, which implies that once
you have CIDToGIDMap set to Identity and Encoding set to Identity-H, all the character codes
and CIDs are just GIDs. When I discuss the possibility that CID values would be constrained
to be valid Unicode code points, I use phrasing such as "CIDs are UCS-2". In that case,
of course, we would still have the Identity-H mapping at the character code -> CID layer, but
not at the CID -> GID layer.
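
To make the two layers concrete, here is a minimal Java sketch of the character code -> CID -> GID chain. The names (CidLayers, decodeTwoByteCodes, toGid) are made up for illustration and are not PDFBox API; it only assumes that Identity-H codes are 2 bytes, high-order byte first.

```java
import java.util.function.IntUnaryOperator;

/** Illustrative only: models the code -> CID -> GID layers as two functions. */
final class CidLayers {
    /** Identity in the PDF sense: f(x) = x. */
    static final IntUnaryOperator IDENTITY = x -> x;

    /** Splits a string into 2-byte, big-endian character codes. */
    static int[] decodeTwoByteCodes(byte[] data) {
        if (data.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of bytes");
        }
        int[] codes = new int[data.length / 2];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = ((data[2 * i] & 0xFF) << 8) | (data[2 * i + 1] & 0xFF);
        }
        return codes;
    }

    /** Applies the code -> CID layer, then the CID -> GID layer. */
    static int toGid(int code, IntUnaryOperator codeToCid, IntUnaryOperator cidToGid) {
        return cidToGid.applyAsInt(codeToCid.applyAsInt(code));
    }
}
```

With both layers set to IDENTITY, the character code is the glyph ID. In the "CIDs are UCS-2" variant the first layer stays IDENTITY and only the second becomes a real lookup, for example a function backed by a (hypothetical) Unicode-to-glyph table taken from the font.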

I believe that subsetting fonts is not a problem as long as the subsetting is not
done after the fact by replacing the FontFile parameter. (Or if it is, then a CIDToGIDMap must
be provided that matches the new glyph IDs, as you pointed out.)
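
If a subset does renumber glyphs, the matching CIDToGIDMap could be generated along these lines. As I read the PDF specification, the map stream holds a 2-byte, high-order-byte-first glyph index per CID (bytes 2c and 2c+1 for CID c). CidToGidMapBuilder and its input map are hypothetical names, not PDFBox API, and sending unmapped CIDs to glyph 0 is an assumption of this sketch.

```java
import java.util.Map;

/** Illustrative only: serializes a CID -> new GID table as CIDToGIDMap stream bytes. */
final class CidToGidMapBuilder {
    static byte[] build(Map<Integer, Integer> cidToNewGid) {
        int maxCid = -1;
        for (int cid : cidToNewGid.keySet()) {
            if (cid < 0 || cid > 0xFFFF) {
                throw new IllegalArgumentException("CID out of range: " + cid);
            }
            maxCid = Math.max(maxCid, cid);
        }
        // Two bytes per CID; entries left at zero point to glyph 0 (assumed .notdef).
        byte[] out = new byte[(maxCid + 1) * 2];
        for (Map.Entry<Integer, Integer> e : cidToNewGid.entrySet()) {
            int cid = e.getKey();
            int gid = e.getValue();
            if (gid < 0 || gid > 0xFFFF) {
                throw new IllegalArgumentException("GID out of range: " + gid);
            }
            out[2 * cid] = (byte) (gid >> 8);   // high-order byte first
            out[2 * cid + 1] = (byte) gid;      // low-order byte
        }
        return out;
    }
}
```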

Of course, this only applies to TrueType fonts. Some font types apparently define CIDs to
have a particular meaning, and they come with their own CID to GID programs. I assume such
fonts also provide a meaning for each CID that we could use, such as the Unicode value or PostScript
name for the CID, or some predefined encoding map that defines all valid CIDs and their interpretation.

You are right that the CMap will control the code length. I also can't see any good reason
to generate anything but 16-bit codes: all that matters is that indexing all the glyphs is possible,
and I'm going to guess that there are no non-composite fonts that have more than 65536 glyphs,
so that makes things simple on the generating side. However, existing PDF files could have
combined single/multibyte CMaps, which are required to leave no ambiguity about which
codespace range applies, so the ranges used for 8-bit codes can't also be prefixes of 16-bit
codes, and so on. This is rather complicated, and I doubt that the current code (which is also pretty
ugly to look at) handles it correctly: CodespaceRanges are not sorted by length
as far as I can see.
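
As a rough illustration of the reading side, the sketch below splits a byte string into codes using mixed-length codespace ranges, trying shorter ranges first. It is not the current PDFBox code; CodespaceReader and Range are hypothetical names. It assumes a code belongs to a range when each of its bytes lies between the corresponding low and high bytes of that range.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: splits bytes into codes using mixed-length codespace ranges. */
final class CodespaceReader {
    /** One codespace range; low and high hold unsigned byte values of equal length. */
    static final class Range {
        final int[] low;
        final int[] high;

        Range(int[] low, int[] high) {
            if (low.length != high.length || low.length == 0) {
                throw new IllegalArgumentException("low/high must have the same non-zero length");
            }
            this.low = low;
            this.high = high;
        }

        int length() {
            return low.length;
        }

        /** True if every byte at offset lies within the per-byte bounds. */
        boolean matches(byte[] data, int offset) {
            if (offset + low.length > data.length) {
                return false;
            }
            for (int i = 0; i < low.length; i++) {
                int b = data[offset + i] & 0xFF;
                if (b < low[i] || b > high[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    static List<Integer> split(byte[] data, List<Range> ranges) {
        // Sort by code length so shorter codes are always tried first.
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt(Range::length));

        List<Integer> codes = new ArrayList<>();
        int pos = 0;
        while (pos < data.length) {
            Range hit = null;
            for (Range r : sorted) {
                if (r.matches(data, pos)) {
                    hit = r;
                    break;
                }
            }
            if (hit == null) {
                throw new IllegalArgumentException("no codespace range matches at offset " + pos);
            }
            int code = 0;
            for (int i = 0; i < hit.length(); i++) {
                code = (code << 8) | (data[pos + i] & 0xFF);
            }
            codes.add(code);
            pos += hit.length();
        }
        return codes;
    }
}
```

For example, with a 1-byte range <00>..<80> and a 2-byte range <8140>..<9FFC>, the bytes 41 81 40 20 split into the codes 0x41, 0x8140 and 0x20. If the ranges are properly prefix-free, the order in which they are tried should not change the result; sorting just makes the behaviour predictable when they are not.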

> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> --------------------------------------------------------------------
>                 Key: PDFBOX-922
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-922
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Writing
>    Affects Versions: 1.3.1
>         Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
>            Reporter: Thanos Agelatos
>            Assignee: Andreas Lehmkühler
>         Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it creates, making
> it impossible to create PDFs in any language apart from English and ones supported in WinAnsiEncoding.
> This behaviour is caused because method PDTrueTypeFont.loadTTF has WinAnsiEncoding hardcoded
> inside, and there are no Identity-H or Identity-V Encoding classes provided (to set afterwards
> via PDFont.setFont() )
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese
> The PDF created contains garbled characters and/or squares.
> Simple test case:
> 		PDDocument doc = null;
> 		try {
> 			doc = new PDDocument();
> 			PDPage page = new PDPage();
> 			doc.addPage(page);
> 			// extract fonts for fields
> 			byte[] arialNorm = extractFont("arial.ttf");
> 			//byte[] arialBold = extractFont("arialbd.ttf"); 
> 			//PDFont font = PDType1Font.HELVETICA;
> 			PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
> 			PDPageContentStream contentStream = new PDPageContentStream(doc, page);
> 			contentStream.beginText();
> 			contentStream.setFont(font, 12);
> 			contentStream.moveTextPositionByAmount(100, 700);
> 			// text here may appear garbled; insert any text in Greek, Bulgarian or Maltese
> 			contentStream.drawString("Hello world from PDFBox ελληνικά");
> 			contentStream.endText();
> 			contentStream.close();
> 			doc.save("pdfbox.pdf");
> 			System.out.println(" created!");
> 		} catch (Exception ioe) {
> 			ioe.printStackTrace();
> 		} finally {
> 			if (doc != null) {
> 				try { doc.close(); } catch (Exception e) {}
> 			}
> 		}

This message was sent by Atlassian JIRA
