pdfbox-dev mailing list archives

From "Antti Lankila (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
Date Fri, 13 Jun 2014 09:08:12 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030399#comment-14030399 ]

Antti Lankila edited comment on PDFBOX-922 at 6/13/14 9:04 AM:
---------------------------------------------------------------

Anyway, let's take a look at the changes required in PDFBox to get the text writing to work
properly.

- drawString() in PDPageContentStream just writes the text into the PDF however COSString
chooses to represent it. This is not the right thing to do. When the font is a CID-keyed font,
every glyph code is 16 bits wide by definition, and COSString won't necessarily notice and write
it correctly. Therefore, drawString() must know which font is currently in use, and ask
that font to encode the String to whatever byte sequence it takes to draw those glyphs. So,
PDFont must be added to the drawString() API, and PDFont ought to have a method "public
byte[] encode(String)". I would suggest always writing displayable text as hexadecimal strings
(<hex chars>) because this form is the simplest to implement and the easiest to make bug-free.

- PDFont needs a clearly specified API which performs the Java String to font-specific encoding
transformation. The process is usually called encoding, and yields a byte array, and the reverse
process of taking a byte array and interpreting it as a String is called decoding. Observe that
there are no methods in PDFont called decode(), and I have a hard time figuring out what any
one of these methods actually does, because everything seems to be called "encode" or "lookup".
It seems that encode(byte[], int, int) performs decoding, so it should be renamed accordingly.
In general I'd recommend pushing the encode/decode job down to the font layer. Provide just
two methods: "byte[] encode(String)" and "String decode(byte[])". Their job is to convert
between the byte sequences required by that font and Java Strings, and they handle full runs
of text, not just single characters. They will then use single- or multibyte encodings as
the font requires without the higher level having to do crazy stuff like processEncodedText()
currently does in PDFStreamEngine.

- When implementing encoding, never ask for the char[] array of a Java String. Instead, "for
(int i = 0, cp; i < string.length(); i += Character.charCount(cp)) { cp = string.codePointAt(i);
... now encode the codepoint ... }". This will handle the UTF-16 surrogate pairs correctly.

- There are unfortunately very many ways to encode text in PDF, and especially if text needs
to be decodable from the byte stream generated by other programs, the full complexity must
be faced and implemented. These cases are to be solved on a case-by-case basis in the PDFont
hierarchy. The encode and decode methods of the top-level PDFont class should be defined as
abstract to reflect the fact that encoding depends on the particular subtype of the font. It
seems that Type1, TrueType, Type3, and CIDType0 and CIDType2 fonts require different handling
from each other. It may be that for some of these fonts the implementation is the same because
the actual mechanics can be handled by varying the Encoding instance, though.
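
To make the proposal above concrete, here is a minimal sketch of the encode/decode split. It
is not PDFBox code: SketchFont, SingleByteSketchFont, TwoByteSketchFont and toHexString are
hypothetical names invented for illustration, and both code mappings are toy assumptions (the
character code simply equals the code point). A real CID font would map code points to glyph
IDs through the font's cmap table instead.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical stand-in for PDFont; none of these names are PDFBox API. */
abstract class SketchFont {
    /** Converts a full run of text into the byte sequence this font expects. */
    abstract byte[] encode(String text);

    /** Converts bytes written for this font back into a Java String. */
    abstract String decode(byte[] bytes);

    /** Formats encoded bytes as a PDF hexadecimal string such as <4869>. */
    static String toHexString(byte[] bytes) {
        StringBuilder sb = new StringBuilder("<");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }
}

/** Single-byte font. Assumption: character code == code point for 0..255. */
class SingleByteSketchFont extends SketchFont {
    @Override
    byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Walk code points, not chars, so a surrogate pair is seen as one unit.
        for (int i = 0, cp; i < text.length(); i += Character.charCount(cp)) {
            cp = text.codePointAt(i);
            if (cp > 0xFF) {
                throw new IllegalArgumentException(
                        String.format("U+%04X is not encodable in this font", cp));
            }
            out.write(cp);
        }
        return out.toByteArray();
    }

    @Override
    String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length);
        for (byte b : bytes) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }
}

/** Two-byte font. Assumption: character code == BMP code point, big-endian. */
class TwoByteSketchFont extends SketchFont {
    @Override
    byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0, cp; i < text.length(); i += Character.charCount(cp)) {
            cp = text.codePointAt(i);
            if (cp > 0xFFFF) {
                throw new IllegalArgumentException(
                        String.format("U+%04X does not fit in two bytes", cp));
            }
            out.write(cp >> 8);   // high byte first
            out.write(cp & 0xFF); // then low byte
        }
        return out.toByteArray();
    }

    @Override
    String decode(byte[] bytes) {
        if (bytes.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of bytes");
        }
        StringBuilder sb = new StringBuilder(bytes.length / 2);
        for (int i = 0; i < bytes.length; i += 2) {
            sb.append((char) (((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF)));
        }
        return sb.toString();
    }
}
```

With something along these lines, drawString() would no longer need to guess the byte width:
it could write SketchFont.toHexString(font.encode(text)) followed by the Tj operator, and each
font subtype would decide for itself whether a character takes one byte or two.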


was (Author: alankila@bel.fi):
Anyway, let's take a look at the changes required in PDFBox to get the text writing to work
properly.

- drawString() in PDPageContentStream just writes the text into the PDF however COSString
chooses to represent it. This is not the right thing to do. When the font is a CID-keyed font,
every glyph code is 16 bits wide by definition, and COSString won't necessarily notice and write
it correctly. Therefore, drawString() must know which font is currently in use, and ask
that font to encode the String to whatever byte sequence it takes to draw those glyphs. So,
PDFont must be added to the drawString() API, and PDFont ought to have a method "public
byte[] encode(String)". I would suggest always writing displayable text as hexadecimal strings
(<hex chars>) because this form is the simplest to implement and the easiest to make bug-free.

- PDFont needs a clearly specified API which performs the Java String to Unicode encoding
transformation. The process is usually called encoding, and the reverse process of taking a
byte array and interpreting it as a String is called decoding. Observe that there are no methods
in PDFont called decode(), and I have a hard time figuring out what any one of these methods
actually does, because everything seems to be called "encode" or "lookup". It seems that
encode(byte[], int, int) performs decoding, so it should be renamed accordingly. In general
I'd recommend pushing the encode/decode job down to the font layer. Provide just two methods:
"byte[] encode(String)" and "String decode(byte[])". Their job is to convert between the byte
sequences required by that font and Java Strings, and they handle full runs of text, not just
single characters. They will then use single- or multibyte encodings as the font requires
without the higher level having to do crazy stuff like processEncodedText() currently does
in PDFStreamEngine.

- When implementing encoding, never ask for the char[] array of a Java String. Instead, "for
(int i = 0, cp; i < string.length(); i += Character.charCount(cp)) { cp = string.codePointAt(i);
... now encode the codepoint ... }". This will handle the UTF-16 surrogate pairs correctly.

- There are unfortunately very many ways to encode text in PDF, and especially if text needs
to be decodable from the byte stream generated by other programs, the full complexity must
be faced and implemented. These cases are to be solved on a case-by-case basis in the PDFont
hierarchy. The encode and decode methods of the top-level PDFont class should be defined as
abstract to reflect the fact that encoding depends on the particular subtype of the font. It
seems that Type1, TrueType, Type3, and CIDType0 and CIDType2 fonts require different handling
from each other. It may be that for some of these fonts the implementation is the same because
the actual mechanics can be handled by varying the Encoding instance, though.

> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-922
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-922
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Writing
>    Affects Versions: 1.3.1
>         Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
>            Reporter: Thanos Agelatos
>            Assignee: Andreas Lehmkühler
>         Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff
>
>
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it creates, making
> it impossible to create PDFs in any language apart from English and others supported by
> WinAnsiEncoding. This behaviour is caused by the method PDTrueTypeFont.loadTTF having
> WinAnsiEncoding hardcoded inside, and there are no Identity-H or Identity-V Encoding classes
> provided (to set afterwards via PDFont.setFont()).
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese
> The PDF created contains garbled characters and/or squares.
> Simple test case:
> 		PDDocument doc = null;
> 		try {
> 			doc = new PDDocument();
> 			PDPage page = new PDPage();
> 			doc.addPage(page);
> 			// extract fonts for fields
> 			byte[] arialNorm = extractFont("arial.ttf");
> 			//byte[] arialBold = extractFont("arialbd.ttf"); 
> 			//PDFont font = PDType1Font.HELVETICA;
> 			PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
> 			
> 			PDPageContentStream contentStream = new PDPageContentStream(doc, page);
> 			contentStream.beginText();
> 			contentStream.setFont(font, 12);
> 			contentStream.moveTextPositionByAmount(100, 700);
> 			contentStream.drawString("Hello world from PDFBox ελληνικά"); // text here may appear garbled; insert any text in Greek or Bulgarian or Maltese
> 			contentStream.endText();
> 			contentStream.close();
> 			doc.save("pdfbox.pdf");
> 			System.out.println(" created!");
> 		} catch (Exception ioe) {
> 			ioe.printStackTrace();
> 		} finally {
> 			if (doc != null) {
> 				try { doc.close(); } catch (Exception e) {}
> 			}
> 		}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
