pdfbox-dev mailing list archives

From "John Hewson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-922) True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
Date Sat, 14 Jun 2014 20:55:03 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031695#comment-14031695 ]

John Hewson commented on PDFBOX-922:

drawString() in PDPageContentStream just writes the text into PDF as any COSString would choose
to represent it. This is not the right thing to do. When the font is a CID keyed font, every
glyph is 16 bits wide by definition, and COSString won't necessarily notice and write it correctly.

Not quite: every CID can be up to 16 bits wide, but many (or, for fonts with fewer than 256
glyphs, all) will fit inside 8 bits.

Therefore, drawString() must know what font is currently being drawn, and ask that font to
encode the String to whatever byte sequence it takes to draw those glyphs. So, PDFont must
be added to the drawString() API, and PDFont ought to have a method for "public byte[] encode(String)".

drawString() is only valid after setFont() has been called, so it doesn't need adding to the
API, we can just use the current font. PDFont#encode is a good idea, yes.
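
A rough sketch of that arrangement, to make the idea concrete. Everything here is hypothetical: SketchFont and SketchContentStream are stand-ins invented for illustration, not PDFBox classes, and the two-byte font simply pretends that the character code equals the UTF-16 code unit, which a real CID-keyed font would not do.

```java
// Hypothetical stand-ins for PDFont and PDPageContentStream; none of these
// classes exist in PDFBox.
abstract class SketchFont {
    /** Converts a whole run of text to the bytes this font expects. */
    abstract byte[] encode(String text);
}

/** One byte per character; anything above 0xFF becomes '?'. */
class OneByteSketchFont extends SketchFont {
    @Override
    byte[] encode(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out[i] = (byte) (c < 256 ? c : '?');
        }
        return out;
    }
}

/** Two bytes per character; ASSUMES code == UTF-16 code unit (illustration only). */
class TwoByteSketchFont extends SketchFont {
    @Override
    byte[] encode(String text) {
        byte[] out = new byte[text.length() * 2];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out[2 * i] = (byte) (c >> 8);
            out[2 * i + 1] = (byte) (c & 0xFF);
        }
        return out;
    }
}

/** Remembers the current font, so drawString() needs no font parameter. */
class SketchContentStream {
    private final StringBuilder content = new StringBuilder();
    private SketchFont currentFont;

    void setFont(SketchFont font) {
        currentFont = font;
    }

    void drawString(String text) {
        if (currentFont == null) {
            throw new IllegalStateException("setFont() must be called before drawString()");
        }
        // Write the font-specific bytes as a PDF hex string followed by Tj.
        content.append('<');
        for (byte b : currentFont.encode(text)) {
            content.append(String.format("%02X", b & 0xFF));
        }
        content.append("> Tj\n");
    }

    String contents() {
        return content.toString();
    }
}
```

The point is only that the content stream delegates to whatever font was last set; how each font subtype maps characters to codes is the hard part and is not shown.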

PDFont needs a clearly specified API which performs java String to font-specific encoding

Yes, as above.

Observe that there are no methods in PDFont called decode(), and I have a hard time figuring
out what any one of these methods actually does, because everything seems to be called "encode"
or "lookup". It seems that encode(byte[], int, int) performs decoding, so it should be
renamed accordingly.

Yes, I don't know if anybody knows what methods are actually doing, including the original

In general I'd recommend pushing the encode/decode job down to the font layer. Provide just
two methods: "byte[] encode(String)" and "String decode(byte[])". Their job is to convert
between the byte sequences required by that font and java Strings, and they handle full runs
of text, not just single characters. They will then use single- or multibyte encodings as
the font requires without the higher level having to do crazy stuff like processEncodedText()
currently does in PDFStreamEngine.

processEncodedText() is indeed crazy and needs fixing, but what you propose won't work because
the 16-bit string encoding is not set by the font, it's set on a per-string basis by having
that string start with a BOM.
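
A minimal sketch of such a per-string check, assuming the BOM in question is the UTF-16BE byte order mark (0xFE 0xFF). The class name is hypothetical, and the non-BOM branch is approximated with ISO-8859-1 rather than a full PDFDocEncoding table.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a PDFBox class.
class BomStringSketch {
    /** Decodes raw string bytes, choosing the encoding from the leading BOM. */
    static String decode(byte[] raw) {
        boolean hasBom = raw.length >= 2
                && (raw[0] & 0xFF) == 0xFE
                && (raw[1] & 0xFF) == 0xFF;
        if (hasBom) {
            // Skip the two BOM bytes and read the rest as big-endian UTF-16.
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
        }
        // ASSUMPTION: ISO-8859-1 stands in for the single-byte encoding here.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }
}
```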

There are unfortunately very many ways to encode text in PDF, and especially if text needs
to be decodable from the byte stream generated by other programs, the full complexity must
be faced and implemented. These are to be solved on a case-by-case basis in the PDFont hierarchy.
The encode and decode methods of the top-level PDFont class should be declared abstract to reflect
the fact that encoding depends on the particular subtype of the font.

Yes, though as far as decoding the correct text is concerned all you have to do is make sure
that the ToUnicode map is built correctly - you can put any old garbage in the actual strings
(and many PDFs do).
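
To illustrate why only the map matters for extraction, here is a small sketch. It assumes fixed-width character codes and a plain code-to-Unicode table; ToUnicodeSketch is a hypothetical class, not PDFBox's CMap handling, and a real ToUnicode CMap also has code space ranges that this ignores.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a ToUnicode lookup; not a PDFBox class.
class ToUnicodeSketch {
    private final Map<Integer, String> codeToUnicode = new HashMap<>();
    private final int bytesPerCode;

    ToUnicodeSketch(int bytesPerCode) {
        this.bytesPerCode = bytesPerCode;
    }

    void map(int code, String unicode) {
        codeToUnicode.put(code, unicode);
    }

    /** Turns the string bytes into text using nothing but the map. */
    String decode(byte[] raw) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i + bytesPerCode <= raw.length; i += bytesPerCode) {
            int code = 0;
            for (int j = 0; j < bytesPerCode; j++) {
                code = (code << 8) | (raw[i + j] & 0xFF);
            }
            String unicode = codeToUnicode.get(code);
            // Unmapped codes become U+FFFD (replacement character).
            text.append(unicode != null ? unicode : "\uFFFD");
        }
        return text.toString();
    }
}
```

The codes 0x0001 and 0x0002 mean nothing by themselves, but with map(1, "H") and map(2, "i") the bytes 00 01 00 02 extract as "Hi".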

It may be that for some of these fonts the implementation is the same because the actual mechanics
can be handled by varying the Encoding instance, though.

Maybe, though the Encoding class is for Type1 fonts (and equivalent, e.g. Type1C) only.

> True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)
> --------------------------------------------------------------------
>                 Key: PDFBOX-922
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-922
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Writing
>    Affects Versions: 1.3.1
>         Environment: JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0
>            Reporter: Thanos Agelatos
>            Assignee: Andreas Lehmkühler
>         Attachments: pdfbox-unicode.diff, pdfbox-unicode2.diff
> PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it creates, making
> it impossible to create PDFs in any language apart from English and ones supported in WinAnsiEncoding.
> This behaviour is caused because method PDTrueTypeFont.loadTTF has hardcoded WinAnsiEncoding
> inside, and there are no Identity-H or Identity-V Encoding classes provided (to set afterwards
> via PDFont.setFont() )
> This excludes the following languages plus many others:
> - Greek
> - Bulgarian
> - Swedish
> - Baltic languages
> - Maltese
> The PDF created contains garbled characters and/or squares.
> Simple test case:
> 		PDDocument doc = null;
> 		try {
> 			doc = new PDDocument();
> 			PDPage page = new PDPage();
> 			doc.addPage(page);
> 			// extract fonts for fields
> 			byte[] arialNorm = extractFont("arial.ttf");
> 			//byte[] arialBold = extractFont("arialbd.ttf"); 
> 			//PDFont font = PDType1Font.HELVETICA;
> 			PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
> 			PDPageContentStream contentStream = new PDPageContentStream(doc, page);
> 			contentStream.beginText();
> 			contentStream.setFont(font, 12);
> 			contentStream.moveTextPositionByAmount(100, 700);
> 			// text here may appear garbled; insert any text in Greek, Bulgarian or Maltese
> 			contentStream.drawString("Hello world from PDFBox ελληνικά");
> 			contentStream.endText();
> 			contentStream.close();
> 			doc.save("pdfbox.pdf");
> 			System.out.println(" created!");
> 		} catch (Exception ioe) {
> 			ioe.printStackTrace();
> 		} finally {
> 			if (doc != null) {
> 				try { doc.close(); } catch (Exception e) {}
> 			}
> 		}
