pdfbox-users mailing list archives

From Cornelis Hoeflake <c.hoefl...@postex.com>
Subject Weird behaviour PDF from Word for Mac
Date Wed, 02 Apr 2014 12:55:44 GMT
Hi,

When I execute PDFStreamEngine.processStream(PDPage, PDResources,
COSStream) I see very weird behaviour in the TextPositions.
Every TextPosition that should be a single 'space' instead consists of
multiple characters (TextPosition.getCharacter), with these code points:
9, 13, 32, 160

When I step through the code that fills the font's cmap in the debugger, I
see a byte array of:
[0, 9, 0, 13, 0, 32, 0, -96], which is interpreted as a String with UTF-16BE
encoding. Huh? -96?
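
As far as I can tell, the -96 is only the debugger showing a signed Java
byte: -96 is 0xA0, which is 160, the non-breaking space. A minimal sketch
that decodes the byte array above with the plain JDK (no PDFBox involved;
the class and method names are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

final class CMapBytesDemo {
    // Decode the raw bytes as UTF-16BE and return the resulting code points.
    static int[] codePoints(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16BE).codePoints().toArray();
    }

    public static void main(String[] args) {
        // The byte array seen in the debugger; -96 is the signed view of 0xA0.
        byte[] raw = {0, 9, 0, 13, 0, 32, 0, -96};
        for (int cp : codePoints(raw)) {
            System.out.printf("U+%04X (%d)%n", cp, cp);
        }
    }
}
```

This prints U+0009 (9), U+000D (13), U+0020 (32) and U+00A0 (160), which
matches the four characters I get from TextPosition.getCharacter. So the
decoding itself looks consistent; the odd part is that one glyph maps to
four characters (tab, carriage return, space, no-break space).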

Copying the text from Adobe Reader on Windows and pasting it into Notepad
'adds' a newline at every space.

To reproduce:
Create a simple document in Word for Mac (newest version) using the font
Cambria. The document contains only 'a a'. Save the document as PDF (via
Save-As).

When using the font Verdana instead of Cambria, the problem does NOT occur.
Doing the same in Word for Windows, the problem does NOT occur either.

So my conclusion is that it is an issue in Word for Mac with the Cambria
font. Can anyone confirm that?

Either way, my PDFBox code has to handle it correctly. What is a safe
assumption? Can I safely assume that when TextPosition.getCharacter returns
multiple characters, the extra ones can be ignored? Or should I look for
this specific byte sequence ending with the -96?
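
To make the question concrete, this is the kind of workaround I have in
mind: treat a string as one ordinary space when every character in it is
space-like. It is only a sketch under that assumption, not something PDFBox
provides; SpaceNormalizer and its methods are hypothetical names, and it
takes the String from TextPosition.getCharacter as input so it runs without
PDFBox:

```java
final class SpaceNormalizer {
    // True if s is non-empty and every code point is space-like.
    // Both checks are needed: Character.isWhitespace rejects the
    // no-break space (160), and Character.isSpaceChar rejects tab (9)
    // and carriage return (13).
    static boolean isAllSpace(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        return s.codePoints().allMatch(cp ->
                Character.isWhitespace(cp) || Character.isSpaceChar(cp));
    }

    // Collapse an all-space string to a single space; leave anything else alone.
    static String normalize(String s) {
        return isAllSpace(s) ? " " : s;
    }
}
```

With this, SpaceNormalizer.normalize("\t\r \u00A0") gives " ", while
strings that contain any other character are returned unchanged. I do not
know whether that is safe for other fonts that legitimately map one glyph
to several characters, which is why I am asking.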

Kind regards,
Cornelis Hoeflake
