poi-user mailing list archives

From Nick Burch <n...@torchbox.com>
Subject Re: PPT Unicode
Date Wed, 06 Dec 2006 10:36:41 GMT
On Tue, 5 Dec 2006, Tales Paiva Nogueira wrote:
> When PowerPoint stores text in Unicode an unknown char (byte value = 0)
> is placed between every "normal" char making the text 2 times longer 
> than it really is.

TextCharsAtoms, and other Unicode-containing fields in PowerPoint files,
are stored as little-endian UTF-16. That means each character is stored
in two bytes. US-ASCII characters are stored with the second (high-order)
byte zero, but other characters will need to make some use of the second
byte.
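
To make the byte layout concrete, here is a minimal sketch using only the
standard Java library (no POI calls). The class and method names are made
up for illustration. It encodes a string as little-endian UTF-16 and
prints the bytes, so you can see the zero byte after each ASCII character.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    // Encode text as UTF-16, low-order byte first, with no byte-order mark.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_16LE);
    }

    // Render bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ASCII: every second byte is zero.
        System.out.println(hex(encode("Hi")));      // 48 00 69 00
        // Euro sign (U+20AC): both bytes carry data.
        System.out.println(hex(encode("\u20AC")));  // AC 20
    }
}
```

Decoding the same bytes with new String(bytes, StandardCharsets.UTF_16LE)
gives back the original text, which is roughly the conversion you want a
library to do for you rather than reading the bytes yourself.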

If you call getText() on a TextCharsAtom, it'll convert it to a string for 
you. You should really be using that, not getting the bytes directly.


> Is there any way to keep the style information and get the text as a 
> TextByteAtom, instead of TextCharsAtom?

Why? PowerPoint chose a TextCharsAtom rather than a TextBytesAtom because
your string contains at least one character that can't be represented in
a TextBytesAtom.

HSLF supports upgrading a TextBytesAtom to a TextCharsAtom if you try to
set text that can't be held in a TextBytesAtom. It doesn't go the other
way.
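
As a rough illustration of that decision, the sketch below checks whether
a string could fit in a one-byte-per-character atom. It is not HSLF's
actual code: the class and method names are hypothetical, and it assumes
that "fits" means every character has a zero high-order byte.

```java
public class AtomChoice {
    // Assumption: the byte-based atom keeps only the low-order byte of
    // each character, so any char above 0xFF cannot be stored in it.
    static boolean fitsInSingleBytes(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(fitsInSingleBytes("Hello"));        // true
        System.out.println(fitsInSingleBytes("Price: \u20AC")); // false
    }
}
```

Under that assumption, a single character such as the euro sign is enough
to force the whole text run into the two-byte form.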


If you really want just the low order bytes, call getText() on the 
TextCharsAtom, and mangle the string yourself. Not sure why you'd want to 
though....
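
If you do go down that road, the mangling could look something like the
sketch below. It uses plain Java only and starts from a String such as
the one getText() returns. The class and method names are hypothetical.
Note that it silently throws away the high-order byte, so any character
above U+00FF comes out as something else.

```java
public class LowBytes {
    // Keep only the low-order byte of each UTF-16 code unit.
    // This is lossy for any character above U+00FF.
    static byte[] lowOrderBytes(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            out[i] = (byte) (text.charAt(i) & 0xFF);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] ascii = lowOrderBytes("Hi");
        System.out.printf("%02X %02X%n", ascii[0] & 0xFF, ascii[1] & 0xFF); // 48 69

        // The euro sign (U+20AC) loses its high byte and becomes 0xAC.
        byte[] euro = lowOrderBytes("\u20AC");
        System.out.printf("%02X%n", euro[0] & 0xFF); // AC
    }
}
```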

Nick


