poi-user mailing list archives

From Nick Burch <n...@torchbox.com>
Subject Re: PPT Unicode
Date Wed, 06 Dec 2006 10:36:41 GMT
On Tue, 5 Dec 2006, Tales Paiva Nogueira wrote:
> When PowerPoint stores text in Unicode a unknown char (byte value = 0) 
> is placed between every "normal" char making the text 2 times longer 
> than it really is.

TextCharsAtoms, and other Unicode-containing fields in PowerPoint files, 
are stored as UTF-16 (little-endian). That means two bytes are used to 
store every character. US-ASCII characters are stored with the second byte 
zero, but other characters need to make some use of the second byte.
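
To make the byte layout concrete, here is a minimal sketch using only the Java standard library. It is not HSLF code: the class and method names (Utf16LayoutDemo, toHex) are made up for illustration, and it assumes the little-endian byte order in which the zero byte comes second for ASCII.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: shows the UTF-16LE byte layout, not how HSLF reads a record.
final class Utf16LayoutDemo {
    // Encode text as UTF-16 little-endian and render each byte as two hex digits.
    static String toHex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16LE);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex("Hi"));     // 48 00 69 00 (second byte of each pair is zero)
        System.out.println(toHex("\u20ac")); // ac 20 (euro sign needs both bytes)
    }
}
```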

If you call getText() on a TextCharsAtom, it'll convert it to a string for 
you. You should really be using that, not getting the bytes directly.
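
If you are curious what that conversion amounts to, the sketch below decodes raw UTF-16LE bytes with the standard library. It is only an approximation of the idea, not the actual getText() implementation; the class and helper names (Utf16DecodeDemo, decode) are hypothetical, and the byte array is hand-written sample data rather than bytes read from a real file.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a stand-in for the bytes-to-String step, not HSLF's own code.
final class Utf16DecodeDemo {
    // Hypothetical helper: interpret raw bytes as UTF-16 little-endian text.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // 'P' = 0x50, 'T' = 0x54, each followed by a zero high byte.
        byte[] raw = {0x50, 0x00, 0x50, 0x00, 0x54, 0x00};
        System.out.println(decode(raw)); // PPT
    }
}
```

In practice you would still call getText() on the atom and let the library handle the record details.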

> Is there any way to keep the style information and get the text as a 
> TextByteAtom, instead of TextCharsAtom?

Why? PowerPoint decided to make it a TextCharsAtom, rather than a 
TextBytesAtom, because your string contained at least one character that 
couldn't be represented in a TextBytesAtom.
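
A rough way to see which strings fall on which side is sketched below. It assumes that the single-byte form can only hold characters whose high byte is zero (code points up to 0xFF); that threshold is my assumption for illustration, and the names (SingleByteCheck, fitsInSingleBytes) are hypothetical rather than part of HSLF.

```java
// Illustrative only: the 0xFF threshold is an assumption about the single-byte form.
final class SingleByteCheck {
    // Returns true when every char has a zero high byte, so one byte per char would suffice.
    static boolean fitsInSingleBytes(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(fitsInSingleBytes("Hello"));    // true
        System.out.println(fitsInSingleBytes("\u20ac5"));  // false, the euro sign is U+20AC
    }
}
```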

HSLF supports upgrading a TextBytesAtom to a TextCharsAtom if you try to 
set text that can't be held in a TextBytesAtom. It doesn't do the other way 
around.

If you really want just the low order bytes, call getText() on the 
TextCharsAtom, and mangle the string yourself. Not sure why you'd want to.
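
For what it's worth, the mangling could look something like the sketch below, which works on a plain String such as the one getText() hands back. The names (LowByteMangler, lowOrderBytes) are hypothetical, and note that it is lossy: any character above 0xFF is silently truncated to its low byte.

```java
// Illustrative only: keeps the low-order byte of each char and discards the high byte.
final class LowByteMangler {
    static byte[] lowOrderBytes(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            // Mask off the high byte; chars above 0xFF lose information here.
            out[i] = (byte) (text.charAt(i) & 0xFF);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = lowOrderBytes("Hi");
        System.out.println(bytes.length);                  // 2
        System.out.printf("%02x %02x%n", bytes[0], bytes[1]); // 48 69
    }
}
```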

