Date: Wed, 6 Dec 2006 10:36:41 +0000 (GMT)
From: Nick Burch
To: POI Users List
Subject: Re: PPT Unicode
In-Reply-To: <4575DC24.3030107@great.ufc.br>

On Tue, 5 Dec 2006, Tales Paiva Nogueira wrote:
> When PowerPoint stores text in Unicode a unknown char (byte value = 0)
> is placed between every "normal" char making the text 2 times longer
> than it really is.

TextCharsAtoms, and other Unicode-containing fields in PowerPoint files,
are stored as UTF-16 (little-endian). That means two bytes are used to
store every character. US-ASCII characters are stored with the second
byte zero, but other characters need to make some use of the second byte.

If you call getText() on a TextCharsAtom, it'll convert the bytes to a
String for you. You should really be using that, not reading the bytes
directly.

> Is there any way to keep the style information and get the text as a
> TextByteAtom, instead of TextCharsAtom?

Why? PowerPoint decided to make it a TextCharsAtom, rather than a
TextBytesAtom, because your string contained at least one character that
couldn't be represented in a TextBytesAtom.

HSLF supports upgrading a TextBytesAtom to a TextCharsAtom if you try to
set text that can't be held in a TextBytesAtom. It doesn't do the
conversion the other way around.

If you really want just the low-order bytes, call getText() on the
TextCharsAtom, and mangle the string yourself. Not sure why you'd want
to, though....

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List: http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project: http://jakarta.apache.org/poi/