poi-dev mailing list archives

From Toshiaki Kamoshida <kamoshida.toshi...@future.co.jp>
Subject Re: sheet names and string format read garbled on EBCDIC machine
Date Mon, 07 Apr 2003 09:35:43 GMT

I have some knowledge about encoding problems in POI, and about
Japanese text, so let me describe it. I'd be glad if it helps your
discussion :)

>Do you mean the Excel file itself knows and can communicate what
>encoding is used? Is it somewhere in its record structure? Otherwise,
>how are you going to know what encoding to use, unless you let the
>client application pass it in?

I don't have the bible of the Excel file format :P, but in my local
experiments I found that each Record object that contains a string
also carries an indication of how that string is encoded.
There are 2 cases.

1. 16-bit Unicode (like UTF-16LE, which you are calling ENCODING_UTF_16)
2. 16-bit Unicode, but with the high byte of each character cut off.
   This may simply be ISO-8859-1. I don't know how the codes
   U+00A0-U+00FF should be handled in the XLS format XP

Perhaps case 2 is used in the English locale, and case 1 in other
regions.

BoundSheetRecord#field_4_compressed_unicode_flag indicates which case
applies. In HeaderRecord and FooterRecord this is not implemented
(I have just submitted a patch, Bug 17039, because it is only a simple
bug, a lack of implementation :D), but there is a byte that serves as
a flag to indicate it.
I don't know about the values in each cell XP
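
To make the two cases concrete, here is a minimal sketch of how one
flag byte could select the decoding. This is not POI's actual code:
the class and method names are made up for illustration, and I am
assuming that bit 0 of the flag byte is clear for case 2 (one byte
per character) and set for case 1 (two bytes per character,
little-endian).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI.
class ExcelStringSketch {

    /**
     * Decodes raw string bytes according to a one-byte flag.
     * Assumption: bit 0 clear = high bytes cut off (one byte per char),
     *             bit 0 set   = full 16-bit characters, little-endian.
     */
    static String decode(int flag, byte[] data) {
        boolean sixteenBit = (flag & 0x01) != 0;
        // ISO-8859-1 maps each byte 0x00-0xFF straight to U+0000-U+00FF,
        // which is the same as putting a zero high byte back.
        return sixteenBit
                ? new String(data, StandardCharsets.UTF_16LE)
                : new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Case 2: "Sheet1", one byte per character.
        byte[] compressed = {0x53, 0x68, 0x65, 0x65, 0x74, 0x31};
        // Case 1: U+65E5 U+672C (two kanji) as UTF-16LE.
        byte[] wide = {(byte) 0xE5, 0x65, 0x2C, 0x67};

        System.out.println(decode(0, compressed).equals("Sheet1"));
        System.out.println(decode(1, wide).equals("\u65e5\u672c"));
    }
}
```

With this reading, a byte such as 0xE9 in case 2 simply becomes
U+00E9, so the range U+00A0-U+00FF needs no special handling on the
Java side. Whether Excel itself agrees is the open question.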

And inside a Java process, the character encoding should be 16-bit
Unicode. That is in the Java specification.
Even if a String object contains characters that are not proper
16-bit Unicode, it is NOT YOUR BUSINESS. The responsibility for
handling it with the correct semantics lies with each application's
developers.
We Japanese often use local encodings called Shift_JIS, EUC_JP or
JIS (or Windows-31J XP). When we write a Java program, we often use
java.io.Reader or java.io.Writer to convert between the native
encoding and 16-bit Unicode. We also often use other ways to manage
the encoding of Strings. So what you should do, I feel, is to say
"POI will accept only true 16-bit Unicode" (and fix small bugs :D).
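
For reference, this is roughly what the application side looks like
when it does that conversion itself. It is only a sketch: the class
and method names are made up, and I am assuming the JDK in use ships
the Shift_JIS charset (the usual full JDKs do).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical application-side helper, not part of POI.
class NativeEncodingSketch {

    /** Reads bytes in a native encoding into a Java (16-bit Unicode) String. */
    static String decode(InputStream in, Charset nativeCharset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, nativeCharset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Writes a Java String back out in a native encoding. */
    static byte[] encode(String text, Charset nativeCharset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, nativeCharset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and flushed) before the bytes are read.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        // U+65E5 U+672C encoded in Shift_JIS.
        byte[] nativeBytes = {(byte) 0x93, (byte) 0xFA, (byte) 0x96, (byte) 0x7B};

        String unicode = decode(new ByteArrayInputStream(nativeBytes), sjis);
        // At this point "unicode" is an ordinary Java String that can be
        // handed to a library expecting 16-bit Unicode.
        System.out.println(unicode.equals("\u65e5\u672c"));
        System.out.println(encode(unicode, sjis).length);
    }
}
```

The point is that the conversion happens at the edge of the
application, so the library only ever sees ordinary Java Strings.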

I'm now using POI with a lot of Japanese Strings (16-bit Unicode),
and it works well (except HSSFHeader & HSSFFooter :)).


> -----Original Message-----
> From: Andrew C. Oliver [mailto:acoliver@apache.org] 
> Sent: Thursday, April 03, 2003 1:30 PM
> To: POI Developers List
> Cc: 'POI Users List'
> Subject: Re: sheet names and string format read garbled on EBCDIC
> machine
> Quite possibly.  Good point.  Perhaps you can work with some of the 
> Japanese folks on the list in order to create appropriate patches/unit 
> tests.
> Remember, it's not only the right encoding that's at work, but what Excel 
> will accept.


Toshiaki Kamoshida

