poi-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From <acoli...@apache.org>
Subject Re: Bad Data Issue in HSSF: Fix Proposed
Date Thu, 30 Jan 2003 16:54:20 GMT
It sounds like you know what you're talking about.  Do it and submit
assocaited unit tests.  Please make sure you test with some non-english test
files (ask for some german, spanish, japanese, russian, korean and chinese
as you'll get the most contributions as we have the largest user bases of
those particular languages).

Pay attention to the "get involved" page so that your patch can be easily
applied.  (our principal method of selecting patches is by picking thte ones
which are submitted as we request, if the person took due care, then its a
reasonable expectation that the code will reflect this).

-Andy

-Andy
----- Original Message -----
From: "Carey Sublette" <carey.sublette@overture.com>
To: <poi-dev@jakarta.apache.org>
Cc: "Johnson Marc (E-mail)" <marc_johnson27591@hotmail.com>
Sent: Thursday, January 30, 2003 11:04 AM
Subject: Bad Data Issue in HSSF: Fix Proposed


> This is an admittedly uncommon issue, but I encounter a problem from time
to
> time involving bad character values in xls files.
>
> According to the Excel97 format, text data is supposed to be in Unicode.
> There are 32 values in ISO-8859-1 (Latin-1) that are invalid, from 0x80 to
> 0x9F, and of course the same is true for the superset of Latin-1 which is
> Unicode.
>
> The Windows platform though has long used an 8 bit encoding called
> windows-1252 (the IANA official designation, also called codepage-1252),
> which extended Latin-1 by assigning 27 values in this range to symbols
that
> have a 16 bit representation in Unicode (see
> http://www.microsoft.com/globaldev/reference/sbcs/1252.htm for the spec).
> See also http://www.iana.org/assignments/character-sets and
> http://www.iana.org/assignments/charset-reg/windows-1252 for reference.
>
> Apparently some versions of Microsoft software (software and OS versions
> unknown) make it possible to generate Excel97 files containing
codepage-1252
> character values. This causes problems when the data is passed on to other
> processes. At the very least, the intended symbol is lost (replaced by a
> "?") but it can also pass the invalid values on (depending of whether the
> text string contained only 8 bit encodings or not) to cause mischief
> elsewhere.
>
> I encounter this in an environment where we import xls files submitted by
> customers through the internet, which leads to a very mixed bag of files
> showing up.
>
> I propose a patch to:
> org.apache.poi.hssf.record.UnicodeString.fillFields
> where the characters are actually copied from the byte buffer to
> field_3_string.
>
> In the case of 8 bit strings (the grbit is 0) the patch would be to use
the
> encoding "Cp1252" instead of  "ISO-8859-1" as the default. Since as
defined
> windows-1252 is a proper superset of Latin-1 this should always work. Pure
> Latin-1 text will be properly translated, and so will windows-1252 (the 27
> symbols not found in Latin-1 will become Unicode values with a non-zero
high
> byte).
>
> Alternatively, the 8 bit string could be scanned for bytes in the range
> 0x80-0x9F and if found the String constructor would be passed the encoding
> "Cp1252" instead of "ISO-8859-1". The only reason for doing it this way is
> would be if one theorized that deviant implementations on some version of
> Java existed, and sought to reduce exposure to this.
>
> For the 16 bit strings the patch would be to filter the characters as the
> copy it done to explicitly translate the 27 characters if found.
>
> There shouldn't be any issues going the other way, that is, writing
Excel97
> documents. Since Unicode is the documented standard (according to pg. 264
of
> the Excel97 Developer's Kit) there would never be a need to generate
> windows-1252 output.
>
> I am implementing my recommended patch right now for my own use, and will
> submit it for incorporation into Poi if you the team is interested.
>
> Carey Sublette
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: poi-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: poi-dev-help@jakarta.apache.org
>


Mime
View raw message