poi-dev mailing list archives

From bugzi...@apache.org
Subject [Bug 57008] Writing _x0427_ to a string cell changes the string to some strange UTF-8 character
Date Wed, 17 Jan 2018 20:45:53 GMT
https://bz.apache.org/bugzilla/show_bug.cgi?id=57008

--- Comment #18 from Greg Woolsey <gwoolsey@apache.org> ---
I always go back to the standards doc when I find myself going around in
circles. Here's what it says about escaped strings:

22.4.2.4 bstr (Basic String)
This element defines a binary basic string variant type, which can store any
valid Unicode character. Unicode characters that cannot be directly represented
in XML as defined by the XML 1.0 specification, shall be escaped using the
Unicode numerical character representation escape character format _xHHHH_,
where H represents a hexadecimal character in the character's value. [Example:
The Unicode character 8 is not permitted in an XML 1.0 document, so it shall be
escaped as _x0008_. end example] To store the literal form of an escape
sequence, the initial underscore shall itself be escaped (i.e. stored as
_x005F_). [Example: The string literal _x0008_ would be stored as
_x005F_x0008_. end example]

The possible values for this element are defined by the W3C XML Schema string
datatype.

I think POI should assume it needs to apply this escaping whenever it sets a
string value on a CT* class, and to unescape whenever it reads a string value
from one. I don't think POI should attempt to unescape at any other time.
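
To make the round trip concrete, here is a minimal sketch of the escaping
rules quoted above. It is not POI's actual code: the class BstrEscapes and its
methods are hypothetical names chosen for illustration, and the check for
"representable in XML 1.0" is a simplified version of the XML Char production
that assumes surrogate pairs in the input are well formed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of POI: illustrates the _xHHHH_ rules only.
final class BstrEscapes {
    // One escape sequence: underscore, 'x', four hex digits, underscore.
    private static final Pattern ESCAPE = Pattern.compile("_x([0-9A-Fa-f]{4})_");

    private BstrEscapes() {}

    // Simplified XML 1.0 Char check on a single UTF-16 code unit.
    // Assumption: surrogates are let through because they are correctly paired.
    static boolean isXmlChar(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || Character.isSurrogate(c);
    }

    // Apply when storing a value: escape characters XML cannot carry, and
    // escape the leading underscore of anything that already looks like _xHHHH_.
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '_' && ESCAPE.matcher(raw).region(i, raw.length()).lookingAt()) {
                out.append("_x005F_"); // literal underscore of a literal sequence
            } else if (!isXmlChar(c)) {
                out.append(String.format("_x%04X_", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Apply when reading a stored value: decode each _xHHHH_ exactly once.
    static String unescape(String stored) {
        Matcher m = ESCAPE.matcher(stored);
        StringBuilder out = new StringBuilder(stored.length());
        int last = 0;
        while (m.find()) {
            out.append(stored, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(stored, last, stored.length());
        return out.toString();
    }
}
```

Under these assumptions, escape("_x0427_") yields _x005F_x0427_, and
unescape of that stored form gives back the literal text _x0427_ rather than
the Cyrillic character U+0427. Unescaping a second time is what would turn
the literal into that character, which is why the unescape step belongs at
exactly one place.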
