xml-general mailing list archives

From Mike Brown <m...@skew.org>
Subject Re: Character set problems
Date Tue, 07 Mar 2000 02:25:28 GMT
> The idea is that the end-user will use an HTML form to enter information
> (that may contain reserved xml characters like apostrophes, quotes, GTs,
> LTs, ampersands, etc).
> [...]
> some funky international A character.

I was just researching this issue.

My company needs to transfer XML docs over HTTP, and sometimes we want
an HTML form to be the input source. Unfortunately, it looks like
current implementations give you no way to know the encoding of data
coming in from an HTML form, so wherever we can we are just going to
POST "raw" XML outside of the context of form data submissions.

The sad fact is that HTML forms and browser implementations of them are
inconsistent and will make your life miserable.

First, give up on urlencoded forms. You have to do multipart/form-data
submissions with the POST method. URL encoding can only represent byte
values (0x00..0xFF), not characters, and browser implementations are
buggy. You'll very likely get bytes encoded in there that are specific
to the charset used to display the form.
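
To make the charset dependence concrete, here is a minimal Python sketch.
It is not a model of any particular browser; it only percent-encodes the
same character (U+2022, chosen for illustration) under three encodings
picked as assumptions, to show that the resulting bytes differ and carry
no label saying which encoding produced them.

```python
from urllib.parse import quote

# U+2022 BULLET, used here purely as an illustrative test character.
bullet = "\u2022"

# Percent-encode the same character under three different charsets.
# The charset choices are assumptions for the sake of the example.
utf8_form = quote(bullet, encoding="utf-8")
cp1252_form = quote(bullet, encoding="cp1252")
macroman_form = quote(bullet, encoding="mac_roman")

for name, form in [("UTF-8", utf8_form),
                   ("Windows-1252", cp1252_form),
                   ("MacRoman", macroman_form)]:
    print(f"{name:13} {form}")
```

A receiver that sees only the percent-encoded bytes cannot tell which of
these three the sender meant.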

Second, even with multipart/form-data, see what happens when you submit
the same form with the same data (use test data with some funky
characters, like fat bullets, i.e. U+2022) with your browser set to use
different encodings. Same deal as with URL encoding, although not quite
as bad, since in some cases the wonky chars do get through.

Next, observe the fact that no charset parameters are supplied with the
media types in the form data submission. That's right, there's no way
to be sure what is coming in from the form. Is it MacRoman? Windows-1252?
ISO 8859-1? UTF-8? KOI8-R? No way to know for sure.
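
Since nothing in the submission names the charset, the receiving side can
only guess. One common heuristic, sketched below in Python, is to try a
strict UTF-8 decode first and fall back to an assumed legacy charset if
that fails. The function name decode_form_bytes and the Windows-1252
fallback are hypothetical choices for this example, and the result is
still a guess, not a way to know for sure.

```python
def decode_form_bytes(raw: bytes, fallback: str = "cp1252"):
    """Guess the text behind raw form bytes (hypothetical helper).

    Returns (text, charset_guess). Strict UTF-8 is tried first because
    non-ASCII text in a legacy charset rarely happens to be valid UTF-8.
    The fallback charset is an assumption, not something the form told us.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" avoids a crash on bytes the fallback leaves undefined.
        return raw.decode(fallback, errors="replace"), fallback


# The same bullet character arriving as UTF-8 bytes and as Windows-1252 bytes.
from_utf8 = decode_form_bytes(b"\xe2\x80\xa2")
from_legacy = decode_form_bytes(b"\x95")
print(from_utf8, from_legacy)
```

If the sender was actually using MacRoman or KOI8-R, the fallback branch
will quietly produce the wrong characters, which is exactly the problem
described above.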


The international A character you're talking about is indicative of
UTF-8 encoding. If you look at the data with a hex editor you'll
probably see 2 bytes there, the first of which is the one that looks
like the funky A in your non-UTF-8-aware environment. The pair of
bytes is the UTF-8 encoded form of a character outside of the ASCII
range (U+0000..U+007F).
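
A short Python sketch of that effect follows. The original question does
not say which character was entered, so U+00E9 (e with acute accent) is
an assumption here, as is ISO 8859-1 standing in for the
non-UTF-8-aware environment.

```python
# Assumed input character: U+00E9, LATIN SMALL LETTER E WITH ACUTE.
ch = "\u00e9"

# UTF-8 encodes it as two bytes: 0xC3 0xA9.
raw = ch.encode("utf-8")

# A non-UTF-8-aware environment (ISO 8859-1 assumed) shows each byte
# as its own character, so the first byte appears as a "funky A".
misread = raw.decode("latin-1")

print(raw.hex(" "))   # hex dump of the two bytes
print(misread)        # what the misreading environment displays
```

The first character of the misread string is U+00C3 (A with tilde); for
characters in the U+0080..U+00BF range the lead byte is 0xC2, which shows
up as A with circumflex instead.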

   - Mike
________________________________________________________________________
 Mike Brown / Hyperreal   |  Hyperreal http://www.hyperreal.org/music/
 PO Box 61334             |     XML & XSL http://www.skew.org/xml/
 Denver CO 80206-8334 USA |       http://www.hyperreal.org/~mike/
