httpd-users mailing list archives

From André Warnier ...@ice-sa.com>
Subject [users@httpd] what is the charset of a URL ?
Date Sat, 07 Feb 2009 21:30:56 GMT
Hi.

I have been wondering for a while about how a server application should 
really treat the "query string" part of a URL, in terms of character 
encoding.  I am talking about a URL of the form
http://hostname/somepath?name1=value1&name2=value2..&nameN=valueN
(the part after the question mark)

Starting with a quote from
http://www.w3.org/TR/html401/interact/forms.html#h-17.3 :

accept-charset = charset list [CI]
     This attribute specifies the list of character encodings for input 
data that is accepted by the server processing this form. The value is a 
space- and/or comma-delimited list of charset values. The client must 
interpret this list as an exclusive-or list, i.e., the server is able to 
accept any single character encoding per entity received.
     The default value for this attribute is the reserved string 
"UNKNOWN". User agents may interpret this value as the character 
encoding that was used to transmit the document containing this FORM 
element.

Some people (myself included), after trying to digest the various RFCs 
and other recommendations that seem to deal with the subject (e.g. 
RFC 3986 and the document above), come to the conclusion that the 
character set and/or encoding of the query string, after 
percent-decoding, is basically undefined from a server's point of view.
Others seem to be convinced that it is Unicode encoded as UTF-8.
Yet others believe that it is, by default, ISO-8859-1.
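
As an illustration only, here is a minimal Python sketch of why the 
percent-decoded query string is ambiguous. The language and the sample 
value "café" are assumptions made for the example, not something taken 
from any specification. Percent-decoding gives the server bytes, and the 
same bytes can be read under more than one charset without any error.

```python
from urllib.parse import quote, unquote_to_bytes

value = "café"  # hypothetical form value, chosen for illustration

# The same text percent-encodes differently depending on the charset
# the browser happened to pick.
as_utf8 = quote(value, encoding="utf-8")          # 'caf%C3%A9'
as_latin1 = quote(value, encoding="iso-8859-1")   # 'caf%E9'

# Percent-decoding only yields bytes, not characters.
raw = unquote_to_bytes(as_utf8)                   # b'caf\xc3\xa9'

# Both decodings succeed; only one returns the original text.
print(raw.decode("utf-8"))        # café
print(raw.decode("iso-8859-1"))   # cafÃ©
```

Nothing in the request line itself says which of the two readings the 
sender intended.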

So which is it?
If I take the above quotation for instance, the part "User agents *may* 
interpret " (the emphasis is mine only) kind of bothers me, in the sense 
that it implies that the browser can do what it wants anyway.
The other part that bothers me is that according to the above, the 
"accept-charset" attribute can specify *a list* of character encodings, 
and not just one.
Then the above goes on to say "the server is able to accept any single 
character encoding per entity received". What is an "entity" in this 
case? Are we talking about the whole form submission (the entire query 
string), or about individual data items, i.e. the individual 
"name=value" pairs?

So basically, what will the browser pick, and how would the server know 
what it picked ?

One could argue that the server should only send forms as follows :
- the server response to the browser should contain a "Content-Type:" 
header that specifies not only the MIME type "text/html" (or 
equivalent), but also a "charset" parameter.
- the html document being sent should contain a <meta> tag that 
explicitly provides the document charset/encoding, like
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />.
- the <form> in the document should specify an "accept-charset" 
attribute, preferably with a single charset/encoding like "utf-8".
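
The three points above could be sketched as follows, assuming a Python 
server-side application. The function name build_form_response, the 
form field name1 and the action path are hypothetical, chosen only for 
the example; this is one possible shape, not a reference implementation.

```python
def build_form_response(action, charset="utf-8"):
    """Return (headers, body) for a page that states its charset in the
    HTTP header, in a <meta> tag, and in the form's accept-charset."""
    html = (
        "<html><head>\n"
        f'<meta http-equiv="Content-Type" content="text/html;charset={charset}" />\n'
        "</head><body>\n"
        f'<form method="get" action="{action}" accept-charset="{charset}">\n'
        '<input type="text" name="name1" />\n'
        '<input type="submit" />\n'
        "</form>\n"
        "</body></html>\n"
    )
    # The HTTP header carries the same charset as the <meta> tag.
    headers = {"Content-Type": f"text/html; charset={charset}"}
    # The body is actually encoded in the charset that is declared.
    return headers, html.encode(charset)


headers, body = build_form_response("/somepath")
print(headers["Content-Type"])
```

Keeping the charset in a single parameter at least guarantees that the 
three declarations cannot contradict each other.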

That's all well and good, but

a) if the incoming URL is something typed by a user in the URL bar of 
the browser, there is no such previous response sent by the server.
b) HTTP being a stateless protocol, the server should not have any 
recollection anyway that it previously sent such a form to the same 
browser (yesterday?), so when a request comes in, the server doesn't 
know any of the things above for sure.
c) the browser may decide to do whatever it pleases and disregard what 
the server told it (IE comes to mind, practical examples on request).
It would then be in violation of the specifications, but considering 
the above I'm not so sure it is clear-cut.

For a while now, I have resorted to doing all of the things above and, 
in addition, to always sending forms that specify 
enctype="multipart/form-data", for which the problem should not exist.
I also make sure that each form contains a hidden field holding a 
string whose content is known to the application, which upon form 
submission can be checked for any discrepancy (at least between UTF-8 
and an ISO-8859 encoding; it unfortunately cannot distinguish between 
different ISO-8859 encodings).
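
A minimal sketch of such a hidden-field check, assuming Python and a 
URL-encoded (query string) submission. The field name enc_check, the 
sentinel value and the function name are all hypothetical; any fixed 
string containing a non-ASCII character would serve as the sentinel.

```python
from urllib.parse import unquote_to_bytes

# Assumed sentinel: a fixed non-ASCII string known to the application.
SENTINEL = "é"


def guess_submission_charset(raw_query, field="enc_check"):
    """Guess the charset from the raw, still percent-encoded query string.

    Returns "utf-8", "iso-8859-1", or None if the hidden field is
    missing or does not decode to the sentinel under either charset.
    """
    for pair in raw_query.split("&"):
        name, _, value = pair.partition("=")
        if name != field:
            continue
        data = unquote_to_bytes(value.replace("+", " "))
        # Try UTF-8 first: ISO-8859-1 never fails to decode, so it
        # has to be the fallback.
        for charset in ("utf-8", "iso-8859-1"):
            try:
                if data.decode(charset) == SENTINEL:
                    return charset
            except UnicodeDecodeError:
                pass
        return None
    return None


print(guess_submission_charset("name1=x&enc_check=%C3%A9"))  # utf-8
print(guess_submission_charset("name1=x&enc_check=%E9"))     # iso-8859-1
```

A browser that submitted the form as ISO-8859-15 would also send %E9 
for this sentinel, so the check reports "iso-8859-1" in that case too, 
which is the limitation mentioned above.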

But that seems like hideous overkill, and it is still not totally 
foolproof.
(multipart/form-data also has the drawback that it does not play 
very well with some authentication schemes that use redirects.)

It seems to me that the specifications are still not clear and/or not 
tight enough.

Am I missing something ?

(And yes, I know about Punycode, but in my understanding that applies 
to DNS hostnames, not to query strings.)
