tomcat-users mailing list archives

From Kitching Simon <>
Subject off-topic: handling non-ascii characters in URLs
Date Fri, 05 Jan 2001 10:58:04 GMT
Hi All,

While following a related thread (RE: a simple test to charset), 
a question occurred to me about charset encodings in URLs. 
This isn't really tomcat-related (more to do with HTTP standards) 
but I thought someone here might be able to offer an answer.

When a webserver sends content to a browser, it can indicate
the character data format (ascii, latin-1, UTF-8, etc) in an HTTP
header. However, how is the character data type specified for data
sent *by* a browser *to* a webserver (i.e. a GET or POST action)?

Andre Alves had an example where an e-accent character
was part of the URL. I saw that IE4 replaced this character
with %E9 when submitting a form using GET method, but this
really assumes that the receiving webserver is using latin-1.
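
To make the ambiguity concrete, here is a small sketch (Python chosen
only for illustration, it is not from the original thread) showing that
the same e-accent character percent-encodes differently depending on the
charset the sender assumes, and that the receiver gets it back correctly
only if it assumes the same one. The variable names are hypothetical.

```python
from urllib.parse import quote, unquote

text = "\u00e9"  # e-acute, the character from the example above

# The sender picks a charset before percent-encoding the bytes.
latin1_form = quote(text, encoding="latin-1")  # one byte: 0xE9
utf8_form = quote(text, encoding="utf-8")      # two bytes: 0xC3 0xA9

# The receiver must guess the charset; nothing in the URL states it.
decoded_ok = unquote(latin1_form, encoding="latin-1")
decoded_wrong = unquote(latin1_form, encoding="utf-8")  # invalid UTF-8 byte

print(latin1_form, utf8_form, repr(decoded_ok), repr(decoded_wrong))
```

The %E9 that IE4 produced matches the latin-1 form; a server decoding
that same URL as UTF-8 would not recover the original character.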

There is this thing called an "entity-header" defined in the HTTP
specs, which may contain a "Content-Type" entry with a charset
parameter. This seems to cover POST requests ok then, as the POSTed
data is in an entity-body, and therefore an entity-header can be used
to define its encoding.

But the URLs themselves cannot have their encoding specified by 
an entity-header, because they are not in an entity-body. So does
this mean that all URLs should be restricted to ascii, and forms
should not use GET method unless their data content is guaranteed
to be all-ascii?  I remember seeing an article recently about domain
names now being available in asian ideogram characters, which seems
to indicate otherwise....

Any comments?


