tomcat-users mailing list archives

From André Warnier ...@ice-sa.com>
Subject Re: ServletWebRequest.getServletPath() returns strange values on uris with german umlauts
Date Thu, 27 Jan 2011 22:45:39 GMT
asbachb wrote:
> Thanks for you reply.
> 
> I checked my clients request to tomcat which shows that the umlauts are
> correctly replaced with their enities:
> 
> GET
> "http://localhost:8080/wicket-umlauts-1.0-SNAPSHOT/page/param/v%C3%A4lue-xxx"
> 
> This request should be a valid ASCII request and shouldn't be a problem to
> decode?
> 
> 
I understand what you mean, and you are right, but in a case like this you have to be very
careful in your use of vocabulary.
The term "ASCII" is usually reserved for talking about a character set (or alphabet) which
includes only 128 codes, represented by one byte per character, of which the letters are
A-Z and a-z.  Basically thus the English alphabet.
An "umlaut" is a diacritic mark.
A "lowercase a with umlaut" is a letter of the German alphabet (and probably others).
The term "entity" is usually used in the context of XML or HTML, to denote something of 
the form "&xxx;" where "xxx" represents the name of a symbol.
And "/wicket-umlauts-1.0-SNAPSHOT/page/param/v%C3%A4lue-xxx" seems to be the result of 2 
consecutive steps :
a) the client composes a URL as a Unicode String, and encodes it using the UTF-8 encoding
b) after (a), it scans this URL for any byte/character that is not valid in a URL (as per
RFC 2396) and "URL-encodes" it, which consists of replacing the offending byte by its
encoding as "%xy", where "xy" is the hexadecimal representation of the byte value.
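
As a rough sketch of steps (a) and (b) in Java: the class and method names below are made
up for illustration, and the set of characters left alone is my assumption (the
"unreserved" characters of RFC 2396, plus "/"). A real client may well use a different set.

```java
import java.nio.charset.StandardCharsets;

public class UrlEncodeSketch {
    // Assumption: characters passed through unchanged are the RFC 2396
    // "unreserved" set (alphanumerics and marks) plus the path separator '/'.
    private static final String SAFE =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.!~*'()/";

    static String encodePath(String path) {
        StringBuilder out = new StringBuilder();
        // Step (a): encode the Unicode String as UTF-8 bytes.
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80 && SAFE.indexOf((char) v) >= 0) {
                out.append((char) v);
            } else {
                // Step (b): replace the offending byte by "%xy" (hexadecimal).
                out.append(String.format("%%%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00E4 is "lowercase a with umlaut"; its UTF-8 encoding is the bytes C3 A4.
        System.out.println(encodePath("/page/param/v\u00E4lue-xxx"));
        // prints /page/param/v%C3%A4lue-xxx
    }
}
```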

The server, when it receives this request,
c) "URL-decodes" the URL, replacing each "%xy" sequence by the corresponding single-byte code
d) and then, it depends..
If you have told the server to decode the URL (after (c)) as if it was UTF-8/Unicode, then
the server will do that, to generate an internal Java Unicode String.
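
To make (c) and (d) concrete, here is a small Java sketch. It is not Tomcat's code; the
names are invented, and I am assuming ISO-8859-1 as the fallback charset used when nothing
is configured (check the Connector documentation of your Tomcat version for its default).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UrlDecodeSketch {
    // Step (c): replace each "%xy" sequence by the single byte it stands for.
    static byte[] percentDecode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits
            } else {
                out.write(c); // plain ASCII character, one byte
            }
        }
        return out.toByteArray();
    }

    // Step (d): interpret the resulting bytes using some character set.
    static String decode(String s, Charset charset) {
        return new String(percentDecode(s), charset);
    }

    public static void main(String[] args) {
        String raw = "v%C3%A4lue-xxx";
        // Decoded as UTF-8: the two bytes C3 A4 become one character, a-umlaut.
        System.out.println(decode(raw, StandardCharsets.UTF_8).equals("v\u00E4lue-xxx"));      // true
        // Decoded as ISO-8859-1: each byte becomes its own character (U+00C3, U+00A4).
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).equals("v\u00C3\u00A4lue-xxx")); // true
    }
}
```

The same bytes thus give two different Java Strings, depending only on what the server has
been told in (d).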

This is not the default.  You have to tell the server to do that.  With Tomcat, you do
that by using the 'URIEncoding="UTF-8"' attribute of the Connector.
(You cannot in this case use the "useBodyEncodingForURI" attribute, because for a GET
request, there is no body (and thus no body encoding of course)).
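
For example, in conf/server.xml, something like the following. The port matches the one in
your request; the protocol value is only my assumption of a typical HTTP Connector, so
keep whatever other attributes your own Connector already has:

```xml
<!-- Assumed typical HTTP Connector; only URIEncoding is the point here -->
<Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8" />
```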

If you have done that, and your application asks Tomcat for the URL String directly, then
you should get the correct Java (Unicode) String in response.
(You should be able to check this easily with a simple JSP page).

Now if you get this path via a call specific to the "wicket" application you are using, 
then you have to check in that application what happens, to make the result different.
Maybe this "wicket" thing does its own decoding of the path, resulting in a (wrong) 
double-decoding ?
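
To show what I mean by double-decoding, here is a purely hypothetical Java sketch. I have
not looked at what wicket really does; the class and method are invented just to show the
effect of decoding an already decoded String a second time.

```java
import java.nio.charset.StandardCharsets;

public class DoubleDecodeSketch {
    // Hypothetical second decoding step: treat the chars of an already decoded
    // String as single bytes, and decode those bytes as UTF-8 once more.
    static String decodeAgain(String alreadyDecoded) {
        byte[] bytes = alreadyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String correct = "v\u00E4lue-xxx"; // what a UTF-8 aware server hands over
        String mangled = decodeAgain(correct);
        // The lone byte E4 is not valid UTF-8, so Java substitutes U+FFFD.
        System.out.println(mangled.equals(correct));        // false
        System.out.println(mangled.indexOf('\uFFFD') >= 0); // true
    }
}
```

Note that the same step applied to "v\u00C3\u00A4lue-xxx" (the ISO-8859-1 reading of the
bytes) gives back the correct String, which is probably why applications do such things:
it repairs the result of a server that was not told to use UTF-8, and damages the result
of one that was.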

