tomcat-users mailing list archives

From André Warnier
Subject Re: getRequestURI() in relation to Connector.URIEncoding
Date Sun, 17 Feb 2013 16:54:15 GMT
Mike Wilson wrote:
> Hi Chris,
> I'm aware of the two levels of encoding but I'm wondering whether 
> servlet specification writers were :-)
> Here are two examples from Tomcat 7 running with URIEncoding="UTF-8".
> Example 1: path /ä in URL-encoded Unicode as sent from browser
>   GET /%C3%A4
>   request.getRequestURI() -> "/%C3%A4"
>   request.getPathInfo()   -> "/ä"
> Example 2: path /ä in "binary" Unicode
>   GET /.. [0xC3,0xA4]
>   request.getRequestURI() -> "/.." [0xC3,0xA4]
>   request.getPathInfo()   -> "/ä"
> So here we can see that getRequestURI() returns the path completely
> undecoded, ie doesn't apply URL decoding nor character decoding. In
> example 1 this is what I expected, but in example 2 the result is
> that getRequestURI() returns a String containing undecoded binary.
> I would expect a String to have been converted to the appropriate
> character set, otherwise the method should return a byte[].
> Internally Tomcat deals with both these examples as we can see
> getPathInfo() always return the correct decoded path, so I guess 
> this issue is all about how to interpret the servlet specification. 
> The servlet 3.0 pdf doesn't give any details on the getRequestURI() 
> method, so the only clue I can find is the getRequestURI() javadoc 
> text:
>   "The web container does not decode this String."
> but the examples given in javadoc only illustrates the removal of
> query string and don't go into any kind of encoding.
> So the question is if the javadoc "does not decode" text:
> - only applies to URL-encoding (so non-URL-encoded values should
>   go through character set decoding)
> - or, applies also when only character encoding is used (in which 
>   case I think the specification has a bug, as getRequestURI() 
>   then should return byte[])
> ?
> [Naturally, not doing URL-decoding also means that the underlying
> character encoding remains untouched. The "bug" here is when only
> character encoding is present. F ex, this appears in some mod_jk
> configurations.]

(being in a contest with Mark E. here,)
My 2.5 cents, as someone who is not an expert at Java nor Tomcat per se, but who has spent
an extensive amount of time on the question of dealing with multiple character sets in a
web context.

I believe that your example #2 above is simply illegal.
One is not supposed to send such bytes in a URL without URL-encoding them.
That's per the HTTP RFC itself :
RFC 2616 3.2.2 & 3.2.3
-> RFC 2396 part 2. URI Characters and Escape Sequences

And I believe that the fact that Tomcat is returning the "correct" translation in the
corresponding request.getPathInfo() is purely accidental, and it could be argued that this
is a bug in Tomcat : the request should probably have been rejected, because the requested
URL was invalid.
But it was not rejected, so it filtered further down, and because you did specify that the
URL-encoding was to be seen as UTF-8, something further down the line converted this
2-byte UTF-8 sequence into the appropriate internal representation of the character "ä" in
Java, as seen in your logging of request.getPathInfo().

(See RFC 2616, 5.1.2 Request-URI :
"The Request-URI is transmitted in the format specified in section 3.2.1. If the
Request-URI is encoded using the "% HEX HEX" encoding [42], the origin server MUST decode
the Request-URI in order to properly interpret the request. Servers SHOULD respond to
invalid Request-URIs with an appropriate status code.")

So if we disregard this invalid URL example #2 (since it is invalid and thus any further 
behaviour could be considered as "undefined"), we are left with the general case #1.

The RFCs 2616 and 2396 do not mandate any specific character set/encoding for the request.
The only thing that they say is that if the request contains bytes other than the ones
considered as "reserved" or "safe", they should be "URL-encoded" prior to transmission by
the client to the server; and that the first thing that the server should do on reception
is to "URL-decode" them and restore the original byte representation, as the client meant
to send it.
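
To make that "restore the original bytes" step concrete, here is a minimal sketch in plain
Java. The class and method names (PercentDecodeSketch, percentDecode, hex) are made up for
illustration and this is not Tomcat's code. The point is only that URL-decoding yields
bytes, not characters. Rejecting raw non-ASCII input is my own assumption, following the
argument above that such a request is invalid.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only; hypothetical names, not Tomcat's implementation.
final class PercentDecodeSketch {

    // Undo "% HEX HEX" escapes and return the raw bytes the client meant to send.
    static byte[] percentDecode(String raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '%') {
                if (i + 2 >= raw.length()) {
                    throw new IllegalArgumentException("truncated escape at index " + i);
                }
                int hi = Character.digit(raw.charAt(i + 1), 16);
                int lo = Character.digit(raw.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("bad hex digits at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 2; // skip the two hex digits just consumed
            } else if (c < 0x80) {
                out.write(c); // plain ASCII passes through unchanged
            } else {
                // Assumption: unescaped non-ASCII is treated as an invalid URL.
                throw new IllegalArgumentException("unescaped non-ASCII at index " + i);
            }
        }
        return out.toByteArray();
    }

    // Render bytes as space-separated hex, for printing.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Example #1 from the quoted message.
        System.out.println(hex(percentDecode("/%C3%A4"))); // prints 2F C3 A4
    }
}
```

Nothing in this step says which character those bytes stand for.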

And here is one area where the specs are failing : there is no way, in the HTTP protocol,
for the client to indicate to the server what the original character set/encoding of the
URL is; so how can the server know ?

My own interpretation would be as follows :
- in the absence of any other information, the URL after URL-decoding should be viewed as
being in the ISO-8859-1 encoding, as this is the "default character set/encoding" for HTTP
(1.1) in general.
- and any other interpretation depends on a prior agreement between client and server.
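
As an illustration of why such an agreement matters, the sketch below interprets the same
three URL-decoded bytes under both assumptions. It is plain Java with a hypothetical class
name, not anything taken from Tomcat.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; hypothetical names.
final class CharsetInterpretationSketch {

    // Read the bytes as ISO-8859-1: every byte becomes exactly one character.
    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Read the bytes as UTF-8: 0xC3 0xA4 together form one character.
    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The bytes of "/%C3%A4" after URL-decoding.
        byte[] path = { '/', (byte) 0xC3, (byte) 0xA4 };

        // ISO-8859-1 gives 3 characters: '/', U+00C3, U+00A4.
        System.out.println(asLatin1(path).length()); // prints 3
        // UTF-8 gives 2 characters: '/', U+00E4 (the "ä" of the example).
        System.out.println(asUtf8(path).length());   // prints 2
    }
}
```

Both readings are valid strings, so the server cannot detect from the bytes alone that it
picked the wrong one.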

And the URIEncoding attribute of the Tomcat Connector can be considered as such a prior
client-server agreement, like : "in all the applications accessed through this Connector,
the client and the server agree beforehand that any URLs requested by the client will be
Unicode, UTF-8 encoded".

In other words, if your application can guarantee that any request URL sent by one of its
clients will be UTF-8 encoded, /then/ you can use the URIEncoding="UTF-8" attribute in
Tomcat.  And only then.
(because e.g. if one of the client users /types/ a URL in the URL bar of his browser, and
this URL happens to target your Tomcat application, you can never be sure that the URL
will be UTF-8 encoded when the browser sends it, because that depends on the settings in
the browser)
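
A rough way to see what happens when that guarantee does not hold is the following Java
sketch. The class and method names are hypothetical. It uses java.net.URLEncoder and
URLDecoder, which implement form encoding rather than URL path encoding (they treat "+" and
spaces differently), but for a single "ä" the escapes come out the same, so take it as an
approximation of the browser and server sides.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Illustrative sketch only; hypothetical names.
final class EncodingMismatchSketch {

    // What a client would put on the wire if it encodes with the given charset.
    static String wire(String text, String clientCharset) {
        try {
            return URLEncoder.encode(text, clientCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Client encodes with one charset, server decodes with (possibly) another.
    static String roundTrip(String text, String clientCharset, String serverCharset) {
        try {
            return URLDecoder.decode(wire(text, clientCharset), serverCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String a = "\u00E4"; // the character "ä"

        System.out.println(wire(a, "UTF-8"));      // prints %C3%A4
        System.out.println(wire(a, "ISO-8859-1")); // prints %E4

        // Agreement holds: the server recovers the original character.
        System.out.println(roundTrip(a, "UTF-8", "UTF-8").equals(a));      // prints true
        // Agreement broken: the server ends up with something else.
        System.out.println(roundTrip(a, "ISO-8859-1", "UTF-8").equals(a)); // prints false
        System.out.println(roundTrip(a, "UTF-8", "ISO-8859-1").equals(a)); // prints false
    }
}
```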

The URIEncoding attribute is something which Tomcat adds, outside the HTTP specification
(and even outside the Servlet Spec, AFAIK), to make life easier for the Tomcat application
programmers : because Tomcat webapps are written in Java; because the internal character
set of Java is Unicode; and because it is likely, on a Tomcat host, that all static and
JSP pages will be saved as UTF-8 encoded, therefore it is easier to allow the programmer
to just "assume" that when he uses request.getPathInfo() (or similar calls like
request.getParameter()), he will get a Java string, properly decoded, if the client sent
it that way (which in the general case it would mostly do).

And then, to get back to the initial question, I would assume that request.getRequestURI()
is really meant as a "low-level" call, which returns the request URI "as is", before /any/
interpretation has taken place (not even the URL-decoding (which should happen first), and
much less any character set decoding (which should happen later)).
While the other calls (like request.getPathInfo()) are higher-level calls, which return
strings which have already been URL-decoded and character-set decoded.

And if you want to see the underlying issues in all their glory, I suggest the following 
experiment :
1) in a Linux system's shell window, set your locale to one based on UTF-8. (and make sure
that your "terminal" is also set that way).
    Then inside one of your webapp's directories, create a file named "ÄÖÜ.txt" (I am
assuming that you can enter that, considering your examples above), with some text A in
it.  After creating the file, do an "ls" and a "cat" to see what you got.
2) change your locale and client settings to one based on ISO-8859-1, and create another
file named "ÄÖÜ.txt", with some different text B content.  Do an "ls" and a "cat" again,
to see that you really have 2 files with different names and contents.
3) now use a browser (preferably IE for once), and try to request either one of these 
files through Tomcat, by typing your request in the browser's URL bar.
You can play around with the settings of the browser (send URLs as ..), with the
URIEncoding attribute in the Tomcat Connector, and the "locale" under which Tomcat is started.
To vary a bit, you can also try to put the corresponding links in a couple of html pages,
with different encodings for the pages.
For even more fun, you can also create a little webapp which will accept the name of the 
desired file as a request parameter, open it and return its content.
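
The reason steps 1) and 2) end up as two different files can be seen without a filesystem
at all. The Java sketch below (hypothetical class name, nothing Tomcat-specific) prints the
bytes of the same visible name under each encoding. On a filesystem that stores names as
raw bytes, as assumed in the experiment, these are simply two different names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; hypothetical names.
final class FileNameBytesSketch {

    // Encode the name with the given charset and render the bytes as hex.
    static String hexOf(String name, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : name.getBytes(charset)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String name = "\u00C4\u00D6\u00DC.txt"; // "ÄÖÜ.txt"

        // UTF-8 locale: each umlaut takes two bytes, 10 bytes in total.
        System.out.println(hexOf(name, StandardCharsets.UTF_8));
        // prints C3 84 C3 96 C3 9C 2E 74 78 74

        // ISO-8859-1 locale: each umlaut takes one byte, 7 bytes in total.
        System.out.println(hexOf(name, StandardCharsets.ISO_8859_1));
        // prints C4 D6 DC 2E 74 78 74
    }
}
```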

It is only to English-speaking Java programmers writing English-speaking applications that
the matter may appear simple and settled.
