tomcat-users mailing list archives

From André Warnier
Subject Re: Semicolon URI encoding and RFC
Date Mon, 09 May 2011 15:11:15 GMT

This whole question is a pain in the a.., and I personally do not understand how a 
million marketing people can be talking of "web 2.0" and "web 3.0", but have not been able
to come up with an HTTP 2.0 where URLs (and everything else) would by default be 
Unicode/UTF-8 instead of ASCII and/or ISO-latin-1.

But things being what they are, to answer your question to the best of my abilities, and 
trying to avoid jargon and twisted language:
- Basically, a URL "in transit" between a client and a server should contain only *bytes*
with individual byte values between 0 and 127 decimal.
Thus when it is about to send a URL to a server, any client should examine the URL 
byte-by-byte, and if any of these bytes would be outside the 0-127 range, it should 
replace it by a 3-byte sequence %xy, where xy is the hexadecimal representation of the 
byte value.
And then there are some additional rules for some of the bytes 0-127, which either forbid
them in a URL, or specify that you have to encode them with the %xy logic as well, or 
differently (like a space encoded as a "+" in a form-encoded query string, and a "+" 
encoded as %xy), and/or when (as Konstantin explains below for the ";").
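
To make the client-side rule concrete, here is a minimal Java sketch. The class and method 
names are made up for illustration, and it covers only the "bytes outside 0-127" rule, not 
the additional rules for reserved ASCII characters. Note that the client has to pick a 
character set before it has any bytes to encode at all:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of Tomcat or the JDK.
class PercentEncodeSketch {

    /** Replaces every byte outside 0-127 by %XY and leaves ASCII bytes untouched. */
    static String encodeNonAscii(String url, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : url.getBytes(charset)) {
            int v = b & 0xFF;              // view the byte as an unsigned value 0-255
            if (v > 127) {
                out.append('%').append(String.format("%02X", v));
            } else {
                out.append((char) v);      // plain ASCII byte, copied as-is
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String path = "/caf\u00e9";        // "/café"
        // Same characters, two different "in transit" forms:
        System.out.println(encodeNonAscii(path, StandardCharsets.UTF_8));      // /caf%C3%A9
        System.out.println(encodeNonAscii(path, StandardCharsets.ISO_8859_1)); // /caf%E9
    }
}
```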

At the server side, the first thing the server should do with this URL is to make 
the inverse translation: examine the URL and replace any %xy sequence by the single byte
value which this sequence represented in transit (and, in a form-encoded query string, 
"+" by space).
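
As a rough sketch of that inverse translation (again with made-up names, and assuming the 
input contains only ASCII characters, as a URL in transit should), the result is an array 
of *bytes*, not yet a string of characters:

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, for illustration only; not how Tomcat implements it.
class PercentDecodeSketch {

    /** Turns each %XY back into one byte; optionally maps "+" to a space. */
    static byte[] decodeToBytes(String encoded, boolean plusAsSpace) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < encoded.length()) {
            char c = encoded.charAt(i);
            if (c == '%') {
                if (i + 2 >= encoded.length()) {
                    throw new IllegalArgumentException("truncated escape at index " + i);
                }
                int hi = Character.digit(encoded.charAt(i + 1), 16);
                int lo = Character.digit(encoded.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("bad escape at index " + i);
                }
                out.write((hi << 4) | lo);   // two hex digits become one byte
                i += 3;
            } else {
                out.write(plusAsSpace && c == '+' ? ' ' : c);
                i++;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : decodeToBytes("/caf%C3%A9", false)) {
            System.out.printf("%02X ", b & 0xFF);   // prints 2F 63 61 66 C3 A9
        }
        System.out.println();
    }
}
```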

And /then/ starts the circus.

Because there is nothing in the RFCs that would enable the server to know, after this 
URL-decoding, in which character set the client expressed this URL.
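
A small Java illustration of the problem, assuming the server has already URL-decoded the 
path into raw bytes (the class and method names are hypothetical). The same bytes give two 
different strings depending on which character set the server guesses:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example, for illustration only.
class CharsetAmbiguitySketch {

    /** Interprets already URL-decoded bytes under an assumed character set. */
    static String interpret(byte[] raw, Charset assumed) {
        return new String(raw, assumed);
    }

    public static void main(String[] args) {
        // The bytes a server holds after decoding "caf%C3%A9".
        byte[] raw = {'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9};

        // Guess 1: the client meant UTF-8, so the last two bytes are one character.
        System.out.println(interpret(raw, StandardCharsets.UTF_8));      // café

        // Guess 2: the client meant ISO-8859-1, so they are two characters.
        System.out.println(interpret(raw, StandardCharsets.ISO_8859_1)); // cafÃ©
    }
}
```

Both results are "valid"; nothing in the request itself says which one the client intended 
(and what the console shows also depends on its own encoding).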

So basically, the interpretation of at least part of the URL falls to the server-side 
application, and the client is supposed to send "the right thing" so that the application
does not get confused. And there is no real way for the server to force the client to do 
the right thing.
And if either side does not respect whatever convention they have between them, one of the
sides will get confused.

To my knowledge, there exists no Internet RFC which contradicts what I am writing above.
It is a definite hole in the specs, and one which nowadays is costing a lot of time
lost in confusion and half-way patching attempts (*).
I can understand that when HTTP 1.0 was first defined 15 years ago now, this was a 
perfectly valid position to take.  But I personally do not understand why nowadays, 15 
years and 100 million worldwide webservers later, and now that Unicode/UTF-8 support is 
ubiquitous, we are still at the same point.

(*) such as IE's "always send URLs as UTF-8" and Tomcat's "useBodyEncodingForURI" hacks.

Mindaugas Žakšauskas wrote:
> On Mon, May 9, 2011 at 2:03 PM, Konstantin Kolinko wrote:
> <..>
>> If ";" is part of the actual path, it must be escaped.
>> If ";" starts a "path parameter" it must be unescaped. One well-known
>> example is ";jsessionid" path parameter.
> Thanks for your answer. Is this rule just a "de facto" rule, or is it
> documented anywhere in RFC3986/RFC2396?
> Extending my question, are there clear criteria which would define
> which characters always need escaping and which don't? At the moment I
> am escaping everything that is not unreserved [1], but I am not sure
> about SEOability and user-friendliness - this especially concerns path
> with international characters in URLs, e.g. http://site/pathąčęė
> I have also found a similar Tomcat bug [2], but it is addressing
> slightly different issue.
> If anyone wants to benefit from this, I have just added 50 bonus points to
> my SO question [3]. The main question I want to get answer for is -
> which characters can and which need escaping, both in terms of RFC and
> Tomcat.
> Regards,
> Mindaugas
> 1. According to RFC 3986, unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
> 2.
> 3.