cocoon-dev mailing list archives

From Pier Fumagalli <p...@betaversion.org>
Subject Re: [RT] About charsets (character encoding) and servlet API
Date Sun, 30 May 2004 01:08:19 GMT
On 29 May 2004, at 16:11, Antonio Gallardo wrote:
>
> I think most of us are using servlet containers with servlet specs 2.3
> or superior. In that way, I think it is time to move to a higher
> servlet API specs? I think just this little things are enough.

I've been doing i18n work on Servlets for a _very_ long time and, dude, 
I've never seen a problem with the API ever...

Let's split the problem into three parts: headers, body and URLs:

--------
HEADERS:
--------

Now, the HTTP spec defines that a header needs to follow the RFC-822 
section 3.1 specification, therefore (I'm going on memory here, not 
cross checking) the header name must be composed of only a strict 
subset of US-ASCII characters, and the header value can ONLY be made up 
of ISO-8859-1 characters.

No problemo here...

At around page 16 of the RFC-2616 Roy also mentions that IF you want to 
encode something in headers that IS NOT encodable in ISO-8859-1, you 
gotta follow RFC-2047 (Mime Part 3) which defines clearly how such 
values are encoded...

Now, when we do a setHeader in the response, or do a getHeader from the 
request, the servlet container SHOULD parse/encode the values in 
the correct way, although I've never seen any of them doing it (they 
simply ignore the whole shebang, use ISO-8859-1 for both header 
names and values, and don't do any additional parsing/encoding).

Bug in the servlet containers...
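
To make the RFC-2047 point concrete, here is a minimal sketch of what an application could do by hand when the container does not encode header values. The class and method names are made up for illustration, and it only covers the simple case of a short value sent as a single Base64 "encoded-word":

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of the servlet API or of Cocoon.
class HeaderEncoding {

    // True if every character fits in ISO-8859-1 (code points 0..255).
    static boolean fitsLatin1(String value) {
        for (int i = 0; i < value.length(); i++) {
            if (value.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Returns the value untouched when it is plain ISO-8859-1, otherwise
    // wraps its UTF-8 bytes in an RFC 2047 "B" (Base64) encoded-word.
    // Simplification: RFC 2047 caps an encoded-word at 75 characters, so
    // long values would need splitting, which this sketch does not do.
    static String encodeHeaderValue(String value) {
        if (fitsLatin1(value)) {
            return value;
        }
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return "=?UTF-8?B?" + Base64.getEncoder().encodeToString(utf8) + "?=";
    }

    public static void main(String[] args) {
        System.out.println(encodeHeaderValue("plain"));
        System.out.println(encodeHeaderValue("\u65e5\u672c"));
    }
}
```

The result of encodeHeaderValue is pure US-ASCII, so it survives a container that blindly treats header values as ISO-8859-1. Whether the receiving side decodes it is a separate question.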

-----
BODY:
-----

RFC-2616 is _very_ clear at this point, if you don't specify the 
charset token in the "Content-Type" header, and you specify (or imply) 
that the body is "text/something" you SHOULD assume that you're 
receiving / sending text encoded in ISO-8859-1...
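
As a rough sketch of that rule, the following picks the charset out of a "Content-Type" value and falls back to ISO-8859-1 when the token is missing. The class name is hypothetical, the parsing is deliberately naive (split on ";"), and it applies the default to any media type rather than checking for "text/..." first:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper showing the "no charset means ISO-8859-1" default.
class BodyCharset {

    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            String[] parts = contentType.split(";");
            // parts[0] is the media type; parameters start at index 1.
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = param.substring("charset=".length()).trim();
                    // Drop optional surrounding double quotes.
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    // Throws an IllegalArgumentException subclass for
                    // unknown or malformed charset names.
                    return Charset.forName(name);
                }
            }
        }
        // No charset token: fall back to the HTTP/1.1 default.
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));
        System.out.println(charsetOf("text/plain"));
    }
}
```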

Again, I seriously don't think that servlet containers check for the 
encoding of the request body when the content type is 
"application/x-www-form-urlencoded", because I _suppose_ that given 
that it doesn't start with "text/..." they ignore the whole shebang...

So, I believe that in some cases, the encoding of parameters returned 
by servlet containers MIGHT be wrong (but I ain't sure, haven't checked 
that lately).

When you send, on the other hand, the servlet API doesn't have much 
functionality until 2.4 to set the charset encoding of the response, 
but that _really_ affected only stupid JSPs which were never thought 
out right anyway...

In Cocoon (I hope) we should never rely on the "getWriter()" returned 
by the servlet container but ALWAYS use a "getOutputStream()" and 
ALWAYS set the content type with the proper "charset" token...

If we don't we're kinda violating 3.4.1 of RFC-2616 as it says that one 
SHOULD always put the charset in there (if relevant, of course).
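
The sending side could look roughly like the sketch below. To keep it runnable without a servlet container it writes to a plain OutputStream; the class and method names are invented for illustration. In a real servlet the assumption is that you would call response.setContentType(contentType(...)) and pass response.getOutputStream() to writeBody:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: we pick the charset, we announce it, we encode with it.
class StrictSender {

    // Builds a Content-Type value that always carries the charset token.
    static String contentType(String mimeType, Charset charset) {
        return mimeType + "; charset=" + charset.name();
    }

    // Encodes the text ourselves instead of trusting a container-supplied
    // Writer, so the bytes on the wire match the announced charset.
    static void writeBody(OutputStream out, String text, Charset charset)
            throws IOException {
        Writer writer = new OutputStreamWriter(out, charset);
        writer.write(text);
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        Charset charset = StandardCharsets.UTF_8;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeBody(out, "\u65e5\u672c", charset);
        System.out.println(contentType("text/html", charset));
        System.out.println(out.size() + " bytes written");
    }
}
```

The point of the design is that one Charset value drives both the header and the encoder, so the two cannot drift apart.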

So, the problem is only in reading parameters, and that should be fixed 
at the servlet container level.

----
URL:
----

URLs are important as sometimes the request parameters are passed as 
a query string attached to them...

Initially they were defined in US-ASCII and/or ISO-8859-1 (can't 
remember which one exactly), and all non-printable characters had 
to be encoded with the usual percent-number-number format...

Great...

Between the W3C and RFC-2718 someone decided (at the end of the whole 
discussion) that URLs, in their internationalizable format only had to 
change in one aspect: the character encoding.

So, a URL nowadays (tested on my girlfriend's Japanese Internet Explorer) 
is a sequence of bytes representing a string encoded in UTF-8, and the 
same rule applies of encoding the characters outside of the 
originally-defined printable ones with the usual percent-number-number 
re-encoding...
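
A small sketch of why the decoding charset matters: the same percent-encoded bytes give two Japanese characters when read as UTF-8 and six junk characters when read as ISO-8859-1. The class name is made up, and the Charset overloads of URLEncoder/URLDecoder used here need Java 10 or later:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of percent-decoding with the right and the wrong charset.
class UrlCharsetDemo {

    // Percent-encodes the UTF-8 bytes of the text, as a modern browser would.
    static String encodeUtf8(String text) {
        return URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    // Percent-decodes, interpreting the resulting bytes in the given charset.
    static String decode(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) {
        String original = "\u65e5\u672c";
        String encoded = encodeUtf8(original);
        System.out.println(encoded);

        String right = decode(encoded, StandardCharsets.UTF_8);
        String wrong = decode(encoded, StandardCharsets.ISO_8859_1);
        System.out.println(right.equals(original)); // true
        System.out.println(wrong.length());         // 6, one char per byte
    }
}
```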

Again, I seriously don't think that any servlet container does this 
check, so, if we get wrong request parameters when someone browses in 
Japanese and posts a GET form, it's not our fault...

-----------
CONCLUSION:
-----------

I believe Jon Postel once said "be strict in what you send, be liberal 
in what you accept" and this principle has been forgotten by the 
servlet-container implementors...

We can be as strict as we like by sending the right stuff (the 
servlet API allows us to do it by using OutputStream(s) instead of 
Writer), but we cannot be liberal in what we accept, as URLs and request 
parameters are already pre-parsed for us into nice unicode-based Java 
String(s).

As far as I can see (and by the "trick" you outlined)

new String(value.getBytes("8859_1"), "utf-8")

servlet containers simply ignore that there's a world out there that 
DOES NOT speak English, and take shortcuts to increase their parsing 
speed...
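
For the record, here is a sketch of why that hack works and when it bites. It assumes the container really did decode the UTF-8 bytes as ISO-8859-1, which maps every byte to exactly one char and is therefore reversible. The class and method names are made up for the demo:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of the ISO-8859-1 to UTF-8 round-trip hack.
class Latin1RoundTrip {

    // Simulates what the container is assumed to do: take the UTF-8 bytes
    // the browser sent and decode them as ISO-8859-1.
    static String misdecode(String sent) {
        return new String(sent.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.ISO_8859_1);
    }

    // The hack: get the original bytes back, then decode them as UTF-8.
    static String recover(String value) {
        return new String(value.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sent = "\u65e5\u672c";
        String broken = misdecode(sent);
        System.out.println(broken.length());              // 6
        System.out.println(recover(broken).equals(sent)); // true

        // Caveat: applied to a value the container decoded correctly, the
        // hack destroys it, because these chars have no ISO-8859-1 byte.
        System.out.println(recover(sent).equals(sent));   // false
    }
}
```

So the hack is only safe when you know for sure which wrong charset the container used, which is exactly why it is a hack.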

Unfortunately, there's not much we can do (apart from brutal hacks like 
the one mentioned above) to get parameters from my girlfriend's 
Japanese browser.

One thing we could do, though, is to make sure that the communities 
building our servlet container of choice are aware of those problems, 
so, rather than reinventing hacks in Cocoon, I'd say, post those issues 
as bugs for Tomcat and Jetty and let them sort out the whole mess...

It ain't our fault, and unfortunately, we can properly fix only 
one side of the story: what we send...

	Pier
