cocoon-dev mailing list archives

From "Tagunov Anthony" <>
Subject [C2,C1][RT] What can be done about the encoding problems?
Date Sat, 27 Jan 2001 18:49:52 GMT
Hello, everybody!

I've got a problem that I do not know how to solve properly.
I'd like to ask if anybody has any idea how to handle
form data submitted from pages that are encoded with
anything other than Latin-1.

I do not know whether this question has been resolved in the latest Servlet API
versions, but as we know, the reality is that Tomcat 3.x.x (I do not
know about 4.0) has no support for that.

And here is the trouble:

I would like everybody who knows this all already to excuse me
for too many words.

If we have an HTML page in Latin-1 with an input form on it, then
Latin letters are sent as Latin letters, spaces as "+" and some characters
as %xy.

Then all these characters get decoded by the servlet engine and everybody
is happy.
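
To make the Latin-1 case concrete, here is a minimal sketch that uses java.net.URLEncoder to approximate what the browser does with one form value. The class name FormUrlEncodingDemo and the sample text are only illustrations, not anything from Cocoon or Tomcat.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FormUrlEncodingDemo {
    // Approximates how a browser encodes one form value on a page
    // served in pageCharset (application/x-www-form-urlencoded).
    static String encode(String value, String pageCharset) {
        try {
            return URLEncoder.encode(value, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + pageCharset, e);
        }
    }

    public static void main(String[] args) {
        // Latin letters pass through, spaces become '+', and e-acute
        // (byte E9 in Latin-1) becomes %E9.
        System.out.println(encode("caf\u00e9 au lait", "ISO-8859-1")); // caf%E9+au+lait
    }
}
```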

Now, suppose the encoding of the page is windows-1251 (the most reliable
way of declaring the encoding is putting it into the HTTP headers, as we know, and
that's just what Cocoon does :)))

Suppose we have Cyrillic support installed on our Windows and we type
Cyrillic letters into the form in IE (unfortunately Netscape has problems
with Cyrillic fonts more often than IE, but that doesn't matter here; basically
Netscape does the same as IE). We submit the form.

Suppose the Cyrillic letter A was entered. Its code in windows-1251 is
above 127 and at most 255; let us suppose its code in hex is xy. Then
the browser puts %xy in the request. (What happens for POST is
effectively the same. I'm not sure if this letter A gets %-encoded with POST, but
the code would still be xy hex.) Then on the server side the servlet engine
(I've tested it with Tomcat 3.2b7, but I believe that other engines do the same)
gets our %xy code, assumes that it is the code of some Latin-1 symbol and
converts it into the wrong Java character (as Java is Unicode :).

To get it back to normal one has to do
new String((request.getParameter("foo")).getBytes("8859_1"),"windows-1251")

-- i.e. turn it back into bytes and decode them again using the correct encoding.
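
The round trip can be tried without a servlet engine. The following is only a sketch: the class and method names are made up for illustration, decodeAsLatin1 merely imitates what the container is described as doing above, and it assumes the JVM provides the windows-1251 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1251RoundTrip {
    // Assumption: this JVM ships the windows-1251 charset.
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Imitates the container: the submitted bytes are read as Latin-1.
    static String decodeAsLatin1(byte[] submitted) {
        return new String(submitted, StandardCharsets.ISO_8859_1);
    }

    // The workaround: recover the original bytes (Latin-1 maps every byte
    // to exactly one char, so nothing is lost) and decode them properly.
    static String repair(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), CP1251);
    }

    public static void main(String[] args) {
        byte[] submitted = "\u0410".getBytes(CP1251); // Cyrillic A, one byte: C0
        String wrong = decodeAsLatin1(submitted);      // U+00C0, a Latin-1 letter
        String fixed = repair(wrong);                  // U+0410 again
        System.out.println(Integer.toHexString(wrong.charAt(0))
                + " -> " + Integer.toHexString(fixed.charAt(0))); // c0 -> 410
    }
}
```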

If our page is in "utf-8" encoding (and we already have some :))
then the situation is similar:

If we have Cyrillic support in our system and type the letter "A" (Cyrillic) then
its Unicode code (0410 hex) gets converted to UTF-8 bytes (two bytes,
D0 90 hex) and the browser sends them %-encoded, as %D0%90.

What does the servlet engine do? Right! It treats this as two Latin-1
characters. And we again have to 

new String((request.getParameter("foo")).getBytes("8859_1"),"UTF-8")
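
For the UTF-8 case the same check shows both the actual bytes and the "two Latin-1 characters" effect. Again this is just a sketch with illustrative names; the only library calls are String.getBytes and the String(byte[], Charset) constructor.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Renders bytes as upper-case hex, e.g. D090.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Imitates the container: each UTF-8 byte becomes one Latin-1 char.
    static String decodeAsLatin1(byte[] submitted) {
        return new String(submitted, StandardCharsets.ISO_8859_1);
    }

    // Undo the Latin-1 decoding and decode the same bytes as UTF-8.
    static String repair(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] submitted = "\u0410".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(submitted));            // D090
        String wrong = decodeAsLatin1(submitted);
        System.out.println(wrong.length());            // 2
        System.out.println(repair(wrong).equals("\u0410")); // true
    }
}
```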

In general, as far as we can tell,
if the page was sent to the browser with encoding "xyz", the browser
converts the characters in the input forms to the byte sequences representing
these characters in the "xyz" encoding, %-encodes them and sends them to the server.

The servlet containers currently (I believe that the majority of them
do so, please correct me if I'm wrong) treat these as sequences
of bytes representing text in "8859_1".

So to get text back to normal we generally have to 

new String((request.getParameter("foo")).getBytes("8859_1"),"xyz")

I do not know if there's much progress on this problem at Sun, where
they develop the Servlet API, but maybe we should try to solve it
at the Cocoon level, if only because even if Sun proposes a solution
in the Servlet API, many people will in reality still have
to run on servlet engines that do not support it.

Anyway, the end users of Cocoon who have to use character sets other than
"Latin-1" have to solve this problem by themselves. BTW: we feed WAP
phones with UTF-8 encoded pages (I do not know if the WAP gateways
are intelligent enough to be fed with anything else). And if via a mobile
phone we input Cyrillic on a page sent in UTF-8, then it gets sent to our servlet
engine correctly, as UTF-8.

And what this solution could be I do not know currently.
This is why I've written this email. Maybe someone can
propose a solution.

The simplest solution is to place an "encoding=" attribute on
the get-parameter family of tags in the request taglib.

(That is what we have done -- developed a duplicate
taglib for request.xsl and implemented this attribute.)
So maybe this should be the first step in this direction: to
allow enc= or <encoding>..</encoding> with these tags.

Actually this is needed once per page.

But the parameter value, AFAIK, may get used in
other places too. Maybe in sitemap matching,
maybe in an XSLT transform, maybe in a custom-written
transformer. Who knows where else!

Is there any layer planned between HttpServletRequest and
all the Generators, Translators, Sitemap (?), etc. in C2 (sorry, I do not know the
architecture yet!)?

If yes (there is one or one is planned), maybe it's a good place to perform
the translation.
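
To illustrate what such a layer might look like, here is a rough sketch. Everything in it is hypothetical: ParameterSource is a stand-in for the parameter-reading part of HttpServletRequest, and RecodingParameterSource is not an existing Cocoon class. It also assumes the engine really decoded the bytes as Latin-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RecodingLayerDemo {
    // Hypothetical stand-in for the part of HttpServletRequest used here.
    interface ParameterSource {
        String getParameter(String name);
    }

    // Hypothetical decorator: every consumer behind it sees values that
    // were re-decoded once, using the encoding the page was served in.
    static class RecodingParameterSource implements ParameterSource {
        private final ParameterSource delegate;
        private final Charset pageCharset;

        RecodingParameterSource(ParameterSource delegate, Charset pageCharset) {
            this.delegate = delegate;
            this.pageCharset = pageCharset;
        }

        @Override
        public String getParameter(String name) {
            String raw = delegate.getParameter(name);
            if (raw == null) {
                return null; // absent parameters stay absent
            }
            // Assumption: the engine decoded the submitted bytes as Latin-1.
            return new String(raw.getBytes(StandardCharsets.ISO_8859_1), pageCharset);
        }
    }

    public static void main(String[] args) {
        // "foo" holds Cyrillic A as the engine would hand it over for a UTF-8 page.
        ParameterSource engine = name -> "foo".equals(name) ? "\u00d0\u0090" : null;
        ParameterSource fixed = new RecodingParameterSource(engine, StandardCharsets.UTF_8);
        System.out.println("\u0410".equals(fixed.getParameter("foo"))); // true
    }
}
```

With a wrapper of this kind the conversion happens in one place, so sitemap matching, transformers and taglibs would not each need their own encoding attribute.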

The trouble is that the encoding doesn't get sent in the HttpServletRequest
(if only it were, but we've checked it many times in vain! :( ).

So we have to learn the page encoding from
external sources. Actually we need to know which encoding
the page from which the form was submitted was sent in.

I would not like to pass this as a hidden field!

So only the site creator knows that. Maybe it should
be specified in the sitemap?

(This assumes that we can determine the encoding
from the data available for matching, but this is often true;
e.g. we might have (in the near future :) all our pages matching the pattern //english/..
encoded with 8859_1 and all pages matching //russian/.. encoded with
cp1251.) BTW: in C1 we did this by using different media
types, text/html and text/chtml, that differed only in the encoding.
I hope something similar is available :)
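
As a sketch of the idea only (this is not sitemap syntax, and the class, the prefixes and the fallback are all made up for illustration), the site creator's knowledge could be a simple table from path prefix to form encoding:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EncodingByPath {
    // Hypothetical table: path prefix -> encoding of the pages under it.
    private final Map<String, String> prefixToEncoding = new LinkedHashMap<>();
    private final String fallback;

    EncodingByPath(String fallback) {
        this.fallback = fallback;
    }

    // Registers a prefix; returns this so calls can be chained.
    EncodingByPath map(String prefix, String encoding) {
        prefixToEncoding.put(prefix, encoding);
        return this;
    }

    // The first registered prefix that matches wins; otherwise the fallback.
    String encodingFor(String path) {
        for (Map.Entry<String, String> e : prefixToEncoding.entrySet()) {
            if (path.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        EncodingByPath table = new EncodingByPath("8859_1")
                .map("/english/", "8859_1")
                .map("/russian/", "cp1251");
        System.out.println(table.encodingFor("/russian/news.html")); // cp1251
    }
}
```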

Okay, everybody. Thanks for your attention. The questions are:
1) Should it be solved in Cocoon at all? (I believe that -- yes ;)
2) Should there be more than just encoding=xyz in the request taglib, because the parameters
  are used in a wide variety of places in C2?
3) How to solve this elegantly??? (THIS IS THE GREATEST QUESTION, ISN'T IT? :)

Best regards, Tagunov Anthony
