tomcat-users mailing list archives

From André Warnier ...@ice-sa.com>
Subject Re: Char Encoding text streams on Tomcat 5.5 and Linux
Date Wed, 02 Dec 2009 10:34:38 GMT
Hi.
Just a quick line: thank you for the test. I am not forgetting
this, since I would really like to get to the bottom of it.
I am fairly busy for the next 2 days, however, and will revisit it
after the current rush.
I have a definite case where I am forced to set Tomcat's startup locale 
to iso-8859-1, otherwise I am getting wrong encodings. But I need to 
review the exact characteristics of the case before continuing.


Christopher Schultz wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> André,
> 
> On 11/30/2009 7:39 PM, André Warnier wrote:
>> Well, just make a simple test :
>> (I don't really know how to handle JSP pages, I only do servlets and
>> filters, otherwise I'd do it myself).
> 
> :)
> 
>> -  create a simple html form with a UTF-8 charset, with a simple text
>> input box. Give it a method=POST.
> 
> Done:
> 
> <%@page language="Java" contentType="text/html" pageEncoding="UTF-8" %>
> <html>
>   <body>
> <%
>   if("POST".equals(request.getMethod())) {
> %>
>   <p>
>     file.encoding: <%= System.getProperty("file.encoding") %><br />
>     ContentType: <%= request.getContentType() %><br />
>     Charset: <%= request.getCharacterEncoding() %><br />
>   </p>
>   <p>
>     Received text from client: <%= request.getParameter("q") %>
>   </p>
> <%
>   }
> %>
>     <form method="POST" accept-charset="UTF-8">
>       <input name="q" type="text" value="<%= request.getParameter("q") %>" />
> 
>       <input type="submit" />
>     </form>
>   </body>
> </html>
> 
>> - then start Tomcat alternatively under a UTF-8 locale, then an
>> ISO-8859-1 locale, type some accented characters in your input box, and
>> submit the form.
> 
> Here's what I get when I submit your name, properly accented, into this
> form:
> 
> "
> file.encoding: UTF-8
> ContentType: application/x-www-form-urlencoded
> Charset: null
> 
> Received text from client: AndrÃ©
> "
> 
> [Note that this is "Andr" followed by a capital "A" with a tilde (~) on
> top of it, followed by a copyright symbol "(c)", as two separate characters.]
> 
> file.encoding is already UTF-8 and still this "bug" presents itself.
> 
> I tried re-starting Tomcat with LANG=en_US.ISO-8859-1 and I get this result:
> 
> "
> file.encoding: ANSI_X3.4-1968
> ContentType: application/x-www-form-urlencoded
> Charset: null
> 
> Received text from client: AndrÃ©
> "
> 
> The output is the same, except for the file.encoding which has changed:
> Tomcat bones the interpretation of these strings in both situations.
> 
> Note that the client supplied no character encoding along with the
> request, and that the form indicates that it accepts only UTF-8
> encoding. Here, the client has screwed everything up by not including
> the encoding of the form. Here's what gets sent over the wire:
> 
> Content-Type: application/x-www-form-urlencoded
> Content-Length: 12
> q=Andr%C3%A9
> 
> Notice that the bytes are Andr + 0xC3 0xA9
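
To make the two readings of that body concrete, here is a minimal sketch using the standard java.net.URLDecoder. The class name FormDecodeDemo and the helper decode() are made up for illustration; this is not how Tomcat parses parameters internally, only the same percent-decoding done with an explicit charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class: decodes the same wire value with two charsets.
public class FormDecodeDemo {

    // Percent-decodes a form value, interpreting the resulting bytes
    // in the named charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            // Only reached if the charset name is unknown to the JVM.
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String wire = "Andr%C3%A9"; // value of q exactly as the browser sent it

        // Bytes c3 a9 read as UTF-8 form one character (e with acute accent).
        System.out.println("as UTF-8:      " + decode(wire, "UTF-8"));

        // The same bytes read as ISO-8859-1 form two characters
        // (capital A with tilde, then the copyright sign).
        System.out.println("as ISO-8859-1: " + decode(wire, "ISO-8859-1"));
    }
}
```

What the console shows depends on the terminal's own encoding, so the strings are best compared in code rather than by eye.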
> 
> Let's see if we can manage to get that string of bytes some other way.
> 
> public class CharacterEncoding
> {
>     private static final char[] hex = "0123456789abcdef".toCharArray();
> 
>     public static String toByteString(byte[] a)
>     {
>         StringBuilder sb = new StringBuilder(a.length * 3);
> 
>         for(int i=0; i<a.length; ++i)
>         {
>             int high = (a[i] & 0xf0) >> 4;
>             int low  = (a[i] & 0x0f);
> 
>             sb.append(hex[high]);
>             sb.append(hex[low]);
>             sb.append(' ');
>         }
> 
>         return sb.toString();
>     }
> 
>     public static void main(String[] args)
>         throws Exception
>     {
>         String s = "André";
> 
>         System.out.println("Original string: " + s);
>         System.out.println("UTF-8 bytes: " +
>             toByteString(s.getBytes("UTF-8")));
>         System.out.println("ISO-8859-1 bytes: " +
>             toByteString(s.getBytes("ISO-8859-1")));
>     }
> }
> 
> The output of this program is:
> 
> Original string: André
> UTF-8 bytes: 41 6e 64 72 c3 a9
> ISO-8859-1 bytes: 41 6e 64 72 e9
> 
> You may recall that my web browser sent
> 
> 41 6e 64 72 c3 a9
> 
> ...which is the correct UTF-8 byte encoding of "André". The client is
> using UTF-8 but sending a Content-Type without a charset parameter, which
> makes this a client problem IMO.
> 
> The only solution is to use a force-UTF-8-filter when the client fails
> to provide a character encoding along with a request. It's an ugly hack
> and I'm disappointed that the venerable Firefox still has this problem.
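
The decision such a filter has to make can be sketched in plain Java, without the servlet API, so it runs standalone. The class CharsetFallback and its two helpers are hypothetical names. In a real filter the equivalent would presumably be to check whether request.getCharacterEncoding() returns null and, if so, call request.setCharacterEncoding("UTF-8") before anything reads a parameter.

```java
import java.util.Locale;

// Hypothetical helper: models "use the client's charset if it sent one,
// otherwise fall back to a configured default".
public class CharsetFallback {

    // Returns the charset named in a Content-Type header value,
    // or null if the header carries no charset parameter.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // The "force" rule: only applied when the client declared nothing.
    static String effectiveCharset(String contentType, String fallback) {
        String declared = charsetOf(contentType);
        return declared != null ? declared : fallback;
    }

    public static void main(String[] args) {
        // The header from the test above: no charset, so the fallback wins.
        System.out.println(effectiveCharset("application/x-www-form-urlencoded", "UTF-8"));
        // A client that does declare a charset is left alone.
        System.out.println(effectiveCharset(
                "application/x-www-form-urlencoded; charset=ISO-8859-1", "UTF-8"));
    }
}
```

Falling back only when nothing was declared keeps the hack from overriding clients that do send a charset.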
> 
> Let's see what happens when we take the UTF-8 string and interpret it as
> ISO-8859-1:
> 
> Bytes:      41 6e 64 72 c3 a9
> UTF-8:       A  n  d  r     é   (note the é takes two bytes to express)
> ISO-8859-1:  A  n  d  r  Ã  ©
> 
> So, in the absence of any other information, Tomcat is receiving a byte
> string and must interpret it according to the spec, which is to default
> to ISO-8859-1 since there is no charset supplied with the Content-Type.
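
The misreading in the table can be reproduced directly on the bytes. The sketch below uses a made-up class MojibakeDemo. misread() does what the table shows. repair() is an extra not taken from the message above: it reverses the damage, which should work because ISO-8859-1 maps every byte to exactly one character, but it is only safe when the string really was UTF-8 read as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: reproduces and then undoes the misinterpretation.
public class MojibakeDemo {

    // Encodes the text as UTF-8, then reads those bytes as ISO-8859-1.
    static String misread(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Recovers the original bytes via ISO-8859-1 and decodes them as UTF-8.
    // Assumption: the input was produced by exactly the mistake above.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Andr\u00e9";      // "André", written with an escape
        String garbled = misread(original);  // 6 chars instead of 5
        System.out.println(original.length() + " chars -> " + garbled.length() + " chars");
        System.out.println("round trip ok: " + original.equals(repair(garbled)));
    }
}
```

The string literal uses a Unicode escape so the result does not depend on the encoding the source file is compiled with.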
> 
>> I'm curious about the horrendous bug, because I have seen phenomena
>> like this.
> 
> See: not a bug in Tomcat. It's everyone else who's wrong :)
> 
> - -chris
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.10 (MingW32)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> 
> iEYEARECAAYFAksVj90ACgkQ9CaO5/Lv0PA2gACgqnqziMA8J6qwF7RjgekT8YAh
> Dz4AnRFTg95KN0VW7fVmKkxTaDgvDJ9R
> =VUv9
> -----END PGP SIGNATURE-----
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
> For additional commands, e-mail: users-help@tomcat.apache.org
> 
> 

