tomcat-users mailing list archives

From Christopher Schultz <ch...@christopherschultz.net>
Subject Re: Char Encoding text streams on Tomcat 5.5 and Linux
Date Tue, 01 Dec 2009 21:51:25 GMT

André,

On 11/30/2009 7:39 PM, André Warnier wrote:
> Well, just make a simple test :
> (I don't really know how to handle JSP pages, I only do servlets and
> filters, otherwise I'd do it myself).

:)

> -  create a simple html form with a UTF-8 charset, with a simple text
> input box. Give it a method=POST.

Done:

<%@page language="Java" contentType="text/html" pageEncoding="UTF-8" %>
<html>
  <body>
<%
  if("POST".equals(request.getMethod())) {
%>
  <p>
    file.encoding: <%= System.getProperty("file.encoding") %><br />
    ContentType: <%= request.getContentType() %><br />
    Charset: <%= request.getCharacterEncoding() %><br />
  </p>
  <p>
    Received text from client: <%= request.getParameter("q") %>
  </p>
<%
  }
%>
    <form method="POST" accept-charset="UTF-8">
      <input name="q" type="text" value="<%= request.getParameter("q") %>" />

      <input type="submit" />
    </form>
  </body>
</html>

> - then start Tomcat alternatively under a UTF-8 locale, then an
> ISO-8859-1 locale, type some accented characters in your input box, and
> submit the form.

Here's what I get when I submit your name, properly accented, into this
form:

"
file.encoding: UTF-8
ContentType: application/x-www-form-urlencoded
Charset: null

Received text from client: André
"

[Note that this is "Andr" followed by a capital "A" with a tilde (~) on
top of it, followed by a copyright symbol "(c)", as two separate characters.]

file.encoding is already UTF-8 and still this "bug" presents itself.

I tried re-starting Tomcat with LANG=en_US.ISO-8859-1 and I get this result:

"
file.encoding: ANSI_X3.4-1968
ContentType: application/x-www-form-urlencoded
Charset: null

Received text from client: André
"

The output is the same except for file.encoding, which has changed. The
parameter is mis-decoded identically in both situations, so the JVM's
file.encoding is not what determines how Tomcat interprets it.

Note that the client supplied no character encoding along with the
request, and that the form indicates that it accepts only UTF-8
encoding. Here, the client has screwed everything up by not including
the encoding of the form. Here's what gets sent over the wire:

Content-Type: application/x-www-form-urlencoded
Content-Length: 12

q=Andr%C3%A9

Notice that the bytes are "Andr" followed by 0xC3 0xA9.

Let's see if we can manage to get that string of bytes some other way.

public class CharacterEncoding
{
    private static final char[] hex = "0123456789abcdef".toCharArray();

    public static String toByteString(byte[] a)
    {
        StringBuilder sb = new StringBuilder(a.length * 3);

        for(int i=0; i<a.length; ++i)
        {
            int high = (a[i] & 0xf0) >> 4;
            int low  = (a[i] & 0x0f);

            sb.append(hex[high]);
            sb.append(hex[low]);
            sb.append(' ');
        }

        return sb.toString();
    }

    public static void main(String[] args)
        throws Exception
    {
        String s = "André";

        System.out.println("Original string: " + s);
        System.out.println("UTF-8 bytes: "
            + toByteString(s.getBytes("UTF-8")));
        System.out.println("ISO-8859-1 bytes: "
            + toByteString(s.getBytes("ISO-8859-1")));
    }
}

The output of this program is:

Original string: André
UTF-8 bytes: 41 6e 64 72 c3 a9
ISO-8859-1 bytes: 41 6e 64 72 e9

You may recall that my web browser sent

41 6e 64 72 c3 a9

...which is the correct UTF-8 byte encoding of "André". The client is
using UTF-8 but omitting the charset parameter from the Content-Type
header, which makes this a client problem IMO.

The only solution is to use a force-UTF-8-filter when the client fails
to provide a character encoding along with a request. It's an ugly hack
and I'm disappointed that the venerable Firefox still has this problem.
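
To make the fallback rule concrete, here is a minimal stand-alone sketch
of the decision such a filter makes: use the declared charset if there
is one, otherwise assume UTF-8. It is not a servlet filter itself; the
class and method names (FallbackCharset, declaredCharset, decode) are
made up for illustration, and it only handles an unquoted charset
parameter. In a real filter the equivalent step would be calling
request.setCharacterEncoding("UTF-8") when
request.getCharacterEncoding() returns null, before any parameter is
read.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class FallbackCharset
{
    // Hypothetical helper: returns the (unquoted) charset parameter of a
    // Content-Type header value, or null when none is present.
    public static String declaredCharset(String contentType)
    {
        if(contentType == null)
            return null;

        for(String part : contentType.split(";"))
        {
            String p = part.trim();
            if(p.regionMatches(true, 0, "charset=", 0, 8))
                return p.substring(8).trim();
        }

        return null;
    }

    // Decodes a percent-encoded value, assuming UTF-8 when the client
    // declared no charset (an assumption, not what the spec defaults to).
    public static String decode(String encoded, String contentType)
        throws UnsupportedEncodingException
    {
        String charset = declaredCharset(contentType);

        if(charset == null)
            charset = "UTF-8";

        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args)
        throws Exception
    {
        String decoded = decode("Andr%C3%A9",
                                "application/x-www-form-urlencoded");

        // \u00e9 is the e-acute, written as an escape so the result does
        // not depend on the source file's encoding.
        System.out.println(decoded.equals("Andr\u00e9"));
    }
}
```

With the request shown earlier (no charset in the Content-Type), this
prints "true"; a client that does declare a charset is left alone.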

Let's see what happens when we take the UTF-8 string and interpret it as
ISO-8859-1:

Bytes:      41 6e 64 72 c3 a9
UTF-8:       A  n  d  r     é   (note the é takes two bytes to express)
ISO-8859-1:  A  n  d  r  Ã  ©

So, in the absence of any other information, Tomcat is receiving a byte
string and must interpret it according to the spec, which is to default
to ISO-8859-1 since there is no charset supplied with the Content-Type.
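
The same misreading can be reproduced outside Tomcat. The sketch below
takes the six bytes that came over the wire and decodes them once as
ISO-8859-1 and once as UTF-8; the class name Misdecode and its two
helper methods are made up for illustration, and it assumes Java 7 or
later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

public class Misdecode
{
    // One character per byte: this is the spec-default interpretation.
    public static String asLatin1(byte[] bytes)
    {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Multi-byte aware: this is what the client actually meant.
    public static String asUtf8(byte[] bytes)
    {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args)
    {
        // The bytes sent by the browser: 41 6e 64 72 c3 a9
        byte[] wire = { 0x41, 0x6e, 0x64, 0x72, (byte)0xc3, (byte)0xa9 };

        String latin1 = asLatin1(wire);
        String utf8   = asUtf8(wire);

        System.out.println(latin1.length());  // 6 characters
        System.out.println(utf8.length());    // 5 characters

        // \u00c3 is A-tilde, \u00a9 is the copyright sign, \u00e9 is e-acute
        System.out.println(latin1.equals("Andr\u00c3\u00a9"));
        System.out.println(utf8.equals("Andr\u00e9"));
    }
}
```

Both comparisons print "true": the bytes are fine, and only the charset
used to read them differs.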

> I'm curious about the horrendous bug, because I have seen phenomenons
> like this.

See: not a bug in Tomcat. It's everyone else who's wrong :)

-chris
