tomcat-users mailing list archives

From Je suis la poubelle <laps...@gmail.com>
Subject Re: Tomcat 5 and UTF-8
Date Thu, 02 Apr 2009 17:30:28 GMT
On Fri, Mar 27, 2009 at 5:34 PM, Christopher Schultz <
chris@christopherschultz.net> wrote:

> Oscar,
>
> On 3/27/2009 10:35 AM, Je suis la poubelle wrote:
> > 1. In those mentioned web pages, I noticed that none of them explicitly
> > specified the following HTML header:
> > <head>
> > <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
> > </head>
>
> That's because setting a META tag that doesn't match reality is not
> really a good idea. I can set the charset to Shift_JIS in the META tag
> but it doesn't mean the page is actually in Japanese.


     I don't see your point.

     Declaring a charset/encoding tells the client how to turn the bytes it
receives into characters.  It's not just a matter of language.  If a charset
declared in the META tag means nothing to you, the same argument applies to
a charset declared in the HTTP header.


> > And what if another encoding is specified in HTML header, say
> > ISO-8859-1?  Which one would the browser use in priority?  Nobody knows
> > the answer!
>
> Actually, everybody knows the answer, because it's published in the HTML
> specification: http://www.w3.org/TR/html4/charset.html#h-5.2.2
>
> "
> To sum up, conforming user agents must observe the following priorities
> when determining a document's character encoding (from highest priority
> to lowest):
>
>   1. An HTTP "charset" parameter in a "Content-Type" field.
>   2. A META declaration with "http-equiv" set to "Content-Type" and a
>      value set for "charset".
>   3. The charset attribute set on an element that designates an
>      external resource.
> "
>

     Yes, yes, but that is the answer in theory, not necessarily in
practice.  A browser or a server with a bug will not follow the
specification.
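
     For reference, the priority order quoted above can be sketched in a
few lines of Java.  This is only an illustration of the rule as the
specification states it, not of what any particular browser does.  The class
name CharsetPriority, the method names and the fallback parameter are all
made up for this example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetPriority {
    // Finds "charset=..." inside a Content-Type value such as
    // "text/html; charset=utf-8" (hypothetical helper, simplified parsing).
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    public static Optional<String> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Applies the three priorities from the quoted list, highest first.
    // "fallback" is an assumption of this sketch: what to use when no
    // source declares a charset.
    public static String resolve(String httpContentType, String metaContentType,
                                 String linkCharsetAttr, String fallback) {
        Optional<String> http = charsetOf(httpContentType);   // 1. HTTP header
        if (http.isPresent()) {
            return http.get();
        }
        Optional<String> meta = charsetOf(metaContentType);   // 2. META declaration
        if (meta.isPresent()) {
            return meta.get();
        }
        if (linkCharsetAttr != null && !linkCharsetAttr.isEmpty()) {
            return linkCharsetAttr;                           // 3. charset attribute
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Header and META disagree: the header wins.
        System.out.println(resolve("text/html; charset=ISO-8859-1",
                "text/html; charset=utf-8", null, "ISO-8859-1"));
        // Header carries no charset: the META declaration is used.
        System.out.println(resolve("text/html",
                "text/html; charset=utf-8", null, "ISO-8859-1"));
    }
}
```

     The second call is the case where a charset in the HTML header makes a
difference: the server sends a Content-Type without a charset parameter.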


> > That's why I specify the encoding in both places.
>
> While it's not a terrible idea to specify the encoding in both places,
> you should consider the possibility that the META tag can be wrong.


     It's not only "not a terrible idea", it's a good habit.  It's the
principle of the double check: there would be no point in double-checking if
everything always worked as expected.  We double-check because in practice
we make errors.

     A good programmer should never leave anything to chance.  That's why
it's good to set the charset in both the HTTP header and the HTML header.


> > 2. To make things easier for myself, I always save JSP files in UTF-8
> > encoding, and I always put this header as well:
> > <%@ page pageEncoding="utf-8" %>
> > Now everything's in UTF8 from A to Z.
>
> If you're following guidelines for i18n, you'll put your non-ASCII
> strings into property files and won't have to worry about the encoding
> of the JSP source file.


     Yes, but again, you're speaking from a theoretical viewpoint.  What if
I want to create a small, quick JSP just for some tests?  I won't go and
change everything.  The simplest is to change one thing: my JSP file.

     Another situation: what if you don't have full access to all the
files?  If Tomcat is on your own computer, it's taken for granted that you
can do everything.  But what if you're developing a JSP site and have it
hosted on some Internet server?  Are you sure you still have full access?
And as the recommendation you linked to says, some servers might not send
the charset in the HTTP header.  You'd better also set the charset in the
HTML header.

     One more example: you're testing a single JSP file in your own
corner.  Everything works perfectly.  Then you move the file to another
server.  In this situation, it's better to have the file self-contained.


> > String sUTF8  = new String(sWrongEncoding.getBytes("iso-8859-1"), "UTF8");
>
> I think that should be "rightString", not "sUTF8", since the String
> object has no inherent encoding.


     Not true.  A Java String inherently uses UTF-16.  If you're that picky
about the name, you'd better call it
latin1StringConvertedBackToUTF8BeforeConvertedBackToUTF16 ... but this is
getting ridiculous.
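
     For anyone who wants to try the quoted conversion, here is a minimal
self-contained sketch.  It assumes the wrong string came from UTF-8 bytes
that were decoded as ISO-8859-1, which is the situation the one-liner above
is meant to undo.  The class name Reencode and the methods misdecode and
repair are made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class Reencode {
    // Simulates the fault: UTF-8 bytes decoded as if they were ISO-8859-1.
    public static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Same conversion as the quoted line: recover the original bytes by
    // encoding as ISO-8859-1, then decode those bytes as UTF-8.
    public static String repair(String wrong) {
        byte[] raw = wrong.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";          // "cafe" with e-acute, 4 chars
        String wrong = misdecode(original);     // e-acute became two chars
        String fixed = repair(wrong);

        System.out.println(original.length());      // 4
        System.out.println(wrong.length());         // 5
        System.out.println(fixed.equals(original)); // true
    }
}
```

     The round trip is lossless only because ISO-8859-1 maps every byte
value to exactly one character.  With another wrong charset the original
bytes may already be lost.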
