tomcat-users mailing list archives

From André Warnier
Subject Re: Encoding problem with Tomcat (hibernate) + Postgres
Date Wed, 24 Feb 2010 11:35:15 GMT
davefu wrote:
> Hi, this is my setup:
> - Debian Lenny 
> - Tomcat 5.5
> - Postgres 8.3 
> I'm running an app which is failing every time it tries to get some data from
> the DB with characters like [ÁÉÍÓÚáéíóú]. By "failing" I mean the
> application isn't showing the data it should when Tomcat sends queries to
> Postgres.
There is no problem with your English, you are doing fine.

Where is it showing this (wrong) data ?
Do you mean in the result page which you see in the browser ?
Have you saved this page to disk, and examined it with an editor, to see 
what is really contained in that html file ?

I don't know what the exact problem might be, but let me give you (1) my 
sympathy, because these problems are usually horrible, and (2) a 
recommendation first of all, before even starting to decipher the problem :

You have to question *everything* you see, and not take anything for 
granted.

For example, when you edit a logfile with an editor, you have to 
question whether what you see on the screen through this editor, is 
really what is in the logfile, byte-by-byte.
If possible, use a non-UTF8 locale, and an editor which is /not/ UTF-8 
aware, and really shows you the /bytes/ in the logfile, rather than the 
/characters/.  It may be less readable, but at least you will be sure 
that you really see what is in the logfile.
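
One way to look at the bytes without trusting any editor or terminal is 
to print them as hexadecimal. What follows is only a sketch: the class 
name HexDump and its toHex helper are made up for illustration, and it 
assumes Java 7 or later (for StandardCharsets and Files).

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HexDump {
    // Render each byte as two hex digits, separated by spaces.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // With a file argument, dump that file; otherwise dump a sample
        // character (capital E acute) encoded as UTF-8.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "\u00c9".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(data));  // sample prints: c3 89
    }
}
```

If the logfile contains "c9" where you expect a capital E acute, it was 
written as ISO-8859-1; if it contains "c3 89", it was written as UTF-8.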

Then you have to ask yourself the question : does the program which 
writes the logfile, write it "as characters" using a UTF-8 encoding, or 
not ? (it is not as evident as one may think at first)

This may all sound silly, but I can guarantee that if you do not ask 
yourself these questions first, and at every step, you may be going down 
the wrong track when trying to understand what is going on.

For example, when you say that you see this in the logfile :

is that first "capital A tilde" really one byte /in the logfile/, or is 
it itself already the 2-byte UTF-8 encoding of the "capital A tilde" 
character, which you see on your screen as a single A tilde character, 
because your editor and your locale have conspired to translate it 
visually ?
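
To make the "one byte or two" question concrete, here is a small sketch 
of what happens when UTF-8 bytes are read back as ISO-8859-1, once and 
then a second time. The class and method names (Mojibake, misread) are 
hypothetical, and this is not a reconstruction of what your application 
does; it only shows the general mechanism.

```java
import java.nio.charset.StandardCharsets;

public class Mojibake {
    // Encode text as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    public static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00e1";        // small a acute, one character
        String once = misread(original);   // capital A tilde + inverted "!"
        String twice = misread(once);      // each of those doubled again
        System.out.println(original.length() + " " + once.length()
                + " " + twice.length());   // prints: 1 2 4
    }
}
```

Each wrong round trip doubles the length of every accented character, 
which is why counting bytes tells you how many times the text has been 
re-encoded.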

(and, when you post it here, has it been re-encoded one more time by the 
  email program ? ;-) )

(Note that it also looks like, in the snippet above, you have more than 
just UTF-8 encoding going on; there are also "entities" such as 
"‰". Where do these come from ?)

Next, when you change a setting which has an impact on the encoding, do 
it one change at a time, and double-check the result along the lines 
above.  Some of the things which may have an impact are :
- the default system locale
- the "locale" of the process which is running the database
- the encoding settings of the database itself
- the "locale" of the process under which Tomcat is running
- whether or not the application "streams" which are used to communicate 
with the database use the "default platform encoding", or have a 
specific encoding specified when opening the stream
- the locale of the process you are using to run the editor which you 
use to look at the logfiles
- the settings of that editor
- and probably quite a few others which I forget

Each one of these may cause some intermediate translation which is not 
evident at first.
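
For the JVM side of that list, a quick way to see what the process 
actually inherited is to print it from inside the JVM. This is a 
sketch with a made-up class name (EncodingReport); run it under the 
same user and environment as Tomcat, otherwise it tells you nothing 
about Tomcat.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class EncodingReport {
    // Collect the encoding-related settings visible to this JVM.
    public static String report() {
        return "default charset : " + Charset.defaultCharset().name() + "\n"
             + "file.encoding   : " + System.getProperty("file.encoding") + "\n"
             + "default locale  : " + Locale.getDefault() + "\n"
             + "LANG            : " + System.getenv("LANG") + "\n"
             + "LC_ALL          : " + System.getenv("LC_ALL");
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

On the database side, the PostgreSQL statements SHOW server_encoding 
and SHOW client_encoding give the corresponding information for a 
given connection.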

Another example : XML files have, by definition, a "default encoding" 
which is UTF-8, or else the encoding is specified in the leading XML 
declaration.  XML parsers know this, and will read the file in the 
appropriate encoding. The same is probably true for HTML parsers 
(although in that case the default should be ISO-8859-1).
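So for example a JSP document in XML syntax that Tomcat uses will be 
read correctly (a classic JSP page needs a pageEncoding declaration; 
without one it is read as ISO-8859-1).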
But the same is not true for sockets that Tomcat may open to talk to 
some external software.  If such a socket is opened without specifying 
an encoding, then it will default to the "default platform encoding", 
which in the case of Tomcat is the encoding of the process running the 
JVM which runs Tomcat.
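
Strictly speaking a socket only carries bytes; the encoding comes in 
when a Reader or Writer is wrapped around its streams. The sketch 
below names the charset explicitly instead of relying on the platform 
default. It uses in-memory byte streams as a stand-in for 
socket.getOutputStream() and socket.getInputStream(), and the class 
name ExplicitCharset is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Write text through a Writer whose charset is stated explicitly.
    public static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Read the bytes back through a Reader with the same explicit charset.
    public static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] wire = encode("\u00c1\u00e9");
        // Two accented characters become four bytes in UTF-8.
        System.out.println(wire.length + " bytes, round trip ok: "
                + decode(wire).equals("\u00c1\u00e9"));
    }
}
```

With new OutputStreamWriter(out) and no second argument, the same code 
would silently depend on the locale of the process that started the JVM.
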
And the same is also not true for anything that goes over the HTTP 
protocol.  There the default is ISO-8859-1, unless explicitly specified 
otherwise.
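
The HTTP rule can be sketched as a small function: take the charset 
from the Content-Type header if there is one, otherwise fall back to 
ISO-8859-1. The class and method names (HttpCharset, charsetOf) are 
made up, and the parsing is deliberately simplistic; it illustrates 
the rule and is not how Tomcat implements it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HttpCharset {
    // Return the charset named in a Content-Type value, or ISO-8859-1
    // when none is given (or the name is unknown to this JVM).
    public static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported name: use the default.
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8")); // UTF-8
        System.out.println(charsetOf("text/html"));                // ISO-8859-1
    }
}
```

In a servlet, the equivalent of "explicitly specified" is a call to 
request.setCharacterEncoding(...) before the first parameter is read, 
and a charset on the response content type.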

So yes, it is a mess, be prepared.
But it is also interesting.
