tomcat-users mailing list archives

From Nikola Milutinovic <>
Subject Re: ++ Best practive ?? ++ (JSP-->Servlet-->Database) character encoding.
Date Wed, 01 Sep 2004 10:28:38 GMT
Ben Bookey wrote:

> Dear list,
> We have a web-based JSP/servlet application performing updates, deletes and
> inserts into an Oracle database, running on Tomcat 5. We want to support
> both American and European customer client locales, so we want to use either
> ISO-8859-15 or UTF-8. But we are having problems saving the Euro symbol when
> using ISO-8859-15 encoding.

Since you have to support multiple character sets, it would be cleaner 
if you chose UTF-8 for your DB, in the first place. I do realise that 
data conversion can be a tremendous task, so your mileage may vary.

> I had previously assumed that because Java works with Unicode by default,
> all data entered in an HTML form would therefore be saved as UTF-8 into
> the database (i.e. as soon as a value is assigned to a Java data object,
> e.g. String or int). I am beginning to think this is not the case, and that
> all data is saved in the database based on the original encoding as posted
> by the browser. Please can someone explain what is really going on? Do I
> need to have some code which checks the browser encoding in the HTTP
> header, and then converts/parses accordingly to a chosen standard? This would
> then avoid the situation where our database could end up containing records
> in different character encodings, which I suspect is what is now
> happening.

First of all, Tomcat, being a Java-based application, uses Unicode 
internally. A JSP page can specify its *output* encoding, and it should 
match whatever the browser expects. Tomcat *should* (I haven't checked) 
output HTTP headers to match the declared encoding. Additionally, you as 
a web page designer may specify a <meta ...> tag to set your own 
encoding, but it will be ignored if the web server (Tomcat in this case) 
sets an HTTP header for character encoding. I've seen that on Apache: it 
kept forcing our Windows-1250 pages to ISO-8859-1. The path of your 
data, in the display case, is:


The JDBC driver should transform data into Unicode correctly, provided 
the DB encoding is set correctly and the stored data really is in that 
encoding. This simply means that you cannot put, for instance, 
Windows-1250 data into a Latin-1 database and expect it to come out OK. 
The JVM will then try to convert Unicode into the requested output 
encoding; if a character cannot be represented there, it will read "?".
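
To make the "?" behaviour concrete, here is a minimal sketch in plain Java (no Tomcat or JDBC involved). The class and method names are made up for illustration, and it assumes the JVM ships the ISO-8859-15 charset, which standard JDKs do. It encodes the Euro sign into three encodings and prints the resulting bytes as hex:

```java
import java.nio.charset.Charset;

class EuroEncodingDemo {
    // Encode the text in the named charset and return the bytes as upper-case hex.
    // String.getBytes(Charset) silently substitutes the charset's replacement
    // byte (usually '?') for characters the charset cannot represent.
    static String encodeHex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // the Euro sign as a Unicode escape
        System.out.println(encodeHex(euro, "ISO-8859-1"));  // 3F, i.e. "?"
        System.out.println(encodeHex(euro, "ISO-8859-15")); // A4
        System.out.println(encodeHex(euro, "UTF-8"));       // E282AC
    }
}
```

Latin-1 (ISO-8859-1) has no Euro sign at all, so if any step in the chain is really running as Latin-1 rather than ISO-8859-15, the Euro is lost at that step.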

For the input path, the situation is similar, with one catch. Not only 
can (and should) the JSP or HTML page holding the form have a character 
encoding, but the HTML form itself can have an encoding specified. Logic 
would suggest that if it is not specified, it should be inherited from 
the page. Logic fails on some browsers, so it would be prudent to 
specify it on the form as well.

The last step is informing the processing Servlet/JSP of the character 
encoding of the incoming data. That should be done by the browser in the 
case of a POST request; I'm not sure what happens for GET requests. The 
browser should set the HTTP headers of its request (form data 
submission) accordingly. Of course there is a slight difference between 
"should", "must" and "will". :-)
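
Since the browser may not tell you, a defensive approach is to take the charset from the request's Content-Type header when it is there and fall back to the encoding you served the form in. The following is only a sketch in plain Java: the class and method names are hypothetical, and it stands in for what a servlet container does when it decodes form parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Locale;

class FormDecodingDemo {
    // Return the charset parameter of a Content-Type header value,
    // or the fallback when the header is missing or names no charset.
    static String charsetOf(String contentType, String fallback) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    return p.substring("charset=".length()).replace("\"", "").trim();
                }
            }
        }
        return fallback;
    }

    // Decode a URL-encoded form value using the given charset.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String raw = "%E2%82%AC"; // a Euro sign submitted from a UTF-8 page
        String cs = charsetOf("application/x-www-form-urlencoded", "UTF-8");
        System.out.println(decode(raw, cs).equals("\u20AC"));           // true
        // The same bytes read as Latin-1 become three wrong characters.
        System.out.println(decode(raw, "ISO-8859-1").length());         // 3
    }
}
```

In a real servlet the corresponding call is request.setCharacterEncoding("UTF-8"), made before the first request.getParameter(...); once parameters have been read, setting it has no effect.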

> In addition, how does TC deal with framesets containing many HTML pages? Are
> they all treated individually (in theory allowing a different character
> encoding to be used in each HTML frame), or as one unit?

TC deals with frames just as any other web server does - it doesn't. 
HTML frames are a client-side construct. Web servers don't care about 
them and do not notice them, just as they don't care about multiple 
images in one single HTML page. A browser may request them after it has 
gotten the page, or it may simply ignore them - the web server doesn't 
care. It will answer ANY request, be it for HTML, JPEG or GIF, provided 
the request is valid.

