httpd-users mailing list archives

From André Warnier ...@ice-sa.com>
Subject Re: [users@httpd] Wrong charset convert SOLVED
Date Wed, 01 Jul 2009 19:56:44 GMT
Jiří Eichler wrote:
> I didn't program MediaWiki, but on Wikipedia it seems to be working 
> well. I just realize that we haven't solved that problem with charset, I 
> have just changed charset sent by php ... you're right with "double 
> encoding" to utf-8, Apache/php think that it is something else and 
> encode it once more. But how can we tell php that it is in utf-8? I 
> don't know. :-D    Can it be called bug when there is no way to detect 
> charset of uploaded filename?
> 
Well...
One basic problem is that there are "holes" in the HTTP 1.x 
specification, at least when considering the multi-lingual, increasingly 
Unicode-centric world in which we are living.
The next problem is that browsers do not always respect even the HTTP 
1.x specification.
The third problem is that not all browsers fail to respect it in the 
same way (but they are getting better at this).
The next issue is that, the WWW being what it is, with clients that the 
server does not control, you can never be sure of anything.
The next issue is that programming languages like PHP do not 
necessarily offer very good tools to "mark" a string as being in any 
particular encoding.
Another issue is that it is relatively easy to check if a received text 
is valid UTF-8, but it is very hard to check if a received text is valid 
iso-8859-1 or iso-8859-2 or cp-1250, or any other 8-bit character set, 
because almost any sequence of bytes is "valid" in those; 
and it is even harder to find out which one of them it is.
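
To illustrate that last point, here is a minimal sketch in Python (not 
PHP, and not taken from anyone's application; the function name is made 
up for this example). It shows that a strict decode is enough to test 
for valid UTF-8, while the 8-bit encodings of the same name cannot be 
told apart from the bytes alone.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# The same name, encoded three different ways.
as_utf8 = "Jiří".encode("utf-8")
as_latin2 = "Jiří".encode("iso-8859-2")
as_cp1250 = "Jiří".encode("cp1250")

print(is_valid_utf8(as_utf8))    # True
print(is_valid_utf8(as_latin2))  # False: 0xF8 cannot start a UTF-8 sequence

# For this string, the two 8-bit encodings give identical bytes,
# so nothing in the data says which charset was meant.
print(as_latin2 == as_cp1250)    # True

# iso-8859-1 accepts any byte, so decoding "works" but yields the wrong text.
print(as_latin2.decode("iso-8859-1"))  # Jiøí
```

Note that the test only goes one way: a text that fails is certainly 
not UTF-8, and a text that passes is very probably UTF-8, but a text 
made of plain US-ASCII characters passes in every one of these charsets.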

And one overall issue is that it is not always easy to change any of 
the above without suddenly breaking many WWW applications.

Taking all of the above into account, however, there are some things 
you can do in your applications to minimise the consequences.

The first thing is to be correct, consistent, and precise in what you 
send to the browser.
("Be strict in what you send, and tolerant in what you receive")

So if you have chosen Unicode/UTF-8 for your basic charset and encoding 
(the best choice nowadays), make sure that :
- each time your server sends some text page to the client, there is a 
proper "Content-type: xxx/yyyy; charset=utf-8" HTTP header with the 
response (see *1 below)
- each time your server sends some HTML or XML page to the client, the 
page has an explicit charset declaration inside
- your pages really *are* encoded in UTF-8. Always verify this: someone 
may have been editing your pages using an old editor, which knows 
only iso-latin-2 or cp-1250.
- when you send a <form> to the client, specify the
accept-charset="utf-8" attribute in the <form> tag
- when you send a <form> to the client, which will be later submitted 
back, include some
<input name="test-encoding" type="hidden" value="xxxxxxxxxx">
where "xxxxxxxxxx" is a valid UTF-8 string containing non US-ASCII 
characters.
Then, in the script that receives the data from this form, test this 
parameter, to see if what you received is indeed UTF-8 or not.
The way to do that varies depending on the programming language.
(Maybe you can compare the length in bytes and/or the length in 
characters, or compare it with an internal identical string known to be 
UTF-8.)
- be "defensive" in your cgi-bin scripts. Everything you receive from 
the client is suspect.
- never forget that on the WWW, "the client is king". The user /can/ 
change the charset of his browser, no matter what the server tells it.
(Firefox 3.1 : View..Character encoding; IE 7 : same)
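
As an illustration of the hidden "test-encoding" parameter idea above, 
here is a minimal sketch in Python (the way to do it in PHP will 
differ). Only the field name comes from the form above; the sentinel 
value, the function name, the returned labels and the list of 8-bit 
charsets are assumptions made for this example. It also assumes that 
your script can get at the raw bytes of the parameter, before anything 
has tried to decode them.

```python
# Hypothetical sentinel: any short string with non US-ASCII characters will do.
SENTINEL = "Jiří"
EXPECTED = SENTINEL.encode("utf-8")

# 8-bit charsets that we guess a browser might have used (an assumption).
LEGACY_CHARSETS = ("cp1250", "iso-8859-2", "iso-8859-1")


def classify_sentinel(raw: bytes) -> str:
    """Guess how the raw bytes of the 'test-encoding' parameter were encoded."""
    if raw == EXPECTED:
        return "utf-8"
    for charset in LEGACY_CHARSETS:
        # Case 1: the browser submitted the form in an 8-bit charset.
        try:
            if raw.decode(charset) == SENTINEL:
                return "legacy-8bit"
        except UnicodeDecodeError:
            pass
        # Case 2: correct UTF-8 was misread as 8-bit text and encoded again.
        try:
            if raw.decode("utf-8").encode(charset) == EXPECTED:
                return "double-encoded"
        except (UnicodeDecodeError, UnicodeEncodeError):
            pass
    return "unknown"


print(classify_sentinel(EXPECTED))                                  # utf-8
print(classify_sentinel(SENTINEL.encode("cp1250")))                 # legacy-8bit
print(classify_sentinel(EXPECTED.decode("cp1250").encode("utf-8"))) # double-encoded
```

If the answer is anything other than "utf-8", the other parameters of 
the same form were most probably mangled in the same way, and the 
script can either repair them or refuse the request.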



(*1) :
When I use your PHP upload page, the response page that I get from your 
server has these HTTP headers:
HTTP/1.1 200 OK
Date: Wed, 01 Jul 2009 19:44:31 GMT
Server: Apache/2.2.11 (Win32) PHP/5.2.8
X-Powered-By: PHP/5.2.8
Content-Length: 716
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=windows-1250


However, the HTML page itself contains:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

That is /not/ consistent.
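
This kind of mismatch can be checked automatically. Below is a rough 
sketch in Python, using the header and the <meta> tag quoted above as 
test data. The function names are invented for this example, and the 
regular expressions are deliberately simple: they are not a real HTML 
parser, and they do not know that some charset names are aliases of 
each other.

```python
import re

# Matches charset=xxx, with or without quotes around the value.
_CHARSET_RE = re.compile(r"""charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE)
_META_RE = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)


def charset_from_header(content_type: str):
    """Return the lower-cased charset of a Content-Type header value, or None."""
    match = _CHARSET_RE.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_meta(html: str):
    """Return the lower-cased charset of the first <meta> tag declaring one."""
    for tag in _META_RE.finditer(html):
        match = _CHARSET_RE.search(tag.group(0))
        if match:
            return match.group(1).lower()
    return None


def is_consistent(content_type: str, html: str) -> bool:
    """True only if both declarations exist and name the same charset."""
    header = charset_from_header(content_type)
    return header is not None and header == charset_from_meta(html)


page = '<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />'
print(is_consistent("text/html; charset=windows-1250", page))  # False
print(is_consistent("text/html;charset=UTF-8", page))          # True
```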

On the other hand, the index page received after you click on the /data 
link has the following HTTP headers:

HTTP/1.1 200 OK
Date: Wed, 01 Jul 2009 19:54:01 GMT
Server: Apache/2.2.11 (Win32) PHP/5.2.8
Content-Length: 264
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: text/html;charset=UTF-8




