cocoon-docs mailing list archives

From da...@cocoon.zones.apache.org
Subject [DAISY] Created: How to configure UTF-8 encoding for I18N everywhere
Date Fri, 11 May 2007 09:49:06 GMT
A new document has been created.

http://cocoon.zones.apache.org/daisy/documentation/1366.html

Document ID: 1366
Branch: main
Language: default
Name: How to configure UTF-8 encoding for I18N everywhere
Document Type: Cocoon Document
Created: 5/11/07 9:48:51 AM
Creator (owner): Alexander Klimetschek
State: draft

Parts
=====

Content
-------
Mime type: text/xml
Size: 21105 bytes
Content:
<html>
<body>

<h2 id="head-7be1dfafacbc6fb8e02d38cb177abb4a2030defc">How to configure UTF-8
encoding for I18N everywhere</h2>

<p>The most robust approach to internationalization is to handle everything in
UTF-8, since it can represent every Unicode character and is supported
practically everywhere. "Everything" means the server side (backend, XML), HTTP
requests and responses, and the client side with forms and dojo.io.bind.</p>

<h4 id="head-b0e1772fd963c0cc72ccf58d5cada0c5797046c0">1. Sending all pages in
UTF-8</h4>

<p>You need to configure Cocoon's serializers to use UTF-8. This applies to both
the XML serializer (<tt>&lt;serialize type="xml" /&gt;</tt>) and the HTML
serializer (<tt>&lt;serialize type="html" /&gt;</tt>). To support all browsers,
you must state the encoding used for the response body and also include a meta
tag in the HTML:
<tt>&lt;meta http-equiv="Content-Type" content="text/html; charset=UTF-8"&gt;</tt>.
This is very important, because the browser will then send form requests encoded
in UTF-8. Browsers normally do not mention the encoding in the request, so you
have to assume they are doing it right. Here is the configuration for the
serializer components in your sitemaps that does this:</p>

<pre>&lt;serializer name="xml" mime-type="text/xml"
  src="org.apache.cocoon.serialization.XMLSerializer"&gt;
  &lt;encoding&gt;UTF-8&lt;/encoding&gt;
&lt;/serializer&gt;

&lt;serializer name="html" mime-type="text/html; charset=UTF-8"
  src="org.apache.cocoon.serialization.HTMLSerializer"&gt;
  &lt;encoding&gt;UTF-8&lt;/encoding&gt;

  &lt;!-- the following common doctype is only included for completeness, it has no impact
on encoding --&gt;
  &lt;doctype-public&gt;-//W3C//DTD HTML 4.01 Transitional//EN&lt;/doctype-public&gt;
  &lt;doctype-system&gt;http://www.w3.org/TR/html4/loose.dtd&lt;/doctype-system&gt;
&lt;/serializer&gt;
</pre>

<h4 id="head-58d760c1af59400884a162695db4e7ab167ec0ac">2. AJAX Requests with
CForms/Dojo</h4>

<p>If you use CForms with Ajax enabled, Cocoon uses dojo.io.bind() under the
hood, which creates XMLHttpRequests that POST the form data to the server. Here
Dojo chooses the encoding by default, which does not match the browser's
behaviour of using the charset defined in the META tag. You can easily tell Dojo
which encoding to use for all dojo.io.bind() calls: include the following at the
top of your HTML pages, before dojo.js is included:</p>

<pre>&lt;script&gt;djConfig = { bindEncoding: "utf-8" };&lt;/script&gt;
</pre>

<p>If you already have other djConfig options, simply add the
<tt>bindEncoding</tt> property to the existing object.</p>

<h4 id="head-2e259f641e7e2f53c8cafa65d6863e85602340c8">3. Decoding incoming
requests: Servlet Container</h4>

<p>When the browser sends data to your server, e.g. form data, your servlet
container creates the <tt>ServletRequest</tt> and has to decode the parameters
correctly into Java Strings. If the encoding is specified in the HTTP request
header, the container will use it, but unfortunately this is typically not the
case. When the browser sends a form post, it only states
<tt>application/x-www-form-urlencoded</tt> in the header. So you have to assume
an encoding here, and the right thing to assume is the encoding of the page you
originally sent to the browser.</p>
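
<p>The effect of assuming the wrong encoding can be reproduced outside a servlet
container. The following is a minimal sketch using only the JDK (Java 10 or
later for the <tt>Charset</tt> overloads). The class and method names are made
up for illustration, and <tt>URLEncoder</tt>/<tt>URLDecoder</tt> merely stand in
for what a browser and a container do with an
<tt>application/x-www-form-urlencoded</tt> body:</p>

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;

// Hypothetical demo class, not part of Cocoon or any servlet container.
public class FormEncodingDemo {

    // Percent-encode a form value the way a browser would, using the
    // charset of the page that contained the form.
    static String browserEncode(String value, String pageCharset) {
        return URLEncoder.encode(value, Charset.forName(pageCharset));
    }

    // Decode the value the way a container would, using whatever charset
    // it assumes for the request.
    static String containerDecode(String encoded, String assumedCharset) {
        return URLDecoder.decode(encoded, Charset.forName(assumedCharset));
    }

    public static void main(String[] args) {
        // \u00fc is the letter u with umlaut; escapes keep this file ASCII-only.
        String wire = browserEncode("M\u00fcller", "UTF-8");
        System.out.println(wire); // prints M%C3%BCller

        String right = containerDecode(wire, "UTF-8");
        String wrong = containerDecode(wire, "ISO-8859-1");
        System.out.println(right.equals("M\u00fcller"));       // prints true
        System.out.println(wrong.equals("M\u00c3\u00bcller")); // prints true
    }
}
```

<p>The same bytes on the wire come out as "Müller" or as "MÃ¼ller" depending
only on the charset the receiver assumes, which is why the assumption has to
match the encoding of the page that contained the form.</p>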

<p>The servlet standard says that the default encoding for incoming requests
should be ISO-8859-1 (Jetty deviates from the standard here and assumes UTF-8 by
default). So to make sure UTF-8 is used for parameter decoding, you have to set
that encoding explicitly on the request. This is done by calling
<tt>ServletRequest.setCharacterEncoding()</tt>. To do that for all your
requests, you can use a servlet filter like this one:
<a href="http://wiki.apache.org/cocoon/SetCharacterEncodingFilter">SetCharacterEncodingFilter</a>.
</p>

<p>Then you add the filter to the web.xml:</p>

<pre>&lt;filter&gt;
  &lt;filter-name&gt;Set Character Encoding&lt;/filter-name&gt;
  &lt;filter-class&gt;filters.SetCharacterEncodingFilter&lt;/filter-class&gt;
  &lt;init-param&gt;
    &lt;param-name&gt;encoding&lt;/param-name&gt;
    &lt;param-value&gt;UTF-8&lt;/param-value&gt;
  &lt;/init-param&gt;
&lt;/filter&gt;

&lt;!-- either mapping to URL pattern --&gt;

&lt;filter-mapping&gt;
  &lt;filter-name&gt;Set Character Encoding&lt;/filter-name&gt;
  &lt;url-pattern&gt;/*&lt;/url-pattern&gt;
&lt;/filter-mapping&gt;

&lt;!-- or mapping to your Cocoon servlet (the servlet-name might be different) --&gt;

&lt;filter-mapping&gt;
  &lt;filter-name&gt;Set Character Encoding&lt;/filter-name&gt;
  &lt;servlet-name&gt;CocoonBlocksDispatcherServlet&lt;/servlet-name&gt;
&lt;/filter-mapping&gt;

</pre>

<p>Since the filter element was added in the servlet 2.3 specification, you need
at least version 2.3 in your web.xml. Using the current 2.4 version is better,
as it is the standard for Cocoon web applications. For 2.4 you use an XSD
schema:</p>

<pre>&lt;web-app version="2.4"
         xmlns="http://java.sun.com/xml/ns/j2ee"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/ns/j2ee/web-app_2_4.xsd"&gt;
</pre>

<p>For 2.3 you need to modify the DOCTYPE declaration in the web.xml:</p>

<pre>&lt;!DOCTYPE web-app
    PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
    "http://java.sun.com/dtd/web-app_2_3.dtd"&gt;
</pre>

<h4 id="head-535456e18a08cddc5006ce8cec7a6317373f0420">4. Setting Cocoon's
encoding (especially CForms)</h4>

<p>To tell Cocoon to use UTF-8 internally, you have to set two properties:</p>

<pre>org.apache.cocoon.containerencoding=utf-8
org.apache.cocoon.formencoding=utf-8
</pre>

<p>They need to be in some <tt>*.properties</tt> file under
<tt>META-INF/cocoon/properties</tt> in one of your blocks.</p>

<h4 id="head-5f8a0d453df12b9e2ea517bca5c8d03baa9ba131">5. XML Files</h4>

<p>This is normally not a problem, since the default encoding for XML files is
UTF-8. However, they should always start with the following XML declaration,
which should prompt your XML editor to save them in UTF-8 (most editors appear to
do that, so there should not be a problem here).</p>

<pre>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
</pre>

<h4 id="head-e552435cf6a174a6eefbebb1baf13b99e3074abd">6. Special Transformers
</h4>

<p>The standard XSLT transformers and most others work on SAX events, which are
not serialized, so encoding is not a problem there. But some special
transformers pass data on to another library that does serialize it and might
need a hint to use the correct encoding. One example of this problem is the
NekoHTMLTransformer:
<a href="https://issues.apache.org/jira/browse/COCOON-2063">https://issues.apache.org/jira/browse/COCOON-2063</a>.</p>

<p>If you think a transformer in your pipeline might be doing things wrong, add
a <tt>TeeTransformer</tt> between each step and write the XML between the
transformers to temp1.xml, temp2.xml and so on. You can then look for the place
where your umlauts and special characters get messed up.</p>
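
<p>When inspecting such intermediate files, a typical symptom is UTF-8 text that
was decoded as ISO-8859-1 somewhere along the way. The sketch below is only a
heuristic with a hypothetical class name: it looks for the character pattern
that a two-byte UTF-8 sequence leaves behind after such a mix-up, and it can
produce false positives on text that legitimately contains those character
pairs.</p>

```java
// Hypothetical helper, not part of Cocoon.
public class MojibakeCheck {

    // Returns the index of the first character pair that looks like a
    // two-byte UTF-8 sequence read as ISO-8859-1, or -1 if none is found.
    // Only two-byte sequences are covered; this is a heuristic, not a proof.
    static int firstSuspectIndex(String text) {
        for (int i = 0; i + 1 < text.length(); i++) {
            char lead = text.charAt(i);
            char next = text.charAt(i + 1);
            // UTF-8 lead bytes 0xC2..0xDF show up as chars U+00C2..U+00DF
            boolean looksLikeLeadByte = lead >= 0x00C2 && lead <= 0x00DF;
            // UTF-8 continuation bytes 0x80..0xBF show up as U+0080..U+00BF
            boolean looksLikeContinuation = next >= 0x0080 && next <= 0x00BF;
            if (looksLikeLeadByte && looksLikeContinuation) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // "M\u00c3\u00bcller" is what "M\u00fcller" looks like after the mix-up.
        System.out.println(firstSuspectIndex("M\u00c3\u00bcller")); // prints 1
        System.out.println(firstSuspectIndex("M\u00fcller"));       // prints -1
    }
}
```

<p>Running such a check over temp1.xml, temp2.xml and so on (read explicitly as
UTF-8) can narrow down which pipeline step introduced the problem.</p>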

<h4 id="head-4ba9a12002e207573ddb09bba2c61b59dbc56a23">7. Your own XML
serializing Sources</h4>

<p>If you have your own Source implementation that needs to serialize XML, make
sure it will do that in UTF-8 as well. A good idea is to use Cocoon's XML
serializer, since we already configured that one to UTF-8 above. Sample code
that does that is here:
<a href="http://wiki.apache.org/cocoon/UseCocoonXMLSerializerCode">UseCocoonXMLSerializerCode</a>
</p>
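
<p>If Cocoon's serializer is not available in your code, the same idea can be
sketched with the JDK's own JAXP identity transformer. Note that this swaps in
plain JAXP instead of the Cocoon XML serializer recommended above; the class
and method names are hypothetical.</p>

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper using plain JAXP, not Cocoon's XMLSerializer.
public class Utf8XmlWriter {

    // Build a one-element document and serialize it to UTF-8 bytes.
    static byte[] serialize(String elementName, String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement(elementName);
            root.setTextContent(text);
            doc.appendChild(root);

            Transformer identity = TransformerFactory.newInstance().newTransformer();
            // Ask for UTF-8 output and keep the XML declaration.
            identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");

            // Writing to an OutputStream (not a Writer) lets the serializer
            // apply the requested encoding itself.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("XML serialization failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = serialize("greeting", "Gr\u00fc\u00dfe");
        System.out.println(xml.length + " bytes written");
    }
}
```

<p>The important design choice is to hand the serializer a byte stream and name
the encoding explicitly, rather than writing through a <tt>Writer</tt> that
silently uses the platform default.</p>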

<h3 id="head-51c043008b794ccad3f9e792e0b028ec79d95993">Older documentation</h3>

<h4 id="head-1b11fc4db515f4d1e371f179c95f8b5fc78f93ac">Basics</h4>

<p>If your Cocoon application needs to read request parameters that could
contain <em>special</em> characters, i.e. characters outside the 128-character
ASCII range, you'll need to pay attention to what encoding is used.</p>

<p>Normally a browser will send data to the server using the same encoding as
the page containing the submitted form. So if the pages are serialized using
UTF-8, the browser will submit form data using UTF-8. The user can change the
encoding, but it's quite safe to assume he/she won't do that (have you ever done
it?).</p>

<p><em>In my browser this is the case: it is set in the preferences to
ISO-8859-1 and it encodes form parameters with that, regardless of the UTF-8
content type of the page containing the form. I can't remember when I set this
property... So what to do in this case? It means it could be any
encoding.</em> --
<a href="http://wiki.apache.org/cocoon/AlexanderKlimetschek">AlexanderKlimetschek</a>
</p>

<p>After doing some tests with popular browsers, I've noticed that usually
browsers will not let the server know what encoding they used to encode the
parameters, so we need to make sure ourselves that the encoding used when
serializing pages corresponds to the encoding used when decoding request
parameters.</p>

<p>First of all, check in the sitemap what encoding is used when serializing
HTML pages: &lt;encoding&gt;UTF-8&lt;/encoding&gt;</p>

<pre>&lt;map:serializer logger="sitemap.serializer.html" mime-type="text/html"
       name="html" pool-grow="4" pool-max="32" pool-min="4"
       src="org.apache.cocoon.serialization.HTMLSerializer"&gt;
  &lt;buffer-size&gt;1024&lt;/buffer-size&gt;
  &lt;encoding&gt;UTF-8&lt;/encoding&gt;
&lt;/map:serializer&gt;
</pre>

<p>In the example above, UTF-8 is the encoding used. This is a widely supported
Unicode encoding, so it is often a good choice.</p>

<p>The HTML serializer will automatically insert a &lt;meta&gt; tag into the
HTML page's HEAD element specifying the encoding. Most browsers apparently
require this. The HTML serializer will however only do this if your page already
contains a HEAD (or head) element, so make sure it has one. The &lt;meta&gt;
element inserted by the serializer will then look as follows:</p>

<pre>&lt;meta http-equiv="Content-Type" content="text/html; charset=UTF-8"&gt;
</pre>

<p>Mozilla (tested with 1.4), Netscape 7.1 and Internet Explorer 6 will not
respond to the setting of this meta tag, whereas they do respond to the HTTP
response header "Content-Type". So you may have to subclass the HTMLSerializer
and let it add this header in order to get Mozilla and IE working.<br/>
-- <em>Someone added this last paragraph here. Good advice (haven't found time
to verify it yet though), but if this is the case we should fix this in Cocoon.
Patches welcome in bugzilla.
(<a href="http://wiki.apache.org/cocoon/BrunoDumon">BrunoDumon</a>).</em><br/>
-- <em>I can confirm it, and the effect is obvious when using a recent Tomcat
(&gt; 4.1.27):
<a href="http://issues.apache.org/bugzilla/show_bug.cgi?id=26997">Bug #26997</a>.
But AFAIK the above must read 'will not respond to the setting of this meta tag
<strong>if</strong> the encoding/charset in the "Content-Type" header is set'.
Cocoon's problem is that it does not set the encoding/charset, and recent Tomcat
versions set it to the default ISO-8859-1.
(<a href="http://wiki.apache.org/cocoon/JoergHeinicke">JoergHeinicke</a>)</em>
<br/>
-- <em>But you can make Cocoon set the header by configuring the serializer with
the correct mime-type information: </em></p>

<ul>
<li>
<pre>&lt;map:serializer name="html" mime-type="text/html; charset=utf-8"
       src="org.apache.cocoon.serialization.HTMLSerializer"
       logger="sitemap.serializer.html" 
       pool-grow="4" pool-max="32" pool-min="4"&gt;
  &lt;buffer-size&gt;1024&lt;/buffer-size&gt;
  &lt;encoding&gt;UTF-8&lt;/encoding&gt;
&lt;/map:serializer&gt;</pre>
</li>
</ul>

<p>The first <tt>charset=utf-8</tt> is needed for the HTTP header whereas
<tt>&lt;encoding&gt;UTF-8&lt;/encoding&gt;</tt> seems to be responsible only for
the encoding of the document's content. (Volkmar W. Pogatzki)</p>

<p>By default, if the browser doesn't explicitly mention the encoding, a
servlet container will decode request parameters using the ISO-8859-1 encoding
(independent of the platform on which the container is running). So in the above
case, where UTF-8 was used when serializing, we would be facing problems.</p>

<p><em>Note: Jetty uses
<a href="http://docs.codehaus.org/display/JETTY/International+Characters+and+Character+Encodings">UTF-8
as default for decoding form parameters</a>! So you have to use the
<tt>SetCharacterEncodingFilter</tt> (see below) to set the encoding for Jetty to
ISO-8859-1 if this is what the browser sends.</em>
-- <a href="http://wiki.apache.org/cocoon/AlexanderKlimetschek">AlexanderKlimetschek</a>
</p>

<p>The encoding to use when decoding request parameters can be configured in the
web.xml by supplying init parameters called "form-encoding" and
"container-encoding" to the Cocoon servlet. The container-encoding parameter
indicates which encoding the container used when it tried to decode the request
parameters (normally ISO-8859-1), and the form-encoding parameter indicates the
actual encoding. Here's an example of how to specify the parameters in the
web.xml:</p>

<pre>&lt;init-param&gt;
  &lt;param-name&gt;container-encoding&lt;/param-name&gt;
  &lt;param-value&gt;ISO-8859-1&lt;/param-value&gt;
&lt;/init-param&gt;
&lt;init-param&gt;
  &lt;param-name&gt;form-encoding&lt;/param-name&gt;
  &lt;param-value&gt;UTF-8&lt;/param-value&gt;
&lt;/init-param&gt;
</pre>

<p>For Java-insiders: what Cocoon actually does internally is apply the
following trick to get a parameter correctly decoded: suppose "value" is a
string containing a request parameter, then Cocoon will do:</p>

<pre>value = new String(value.getBytes("ISO-8859-1"), "UTF-8");
</pre>

<p>So it recodes the incorrectly decoded string back to bytes and decodes it
using the correct encoding.</p>
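
<p>The trick works because ISO-8859-1 maps every byte value to exactly one
character, so the original bytes can be recovered without loss. A minimal,
self-contained sketch of the round trip follows; the class and method names are
hypothetical, and the parameter names merely mirror the two init parameters
described above:</p>

```java
import java.nio.charset.Charset;

// Hypothetical demo class; it mirrors the one-liner above, not Cocoon's code.
public class RecodeDemo {

    // Turn a string that the container decoded with containerEncoding back
    // into bytes, then decode those bytes with the real formEncoding.
    static String recode(String value, String containerEncoding, String formEncoding) {
        byte[] rawBytes = value.getBytes(Charset.forName(containerEncoding));
        return new String(rawBytes, Charset.forName(formEncoding));
    }

    public static void main(String[] args) {
        // Simulate the container: UTF-8 bytes of "M\u00fcller" read as ISO-8859-1.
        byte[] sent = "M\u00fcller".getBytes(Charset.forName("UTF-8"));
        String garbled = new String(sent, Charset.forName("ISO-8859-1"));

        String fixed = recode(garbled, "ISO-8859-1", "UTF-8");
        System.out.println(fixed.equals("M\u00fcller")); // prints true

        // Recoding a value that was already decoded correctly damages it.
        String damaged = recode("M\u00fcller", "ISO-8859-1", "UTF-8");
        System.out.println(damaged.equals("M\u00fcller")); // prints false
    }
}
```

<p>The second call shows why the correction must be applied exactly once: a
parameter that was already decoded correctly is corrupted by recoding it.</p>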

<h3 id="head-204b2134b5537d081ff4fc107e4b599afbf14804">Locally overriding the
form-encoding</h3>

<p>Cocoon is ideally suited for publishing to different kinds of devices, and
certain devices may well require different encodings. In this case, you can
redefine the form-encoding for specific pipelines using the
SetCharacterEncodingAction.</p>

<p>To use it, first of all make sure the action is declared in the map:actions
element of the sitemap:</p>

<pre>&lt;map:action name="set-encoding" src="org.apache.cocoon.acting.SetCharacterEncodingAction"/&gt;
</pre>

<p>and then call the action at the required location as follows:</p>

<pre>&lt;map:act type="set-encoding"&gt;
  &lt;map:parameter name="form-encoding" value="some-other-encoding"/&gt;
&lt;/map:act&gt;
</pre>

<h3 id="head-62ca318f2aba2ceeb1cba2e2445e5dd60e40ae8e">Problems with components
using the original HttpServletRequest (JSPGenerator, ...)</h3>

<p>Some components such as the JSPGenerator use the original HttpServletRequest
object, instead of the Cocoon Request object. In that case, the correct decoding
of request parameters will not happen (that is, if for example the JSP page
itself would read request parameters).</p>

<p>One possible solution would be to patch these components to use a wrapper
class that delegates all calls to the HttpServletRequest object, except for the
getParameter or getParameterValues methods, which should be delegated to
Cocoon's Request object.</p>

<p>There's an easier solution that can be applied right away if your servlet
container supports the Servlet 2.3 specification. Starting from 2.3, the Servlet
specification allows you to explicitly set the encoding to be used for decoding
request parameters, though this has to happen before the first request data is
read. Since Cocoon reads request parameters itself (such as cocoon-reload), this
would require modification of the CocoonServlet. But it can also be done using a
servlet filter. Tomcat 4 contains just such a filter in its "examples" webapp.
Look for the file
jakarta-tomcat/webapps/examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java.
Compile it (with servlet.jar in the classpath), put it in a jar (using the
correct package structure) and put the jar in your webapp's WEB-INF/lib
directory.</p>

<p>Now modify your webapp's web.xml file to include the following (after the
display-name and description elements, but before the servlet element):</p>

<pre>&lt;filter&gt;
  &lt;filter-name&gt;Set Character Encoding&lt;/filter-name&gt;
  &lt;filter-class&gt;filters.SetCharacterEncodingFilter&lt;/filter-class&gt;
  &lt;init-param&gt;
    &lt;param-name&gt;encoding&lt;/param-name&gt;
    &lt;param-value&gt;UTF-8&lt;/param-value&gt;
  &lt;/init-param&gt;
&lt;/filter&gt;

&lt;filter-mapping&gt;
  &lt;filter-name&gt;Set Character Encoding&lt;/filter-name&gt;
  &lt;url-pattern&gt;/*&lt;/url-pattern&gt;
&lt;/filter-mapping&gt;
</pre>

<p>Since the filter element is new in the servlet 2.3 specification, you might
need to modify the DOCTYPE declaration in the web.xml:</p>

<pre>&lt;!DOCTYPE web-app
    PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
    "http://java.sun.com/dtd/web-app_2_3.dtd"&gt;
</pre>

<p>Of course, when using a servlet filter to set the encoding, you should not
supply the form-encoding init parameter anymore in the web.xml. You could still
supply the container-encoding parameter, though its value will now have to be
the same as the encoding supplied to the filter. This will allow you to override
the form-encoding using the SetCharacterEncodingAction, though only for the
Cocoon Request object.</p>

<p>Using a servlet filter also has the advantage that it will work for any
servlet. Suppose your webapp consists of multiple servlets, with Cocoon being
only one of them. Sometimes the processing could start in another servlet (which
sets the character encoding correctly) and then be forwarded to Cocoon, while
other times the processing could start immediately in the Cocoon servlet. It
would then be impossible to know in Cocoon whether the request parameter
encoding needs to be corrected or not.</p>

<h3 id="head-c284f762c307023b1f26eb18589fcfcd71c196e1">Operating System
Preliminaries</h3>

<p>Working with non-English characters may also pose problems depending on the
operating system settings, as the JVM seems to detect the system language. If,
e.g., German umlauts should be correctly processed with Cocoon on Linux, it is
required to set the LANG environment variable to de like this:</p>

<p><tt>export LANG=de</tt></p>

<p><em>The remark in this last paragraph won't have any influence on request
parameter decoding, though it might help for other things (reading text files,
communication with database, ...)</em> --
<a href="http://wiki.apache.org/cocoon/BrunoDumon">BrunoDumon</a></p>

<p>(That's one of several ways of setting the JVM locale, see also
<a href="http://wiki.apache.org/cocoon/SettingTheJvmLocale">SettingTheJvmLocale</a>).
</p>
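
<p>Independently of how the JVM locale is set, code that reads or writes text
can avoid depending on it by naming the charset explicitly. The following is a
small sketch with a hypothetical class name. Which default the JVM picks varies:
older JVMs derive it from the environment, while Java 18 and later default to
UTF-8.</p>

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class contrasting explicit and default charsets.
public class ExplicitCharsetDemo {

    // Decoding with an explicit charset gives the same result on every machine.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decoding without a charset uses the JVM default, which may depend on
    // the operating system settings or on JVM startup options.
    static String decodeWithPlatformDefault(byte[] bytes) {
        return new String(bytes);
    }

    public static void main(String[] args) {
        // UTF-8 bytes of the letter u with umlaut (U+00FC).
        byte[] utf8Bytes = {(byte) 0xC3, (byte) 0xBC};

        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println(decodeUtf8(utf8Bytes).equals("\u00fc")); // prints true
        // The next result depends on the default charset of the running JVM.
        System.out.println(decodeWithPlatformDefault(utf8Bytes).equals("\u00fc"));
    }
}
```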

<p><em>Just came across this today:
<a href="http://www.w3.org/TR/REC-html40/interact/forms.html#adef-accept-charset">http://www.w3.org/TR/REC-html40/interact/forms.html#adef-accept-charset</a>
This looks like related material we should investigate (and test the current
browsers for).</em> --
<a href="http://wiki.apache.org/cocoon/MarcPortier">MarcPortier</a></p>

<h3 id="head-c34db4c0500be7b68ebc1b5b192295649a3ee14d">More readings</h3>

<ul>
<li>
<p>
<a href="http://marc.theaimsgroup.com/?t=106760662600010&amp;r=1&amp;w=2">cocoon's
defaults form-encoding and serialize-encoding</a>:
<a href="http://wiki.apache.org/cocoon/MarcPortier">MarcPortier</a>'s proposal to
remove inconsistencies in the way Cocoon handles the encoding of serialized text
and request-parameter decoding.
<a href="http://marc.theaimsgroup.com/?l=xml-cocoon-dev&amp;m=106772461923197&amp;w=2">This</a>
is a good summary of the thread.</p>
</li>
</ul>

<h3 id="head-78c28b1a6fbe166e5ede513c661c0949d801a1e5">What about file names for
uploaded files?</h3>

<p>I saw some reports on the users list about file names being <em>wrongly</em>
encoded, i.e. though everything else on a form works, the file name of an
uploaded file is wrong when <em>special</em> characters were used. I never
tested it though.
(<a href="http://wiki.apache.org/cocoon/JoergHeinicke">JoergHeinicke</a>)<br/>
<a href="http://issues.apache.org/bugzilla/show_bug.cgi?id=24289">Bug 24289</a>
- "MultipartParser cannot handle multibyte character in uploaded file name
correctly" might explain it.
(<a href="http://wiki.apache.org/cocoon/JoergHeinicke">JoergHeinicke</a>)</p>

</body>
</html>

Collections
===========
The document belongs to the following collections: cdocs-site-main
