cocoon-docs mailing list archives

Subject [DAISY] Updated: How to configure consistent encoding in Cocoon
Date Fri, 11 May 2007 10:12:48 GMT
A document has been updated:

Document ID: 1366
Branch: main
Language: default
Name: How to configure consistent encoding in Cocoon (previously How to configure UTF-8 encoding
for I18N everywhere)
Document Type: Cocoon Document (unchanged)
Updated on: 5/11/07 10:12:36 AM
Updated by: Alexander Klimetschek

A new version has been created, state: draft


This part has been updated.
Mime type: text/xml (unchanged)
File name:  (unchanged)
Size: 17221 bytes (previous version: 21105 bytes)
Content diff:
--- <h2 id="head-7be1dfafacbc6fb8e02d38cb177abb4a2030defc">How to configure UTF-8
--- encoding for I18N everywhere</h2>
    <p>The best for internationalization is to handle everything in UTF-8, since
    this is probably the most intelligent encoding available out there. Everything
    means server side (Backend, XML), HTTP Requests/Responses and client side with
--- forms and</p>
+++ forms and If you need another encoding, simply replace all
+++ occurrences of UTF-8 with that one, but note that this guide was only tested
+++ with UTF-8; other encodings might not be supported everywhere.</p>
    <h4 id="head-b0e1772fd963c0cc72ccf58d5cada0c5797046c0">1. Sending all pages in
(28 equal lines skipped)
    <p>If you use CForms with Ajax enabled, Cocoon will make use of Dojo
--- under the hood, which creates
--- XML<a href="">HttpRequests</a> that
--- POST the form data to the server. Here Dojo decides the encoding by default,
--- which does not match the browser's behaviour of using the charset defined in the
--- META tag. But you can easily tell Dojo which formatting to use for all
--- calls, just include that in the top of your HTML pages, before
--- dojo.js is included:</p>
+++ under the hood, which creates XMLHttpRequests that POST the form data to the
+++ server. Here Dojo decides the encoding by default, which does not match the
+++ browser's behaviour of using the charset defined in the META tag. But you can
+++ easily tell Dojo which encoding to use for all calls: just
+++ include this at the top of your HTML pages, before dojo.js is included:</p>
    <pre>&lt;script&gt;djConfig = { bindEncoding: "utf-8" };&lt;/script&gt;
(114 equal lines skipped)
    <a href="">UseCocoonXMLSerializerCode</a>
--- <h3 id="head-51c043008b794ccad3f9e792e0b028ec79d95993">Older documentation</h3>
+++ <h2>Further information</h2>
--- <h4 id="head-1b11fc4db515f4d1e371f179c95f8b5fc78f93ac">Basics</h4>
+++ <h4 id="head-1b11fc4db515f4d1e371f179c95f8b5fc78f93ac">Browser encoding basics
+++ </h4>
+++ <h5>Getting pages</h5>
    <p>If your Cocoon application needs to read request parameters that could
    contain <em>special</em> characters, i.e. characters outside the ASCII
    range, you'll need to pay attention to what encoding is used.</p>
(4 equal lines skipped)
    can change the encoding, but it's quite safe to assume he/she won't do that
    (have you ever done it?).</p>
--- <p><em>In my browser this is the case, it is set in the preferences to
--- ISO-8859-1 and he encodes form parameters with that, regardless of the UTF-8
--- content type of the page containing the form. I can't remember when I did set
--- this property... So what to do with this case? This means, it could be any
--- encoding.</em> --
--- <a href="">AlexanderKlimetschek</a>
--- </p>
+++ <p>The browser will read the encoding either from the &lt;meta&gt;
+++ inside the HTML &lt;head&gt;:</p>
--- <p>After doing some tests with popular browsers, I've noticed that usually
--- browsers will not let the server know what encoding they used to encode the
--- parameters, so we need to make sure ourselves that the encoding used when
--- serializing pages corresponds to the encoding used when decoding request
--- parameters.</p>
--- <p>First of all, check in the sitemap what encoding is used when serializing
--- HTML pages: &lt;encoding&gt;UTF-8&lt;/encoding&gt;</p>
--- <pre>&lt;map:serializer logger="sitemap.serializer.html" mime-type="text/html"
---        name="html" pool-grow="4" pool-max="32" pool-min="4"
---        src="org.apache.cocoon.serialization.HTMLSerializer"&gt;
---   &lt;buffer-size&gt;1024&lt;/buffer-size&gt;
---   &lt;encoding&gt;UTF-8&lt;/encoding&gt;
--- &lt;/map:serializer&gt;
+++ <pre>&lt;meta http-equiv="Content-Type" content="text/html; charset=UTF-8"&gt;
--- <p>In the example above, UTF-8 is the encoding used. This is a widely supported
--- Unicode encoding, so it is often a good choice.</p>
+++ <p>or from the HTTP Header Content-Type:</p>
--- <p>The HTML serializer will automatically insert a &lt;meta&gt; tag into
--- HTML page's HEAD element specifying the encoding. Most browsers apparently
--- require this. The HTML serializer will however only do this if your page already
--- contains a HEAD (or head) element, so make sure it has one. The &lt;meta&gt;
--- element inserted by the serializer will then look as follows:</p>
--- <pre>&lt;meta http-equiv="Content-Type" content="text/html; charset=UTF-8"&gt;
+++ <pre>Content-Type: text/html; charset=UTF-8
--- <p>Mozilla (tested with 1.4), netscape 7.1 and Internet Explorer 6 will not
--- respond to the setting of this meta tag, whereas they do respond to the http
--- response header "Content-Type". So you may have to subclass the HTMLSerializer
--- and let it add this header in order to get Mozilla and IE working.<br/>
--- -- <em>Someone added this last paragraph here. Good advice (haven't found time
--- to verify it yet though), but if this is the case we should fix this in Cocoon.
--- Patches welcome in bugzilla.
--- (<a href="">BrunoDumon</a>).</em><br/>
--- -- <em>I can confirm it and the effect is obvious when using a recent Tomcat
--- (&gt; 4.1.27):
--- <a href=""><img width="11"
--- height="11" src=""/>
--- Bug #26997</a>. But AFAIK the above must read 'will not respond to the setting
--- of this meta tag <strong>if</strong> the encoding/charset in the "Content-Type"
--- header is set' and Cocoon's problem is, that it does not set the
--- encoding/charset and the recent Tomcats sets it to default ISO-8859-1.
--- (<a href="">JoergHeinicke</a>)</em>
--- <br/>
--- -- <em>But you can make Cocoon set the header by configuring the serializer with
--- the correct mime-type information: </em></p>
+++ <p>One has to include both to support all browsers. This will be done by the
+++ HTML serializer if you configure it with the parameters mime-type and encoding,
+++ as stated above.</p>
--- <ul>
--- <li>
--- <pre>&lt;map:serializer name="html" mime-type="text/html; charset=utf-8"
---        src="org.apache.cocoon.serialization.HTMLSerializer"
---        logger="sitemap.serializer.html" 
---        pool-grow="4" pool-max="32" pool-min="4"&gt;
---   &lt;buffer-size&gt;1024&lt;/buffer-size&gt;
---   &lt;encoding&gt;UTF-8&lt;/encoding&gt;
--- &lt;/map:serializer&gt;</pre>
--- </li>
--- </ul>
+++ <h5>Sending form data</h5>
--- <p>The first <tt>charset=utf-8</tt> is needed for the HTTP header whereas
--- <tt>&lt;encoding&gt;UTF-8&lt;/encoding&gt;</tt> seems to be
--- responsible for the
--- encoding only of the document's content. (Volkmar W. Pogatzki)</p>
    <p>By default, if the browser doesn't explicitly mention the encoding, a
    servlet container will decode request parameters using the ISO-8859-1 encoding
    (independent of the platform on which the container is running). So in the above
--- case where UTF-8 was used when serializing, we would be facing problems.</p>
--- <p><em>Note: Jetty uses
--- [<a href=""><img
--- width="11" height="11" src=""/>
--- UTF-8 as default for decoding form parameters</a>]! So you have to use the
--- <tt>SetCharacterEncodingFilter</tt> (see below) to set the encoding for Jetty
--- ISO-8859-1 if this is what the browser sends.</em>
--- --<a href="">AlexanderKlimetschek</a>
+++ case where UTF-8 was used when serializing, we would be facing problems. One
+++ exception, which can hide the problem and which you will encounter when you use
+++ the handy mvn jetty:run to run your Cocoon application, is that Jetty uses UTF-8
+++ by default; it does not adhere to the servlet container standard here. So either
+++ configure your container with the default encoding you want (e.g. UTF-8), if
+++ that is possible, or use a solution like the
+++ <a href="">SetCharacterEncodingFilter</a>.
--- <p>The encoding to use when decoding request parameters can be configured in the
--- web.xml by supplying init parameters called "form-encoding" and
--- "container-encoding" to the Cocoon servlet. The container-encoding parameter
--- indicates according to what encoding the container tried to decode the request
--- parameters (normally ISO-8859-1), and the form-encoding parameter indicates the
--- actual encoding. Here's an example of how to specify the parameters in the
--- web.xml:</p>
+++ <h4 id="head-1b11fc4db515f4d1e371f179c95f8b5fc78f93ac">Request parameter
+++ encoding in Cocoon</h4>
--- <pre>&lt;init-param&gt;
---   &lt;param-name&gt;container-encoding&lt;/param-name&gt;
---   &lt;param-value&gt;ISO-8859-1&lt;/param-value&gt;
--- &lt;/init-param&gt;
--- &lt;init-param&gt;
---   &lt;param-name&gt;form-encoding&lt;/param-name&gt;
---   &lt;param-value&gt;UTF-8&lt;/param-value&gt;
--- &lt;/init-param&gt;
--- </pre>
    <p>For Java-insiders: what Cocoon actually does internally is apply the
    following trick to get a parameter correctly decoded: suppose "value" is a
    string containing a request parameter, then Cocoon will do:</p>
(151 equal lines skipped)
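The listing that follows the paragraph above falls inside the skipped lines, so it is not reproduced in this notification. As a rough illustration only, the re-decoding trick it describes might look like the sketch below. It assumes the container decoded the parameter as ISO-8859-1 while the browser actually sent UTF-8; the class name RedecodeParameter and the method redecode are hypothetical and are not taken from Cocoon's source.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper, not Cocoon's actual code: it illustrates the
// "re-decode" idea described in the paragraph above.
public class RedecodeParameter {

    /**
     * Turns a wrongly decoded parameter back into bytes using the encoding
     * the container applied (assumed here: ISO-8859-1), then decodes those
     * bytes with the encoding the browser really used (assumed: UTF-8).
     */
    static String redecode(String value, String containerEncoding, String formEncoding)
            throws UnsupportedEncodingException {
        if (value == null) {
            return null;
        }
        return new String(value.getBytes(containerEncoding), formEncoding);
    }

    public static void main(String[] args) throws Exception {
        String original = "caf\u00e9";
        // Simulate the container: the UTF-8 bytes sent by the browser get
        // decoded as ISO-8859-1, which yields a garbled string.
        String misdecoded = new String(original.getBytes("UTF-8"), "ISO-8859-1");
        String repaired = redecode(misdecoded, "ISO-8859-1", "UTF-8");
        System.out.println(repaired.equals(original));
    }
}
```

The round trip is lossless only because ISO-8859-1 maps every byte value to exactly one character; if the container had decoded with a different encoding, the original bytes might not be recoverable this way.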
