cocoon-docs mailing list archives

From da...@cocoon.zones.apache.org
Subject [DAISY] Updated: How to configure consistent encoding in Cocoon
Date Fri, 11 May 2007 10:47:29 GMT
A document has been updated:

http://cocoon.zones.apache.org/daisy/documentation/1366.html

Document ID: 1366
Branch: main
Language: default
Name: How to configure consistent encoding in Cocoon (unchanged)
Document Type: Cocoon Document (unchanged)
Updated on: 5/11/07 10:47:25 AM
Updated by: Alexander Klimetschek

A new version has been created, state: draft

Parts
=====

Content
-------
This part has been updated.
Mime type: text/xml (unchanged)
File name:  (unchanged)
Size: 14254 bytes (previous version: 17221 bytes)
Content diff:
    <html>
    <body>
    
--- <p>The best for internationalization is to handle everything in UTF-8, since
--- this is probably the most intelligent encoding available out there. Everything
--- means server side (Backend, XML), HTTP Requests/Responses and client side with
--- forms and dojo.io.bind. If you need another encoding, simply replace all
--- occurrences of UTF-8 with that one, but note that this guide was only tested
--- with UTF-8, other encodings might not be supported at all places.</p>
+++ <p>The best approach to internationalization, i.e. support for umlauts,
+++ special characters and non-English languages, is to handle everything in
+++ UTF-8, since this is probably the most capable encoding available. If you need
+++ another encoding, simply replace all occurrences of UTF-8 with that one, but
+++ note that this guide was only tested with UTF-8; other encodings might not be
+++ supported everywhere.</p>
    
--- <h4 id="head-b0e1772fd963c0cc72ccf58d5cada0c5797046c0">1. Sending all pages in
--- UTF-8</h4>
+++ <p>The following How-To covers the typical steps to achieve a consistent
+++ encoding everywhere. Some <a href="#theory">Background Information</a> can be
+++ found at the end of this page.</p>
    
+++ <h3>1. Sending all pages in UTF-8</h3>
+++ 
    <p>You need to configure Cocoon's serializers to UTF-8. The XML serializer
    (<tt>&lt;serialize type="xml" /&gt;</tt>) and the HTML serializer
    (<tt>&lt;serialize type="html" /&gt;</tt>) need to be configured. To support all
(20 equal lines skipped)
    &lt;/serializer&gt;
    </pre>
    
--- <h4 id="head-58d760c1af59400884a162695db4e7ab167ec0ac">2. AJAX Requests with
--- CForms/Dojo</h4>
+++ <h3>2. AJAX Requests with CForms/Dojo</h3>
    
    <p>If you use CForms with ajax enabled, Cocoon will make use of dojo.io.bind()
    under the hood, which creates XMLHttpRequests that POST the form data to the
(8 equal lines skipped)
    <p>You might already have other djConfig options, then simply add the
    <tt>bindEncoding</tt> property to the hash map.</p>
    
--- <h4 id="head-2e259f641e7e2f53c8cafa65d6863e85602340c8">3. Decoding incoming
--- requests: Servlet Container</h4>
+++ <h3>3. Decoding incoming requests: Servlet Container</h3>
    
    <p>When the browser sends data to your server, e.g. form data, the
    <tt>ServletRequest</tt> will be created by your servlet container, which needs
(11 equal lines skipped)
    <tt>ServletRequest.setCharacterEncoding()</tt>. To do that for all your
    requests, you can use a servlet filter like this one:
    <a href="http://wiki.apache.org/cocoon/SetCharacterEncodingFilter">SetCharacterEncodingFilter</a>.
--- </p>
+++ Put this into one of your Cocoon blocks under
+++ <tt>src/main/java/my/package/filters/SetCharacterEncodingFilter</tt> so that the
+++ class ends up in a jar that lands in <tt>WEB-INF/lib</tt> and is thus
+++ available for use in the web.xml configuration.</p>
    
    <p>Then you add the filter to the web.xml:</p>
    
    <pre>&lt;filter&gt;
      &lt;filter-name&gt;Set Character Encoding&lt;/filter-name&gt;
---   &lt;filter-class&gt;filters.SetCharacterEncodingFilter&lt;/filter-class&gt;
+++   &lt;filter-class&gt;my.package.filters.SetCharacterEncodingFilter&lt;/filter-class&gt;
      &lt;init-param&gt;
        &lt;param-name&gt;encoding&lt;/param-name&gt;
        &lt;param-value&gt;UTF-8&lt;/param-value&gt;
(33 equal lines skipped)
        "http://java.sun.com/dtd/web-app_2_3.dtd"&gt;
    </pre>
    
--- <h4 id="head-535456e18a08cddc5006ce8cec7a6317373f0420">4. Setting Cocoon's
--- encoding (especially CForms)</h4>
+++ <h3>4. Setting Cocoon's encoding (especially CForms)</h3>
    
    <p>To tell Cocoon to use UTF-8 internally, you have to set 2 properties:</p>
    
(2 equal lines skipped)
    </pre>
    
    <p>They need to be in some <tt>*.properties</tt> file under
--- <tt>META-INF/cocoon/properties</tt> in one of your blocks.</p>
+++ <tt>META-INF/cocoon/properties</tt> in one of your blocks. Note that the
+++ containerencoding must be the same as the one you specified in the
+++ SetCharacterEncodingFilter. But here we are using UTF-8 everywhere anyway.</p>
    
--- <h4 id="head-5f8a0d453df12b9e2ea517bca5c8d03baa9ba131">5. XML Files</h4>
+++ <h3>5. XML Files</h3>
    
    <p>This is normally not a problem, since the standard encoding for XML files is
    UTF-8. However, they should always start with the following instruction, which
(3 equal lines skipped)
    <pre>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
    </pre>
    
--- <h4 id="head-e552435cf6a174a6eefbebb1baf13b99e3074abd">6. Special Transformers
--- </h4>
+++ <h3>6. Special Transformers</h3>
    
    <p>The standard XSLT Transformers and others are working on SAX events, which
    are not serialized, thus encoding is not a problem. But there are some special
(8 equal lines skipped)
    between the transformers into temp1.xml, temp2.xml and so on to look for the
    place where your umlauts and special characters are messed up.</p>
    
--- <h4 id="head-4ba9a12002e207573ddb09bba2c61b59dbc56a23">7. Your own XML
--- serializing Sources</h4>
+++ <h3>7. Your own XML serializing Sources</h3>
    
    <p>If you have your own Source implementation that needs to serialize XML, make
    sure it will do that in UTF-8 as well. A good idea is to use Cocoon's XML
(2 equal lines skipped)
    <a href="http://wiki.apache.org/cocoon/UseCocoonXMLSerializerCode">UseCocoonXMLSerializerCode</a>
    </p>
    
--- <h2>Further information</h2>
+++ <h2><a>Further information</a></h2>
    
--- <h4 id="head-1b11fc4db515f4d1e371f179c95f8b5fc78f93ac">Browser encoding basics
--- </h4>
+++ <h3>Browser encoding basics</h3>
    
--- <h5>Getting pages</h5>
+++ <h4>Getting pages</h4>
    
    <p>If your Cocoon application needs to read request parameters that could
    contain <em>special</em> characters, i.e. characters outside of the first 128
(20 equal lines skipped)
    HTML serializer if you configure it with the parameters mime-type and encoding,
    as stated above.</p>
    
--- <h5>Sending form data</h5>
+++ <h4>Sending form data</h4>
    
    <p>By default, if the browser doesn't explicitly mention the encoding, a
    servlet container will decode request parameters using the ISO-8859-1 encoding
    (independent of the platform on which the container is running). So in the above
    case where UTF-8 was used when serializing, we would be facing problems. An
    exception, that might hide the problem and which you will face when you use the
    handy mvn jetty:run to run your Cocoon application, is that Jetty uses UTF-8 by
--- default. It does not adhere to the servlet container standard here. So you can
--- configure your container with the default encoding you want (e.g. UTF-8), if
--- that is possible, or you must use a solution like the
+++ default. It does not adhere to the servlet container standard here.</p>
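
The mismatch described in this paragraph can be reproduced outside of any servlet container. The following is a minimal sketch, not Cocoon or container code: the class name FormDecodingDemo and its decode helper are made up for illustration, and java.net.URLDecoder merely stands in for the parameter decoding a container performs. It decodes the same percent-encoded form value once as UTF-8 and once as ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, not part of Cocoon or any servlet container.
final class FormDecodingDemo {

    // Decodes a percent-encoded form value using the given charset name.
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // U+00FC (u with diaeresis) as a browser would send it from a UTF-8 page.
        String raw = "%C3%BC";

        // Decoded with the encoding the browser actually used: one character.
        System.out.println(decode(raw, "UTF-8").length());

        // Decoded with the ISO-8859-1 default: two unrelated characters.
        System.out.println(decode(raw, "ISO-8859-1").length());
    }
}
```

The two-character result is the typical garbage seen in place of an umlaut when a UTF-8 form submission is decoded as ISO-8859-1.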
+++ 
+++ <p>You either have to configure your container with the default encoding you
+++ want (e.g. UTF-8), if that is possible, or you must use a servlet-filter
+++ solution like the
    <a href="http://wiki.apache.org/cocoon/SetCharacterEncodingFilter">SetCharacterEncodingFilter</a>.
--- </p>
+++ Using a servlet filter also has the advantage that it will work for any servlet.
+++ Suppose your webapp consists of multiple servlets, with Cocoon being only one of
+++ them. Sometimes the processing could start in another servlet (which sets the
+++ character encoding correctly) and then be forwarded to Cocoon, while other times
+++ the processing could start immediately in the Cocoon servlet. It would then be
+++ impossible to know in Cocoon whether the request parameter encoding needs to be
+++ corrected or not (see below).</p>
    
--- <h4 id="head-1b11fc4db515f4d1e371f179c95f8b5fc78f93ac">Request parameter
--- encoding in Cocoon</h4>
+++ <h3>Request parameter decoding in Cocoon</h3>
    
+++ <h4>Fixing a wrong servlet container</h4>
+++ 
+++ <p>If you are not able to set the default encoding for your servlet container to
+++ what you actually want, it is possible to configure Cocoon to re-decode
+++ parameters properly. Suppose the servlet container has ISO-8859-1 default
+++ encoding set, but the requests from the browser are actually encoded in UTF-8.
+++ Then you can configure Cocoon with these properties:</p>
+++ 
+++ <pre>org.apache.cocoon.containerencoding=iso-8859-1
+++ org.apache.cocoon.formencoding=utf-8
+++ </pre>
+++ 
    <p>For Java-insiders: what Cocoon actually does internally is apply the
    following trick to get a parameter correctly decoded: suppose "value" is a
    string containing a request parameter, then Cocoon will do:</p>
(2 equal lines skipped)
    </pre>
    
    <p>So it recodes the incorrectly decoded string back to bytes and decodes it
--- using the correct encoding.</p>
+++ using the correct encoding. The first (ISO-8859-1 in the example) is the
+++ containerencoding, the second one the formencoding.</p>
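
As a self-contained illustration of this re-decoding trick, here is a small sketch. It is not Cocoon's actual code: the class RedecodeDemo and the method redecode are hypothetical names, and the two charset arguments only play the roles of containerencoding and formencoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class illustrating the recode-and-decode idea.
final class RedecodeDemo {

    // Turns a wrongly decoded string back into the original bytes using the
    // container encoding, then decodes those bytes with the form encoding.
    static String redecode(String value, Charset containerEncoding, Charset formEncoding) {
        byte[] originalBytes = value.getBytes(containerEncoding);
        return new String(originalBytes, formEncoding);
    }

    public static void main(String[] args) {
        // What the container hands over after decoding the UTF-8 bytes
        // C3 BC as ISO-8859-1: the two characters U+00C3 and U+00BC.
        String wronglyDecoded = "\u00c3\u00bc";

        String fixed = redecode(wronglyDecoded,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        // The repaired value is the single character U+00FC.
        System.out.println(fixed.equals("\u00fc"));
    }
}
```

The round trip is lossless in this direction because ISO-8859-1 maps every byte value to exactly one character. With a container encoding that cannot represent every byte, the original bytes may no longer be recoverable.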
    
--- <h3 id="head-204b2134b5537d081ff4fc107e4b599afbf14804">Locally overriding the
--- form-encoding</h3>
+++ <p>Note that this only works for core Cocoon concepts, e.g. sitemaps, CForms and
+++ others accessing the request parameters. There are other components, e.g. the
+++ JSPGenerator, that access the original HttpServletRequest object and thus do not
+++ get the correctly re-decoded parameter values (for example, if the JSP page
+++ itself reads request parameters). The only working solution here seems to be
+++ the servlet filter.</p>
    
+++ <h4>Locally overriding the form-encoding</h4>
+++ 
    <p>Cocoon is ideally suited for publishing to different kinds of devices, and it
    may well be possible that for certain devices, it is required to use different
    encodings. In this case, you can redefine the form-encoding for specific
(12 equal lines skipped)
    &lt;/map:act&gt;
    </pre>
    
--- <h3 id="head-62ca318f2aba2ceeb1cba2e2445e5dd60e40ae8e">Problems with components
--- using the original !HttpServletRequest (JSPGenerator, ...)</h3>
+++ <h3>Operating System Preliminaries</h3>
    
--- <p>Some components such as the JSPGenerator use the original HttpServletRequest
--- object, instead of the Cocoon Request object. In that case, the correct decoding
--- of request parameters will not happen (that is, if for example the JSP page
--- itself would read request parameters).</p>
+++ <p>Operating system language settings do not influence request parameter
+++ decoding, but they sometimes cause trouble with text files, database
+++ communication, etc. Working with non-English characters may pose problems, as
+++ the JVM seems to detect the system language. If, e.g., German umlauts should be
+++ correctly processed with Cocoon on Linux, it is required to set the LANG
+++ environment variable to de like this:</p>
    
--- <p>One possible solution would be to patch these components to use a wrapper
--- class that delegates all calls to the HttpServletRequest object, except for the
--- getParameter or getParameterValues methods, which should be delegated to
--- Cocoon's Request object.</p>
--- 
--- <p>There's an easier solution that can be applied right away if your servlet
--- container supports the Servlet 2.3 specification. Starting from 2.3, the Servlet
--- specification allows to explicitely set the encoding to be used for decoding
--- request parameters, though this has to happen before the first request data is
--- read. Since Cocoon reads request parameters itself (such as cocoon-reload), this
--- would require modification of the CocoonServlet. But it can also be done using a
--- servlet filter. Tomcat 4 contains just such a filter in its "examples" webapp.
--- Look for the file
--- jakarta-tomcat/webapps/examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.java.
--- Compile it (with servlet.jar in the classpath), put it in a jar (using correct
--- package and such) and put the jar in your webapps WEB-INF/lib directory.</p>
--- 
--- <p>Now modify your webapp's web.xml file to include the following (after the
--- display-name and description elements, but before the servlet element):</p>
--- 
--- <pre>&lt;filter&gt;
---   &lt;filter-name&gt;Set Character Encoding&lt;/filter-name&gt;
---   &lt;filter-class&gt;filters.SetCharacterEncodingFilter&lt;/filter-class&gt;
---   &lt;init-param&gt;
---     &lt;param-name&gt;encoding&lt;/param-name&gt;
---     &lt;param-value&gt;UTF-8&lt;/param-value&gt;
---   &lt;/init-param&gt;
--- &lt;/filter&gt;
--- 
--- &lt;filter-mapping&gt;
---   &lt;filter-name&gt;Set Character Encoding&lt;/filter-name&gt;
---   &lt;url-pattern&gt;/*&lt;/url-pattern&gt;
--- &lt;/filter-mapping&gt;
--- </pre>
--- 
--- <p>Since the filter element is new in the servlet 2.3 specification, you might
--- need to modify the DOCTYPE declaration in the web.xml:</p>
--- 
--- <pre>&lt;!DOCTYPE web-app
---     PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
---     "http://java.sun.com/dtd/web-app_2_3.dtd"&gt;
--- </pre>
--- 
--- <p>Of course, when using a servlet filter to set the encoding, you should not
--- supply the form-encoding init parameter anymore in the web.xml. You could still
--- supply the container-encoding parameter, though its value will now have to be
--- the same as the encoding supplied to the filter. This will allow you to override
--- the form-encoding using the SetCharacterEncodingAction, though only for the
--- Cocoon Request object.</p>
--- 
--- <p>Using a servlet filter also has the advantage that it will work for any
--- servlet. Suppose your webapp consists of multiple servlets, with Cocoon being
--- only one of them. Sometimes the processing could start in another servlet (which
--- sets the character encoding correctly) and then be forwarded to Cocoon, while
--- other times the processing could start immediately in the Cocoon servlet. It
--- would then be impossible to know in Cocoon whether the request parameter
--- encoding needs to be corrected or not.</p>
--- 
--- <h3 id="head-c284f762c307023b1f26eb18589fcfcd71c196e1">Operating System
--- Preliminaries</h3>
--- 
--- <p>Working with non-english characters may also pose problems depending on the
--- operations system settings, as the JVM seems to detect the system language. If,
--- e.g., german umlauts should be correctly processed with Cocoon on Linux, it is
--- required to set the LANG environment variable to de like this:</p>
--- 
    <p><tt>export LANG=de</tt></p>
    
--- <p><em>The remark in this last paragraph won't have any influence on request
--- parameter decoding, though it might help for other things (reading text files,
--- communication with database, ...)</em> --
--- <a href="http://wiki.apache.org/cocoon/BrunoDumon">BrunoDumon</a></p>
--- 
--- <p>(That's one of several ways of setting the JVM locale, see also
--- <a href="http://wiki.apache.org/cocoon/SettingTheJvmLocale">SettingTheJvmLocale</a>).
+++ <p>That's one of several ways of setting the JVM locale, see also
+++ <a href="http://wiki.apache.org/cocoon/SettingTheJvmLocale">SettingTheJvmLocale</a>.
    </p>
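
To check which defaults the JVM actually picked up from the operating system, a small diagnostic along these lines may help. This is only a sketch with a hypothetical class name, JvmDefaults; the values it prints depend entirely on the environment in which the JVM was started.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical diagnostic class; prints the JVM-wide defaults.
final class JvmDefaults {

    // Summarises the default locale and default charset of the running JVM.
    static String describe() {
        return "locale=" + Locale.getDefault()
                + ", charset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once before and once after changing LANG shows whether the setting reaches the JVM at all.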
    
--- <p><em>Just came across this today:
--- <a href="http://www.w3.org/TR/REC-html40/interact/forms.html#adef-accept-charset"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
--- http://www.w3.org/TR/REC-html40/interact/forms.html#adef-accept-charset</a>
--- This looks like related stuff we should investigate. (and test the current
--- browsers for)</em> --
--- <a href="http://wiki.apache.org/cocoon/MarcPortier">MarcPortier</a></p>
+++ <h3>More readings</h3>
    
--- <h3 id="head-c34db4c0500be7b68ebc1b5b192295649a3ee14d">More readings</h3>
--- 
    <ul>
    <li>
    <p>
(5 equal lines skipped)
    <a href="http://marc.theaimsgroup.com/?l=xml-cocoon-dev&amp;m=106772461923197&amp;w=2"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
    This</a> is a good summary of the thread.</p>
    </li>
+++ <li>Cocoon does not support the HTTP request header
+++ <a href="http://www.w3.org/TR/REC-html40/interact/forms.html#adef-accept-charset">Accept-Charset</a>,
+++ where the browser specifies a list of encodings it can handle. This might be
+++ useful to implement.</li>
    </ul>
    
--- <h3 id="head-78c28b1a6fbe166e5ede513c661c0949d801a1e5">What about file names for
--- uploaded files?</h3>
--- 
--- <p>I saw some reports on users list about file names being <em>wrongly</em>
--- encoded, i.e. though everything else on a form works, the file name of an
--- uploaded file is wrong when <em>special</em> characters were used. I never
--- tested it though.
--- (<a href="http://wiki.apache.org/cocoon/JoergHeinicke">JoergHeinicke</a>)<br/>
--- The
--- <a href="http://issues.apache.org/bugzilla/show_bug.cgi?id=24289"><img width="11" height="11" src="http://wiki.apache.org/wiki/modern/img/moin-www.png"/>
--- Bug 24289</a> - "MultipartParser cannot handle multibyte character in uploaded
--- file name correctly" might explain it.
--- (<a href="http://wiki.apache.org/cocoon/JoergHeinicke">JoergHeinicke</a>)</p>
--- 
    </body>
    </html>

