cocoon-users mailing list archives

From Upayavira ...@upaya.co.uk>
Subject Re: [i18n] Replacing entities with encoded chars
Date Sun, 18 Apr 2004 07:45:19 GMT
Antonio Gallardo wrote:

>Hi:
>
>Joerg's answer is OK.
>I have just added some additional info about the topic.
>  
>
Thanks for this Antonio. It all helps to make me a slightly less 
ignorant Englishman ;-)

But my question is: "I have a file that contains entity references. I 
want to replace them with direct characters, e.g. in UTF-8. How do I do 
this?" That is, this question really has nothing to do with Cocoon 
specifically. I want to change the format of my source files.
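
One way to do such a conversion outside Cocoon, as a minimal sketch rather than a finished tool: read the file, expand the numeric character references with a regular expression, and write the result back as UTF-8. The function names (expand_numeric_refs, convert_file) and the ISO-8859-1 default for the source encoding are assumptions made for illustration. References that stand for XML markup characters are deliberately left escaped, so the file stays well-formed.

```python
import re

# Characters that must stay escaped, or the XML would stop being well-formed.
KEEP_ESCAPED = {"&", "<", ">", '"', "'"}

# Decimal (&#281;) and hexadecimal (&#x119;) character references.
NUMERIC_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def expand_numeric_refs(text):
    """Replace numeric character references with the characters they name."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            char = chr(code_point)
        except (ValueError, OverflowError):
            return match.group(0)  # not a valid code point: leave untouched
        if char in KEEP_ESCAPED:
            return match.group(0)  # e.g. &#38; must remain a reference
        return char

    return NUMERIC_REF.sub(replace, text)


def convert_file(src_path, dst_path, src_encoding="iso-8859-1"):
    """Read src_path in src_encoding and write it to dst_path as UTF-8."""
    with open(src_path, encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(expand_numeric_refs(text))
```

If the file carries an XML declaration with an encoding attribute, that attribute has to be changed to UTF-8 as well (or removed); the sketch does not do this.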

Regards, Upayavira

>Joerg Heinicke dijo:
>  
>
>>On 17.04.2004 22:33, Upayavira wrote:
>>
>>    
>>
>>>A few I18N questions:
>>>
>>>1) I have some Polish text that uses character entities, such as
>>>"si&#281;". How can I translate this into a single or double byte
>>>character in either ISO-8859-1, ISO-8859-2 or UTF-8?
>>>      
>>>
>
>You can avoid the &#xxx; form by using UTF-8. UTF-8 allows you to write the
>character itself directly in the file. For example, in Spanish:
>
>I really don't know the &#xxx; syntax for the following chars:
>
>á, ñ, Ñ, é, ö, ü, etc.
>
>I just write them directly in the file (as above). It is easier for me,
>and it works because I use UTF-8 in the XML files. This is the gain.
>
>  
>
>>They are not translated. While you have the entity representation in the
>>XML files, you have characters in Java. Only the serializer decides
>>whether it writes them out as characters or as character entities. In
>>general this can't be influenced, but one serializer or another might have
>>configuration options for this. But in the end (i.e. in the browser or
>>wherever) it should work for both the entity and the character as they
>>represent the same "thing".
>>    
>>
>
>Yep. I recommend using UTF-8 whenever possible:
>http://marc.theaimsgroup.com/?l=xml-cocoon-users&m=106142759328759&w=2
>
>  
>
>>>2) I can set the encoding of a page in the serialiser configuration. How
>>>do I deal with the situation where the best encoding depends upon the
>>>language, which means that the encoding should be chosen based upon the
>>>encoding of a source file?
>>>      
>>>
>
>Again, try to use UTF-8. It is the best bet.
>
>  
>
>>That's not possible. As written above you have more or less
>>encoding-neutral characters in Java (obviously not completely as
>>somewhere in the memory they are also just bytes).
>>    
>>
>
>Yep, although Java uses UTF-16, not UTF-8, as the internal representation
>of Strings.
>
>  
>
>>But at least they are
>>independent of the encoding of the original file. You do not know which
>>encoding the XML file was in.
>>    
>>
>
>Yes, the parser makes the conversion for you. The parser reads the @encoding
>in:
>
><?xml version="1.0" encoding="XXX"?>
>
>where XXX is the encoding of the file.
>
>***************************************************************
>Note: From the XML specs, if you omit the @encoding, the default encoding
>is UTF-8 (or UTF-16 if the file starts with a byte order mark). Example:
>
><?xml version="1.0"?>
>***************************************************************
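
A small illustration of the point about the declaration, as a sketch only: Python's ElementTree parser stands in here for the Java parser Cocoon uses, and the sample element is invented. The same text is stored once as ISO-8859-1 with a declaration and once as UTF-8 without one. The bytes on disk differ, but the parsed characters are identical.

```python
import xml.etree.ElementTree as ET

body = "<saludo>mañana</saludo>"  # hypothetical sample document

# Same text, two different byte representations.
latin1_doc = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode("iso-8859-1")
utf8_doc = ('<?xml version="1.0"?>' + body).encode("utf-8")  # no @encoding: UTF-8

# The parser reads the declaration and decodes the bytes accordingly.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text
```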
>
>You also need to be aware that writing the @encoding in the XML header is
>not enough. It is just a declaration. You need to make sure that your
>editor (or the Operating System default it relies on) really uses that
>encoding when saving the file to disk. For this purpose I prefer to use
>jEdit - http://www.jEdit.org/ - which always tells me the encoding used to
>read/write the file.
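
What happens when the declaration and the bytes on disk disagree can be sketched as follows. Again Python's ElementTree is only a stand-in for the real parser, and parse_text is a hypothetical helper. One mismatch is rejected by the parser, the other is accepted silently and produces wrong characters, which is the more dangerous case.

```python
import xml.etree.ElementTree as ET

body = "<saludo>mañana</saludo>"  # hypothetical sample document
decl_utf8 = '<?xml version="1.0" encoding="UTF-8"?>'
decl_latin1 = '<?xml version="1.0" encoding="ISO-8859-1"?>'


def parse_text(doc_bytes):
    """Return the root element's text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(doc_bytes).text
    except ET.ParseError:
        return None


# Declaration and bytes agree.
good = (decl_utf8 + body).encode("utf-8")
# Declared UTF-8, but the editor saved ISO-8859-1: invalid UTF-8, parse error.
rejected = (decl_utf8 + body).encode("iso-8859-1")
# Declared ISO-8859-1, but the editor saved UTF-8: parses, with wrong characters.
garbled = (decl_latin1 + body).encode("utf-8")
```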
>
>Of course there are other editors that allow you to define the encoding.
>While beginning with Cocoon, I really had nightmares, because the transition
>to UTF-8 coincided with my first steps, and what worked fine in RedHat 7.3
>was not OK in RedHat 8. And we changed the OS during development. The
>answer was that RH8 uses UTF-8 as default while RH 7.3 does not. The world
>is moving to UTF-8 and we need to try to use it everywhere.
>
>                                 -0-
>
>I believe that keeping the whole processing pipeline in the same encoding
>avoids problems and is more efficient, since the system doesn't need to
>make conversions between encodings that can end in undesired string
>representations.
>
>For example:
>XML in ISO-8859-1
>Serialize in ISO-8859-1
>
>In fact we have 2 conversions there:
>
>ISO-8859-1 -> UTF-16 while loading in Java
>UTF-16 -> ISO-8859-1 while serializing from Java
>
>Keep in mind that Java always uses Unicode internally. (Here I need to
>explain a little more: in memory, Java represents Strings as UTF-16. That
>means 2 bytes for each char in memory.)
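
The two conversions can be made visible with a few lines. This is a sketch in Python, not Java: Python's str plays the role of the in-memory Java String here, and the UTF-16 encoding step only shows how many bytes a UTF-16 representation of the same text would need.

```python
# Bytes as they would sit in an ISO-8859-1 source file.
source_bytes = "mañana".encode("iso-8859-1")

# Conversion 1: bytes -> in-memory characters (done by the parser).
in_memory = source_bytes.decode("iso-8859-1")

# Conversion 2: in-memory characters -> bytes (done by the serializer).
output_bytes = in_memory.encode("iso-8859-1")

# Size of the same six characters in different encodings.
size_latin1 = len(source_bytes)
size_utf8 = len(in_memory.encode("utf-8"))
size_utf16 = len(in_memory.encode("utf-16-be"))  # 2 bytes per char, no BOM
```

For these characters the round trip is lossless. It stops being lossless as soon as the serializer's encoding cannot represent a character that the source contained.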
>
>  
>
>>You have to decide the serializer's
>>encoding only based on the possible character range. If the characters are
>>spread across several ISO char sets, better use UTF-8 in general. Another
>>option would be to use a selector based on the user's locale which chooses
>>the serializer (with a specific encoding).
>>    
>>
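
Choosing an encoding from the character range can be sketched like this. The helper pick_encoding and its candidate list are assumptions for illustration, not anything Cocoon provides: it tries each single-byte encoding in turn and falls back to UTF-8 when none of them can represent every character.

```python
def pick_encoding(text, candidates=("iso-8859-1", "iso-8859-2")):
    """Return the first candidate that can encode all of text, else 'utf-8'."""
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # some character is outside this character set
        return name
    return "utf-8"


spanish = pick_encoding("mañana")    # fits ISO-8859-1
polish = pick_encoding("się")        # needs ISO-8859-2
mixed = pick_encoding("się mañana")  # fits neither single-byte set
```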
>
>Another issue you need to keep in mind is that Cocoon is a servlet and the
>servlet container (Tomcat, Jetty) has the "last word". You need to
>"synchronize" it too. In particular, you will find 2 params in your
>web.xml:
>
><!--
>  Set encoding used by the container. If not set the ISO-8859-1 encoding
>  will be assumed.
>-->
>
><init-param>
>  <param-name>container-encoding</param-name>
>  <param-value>utf-8</param-value>
></init-param>
>
><!--
>  Set form encoding. This will be the character set used to decode request
>  parameters. If not set the ISO-8859-1 encoding will be assumed.
>-->
>
><init-param>
>  <param-name>form-encoding</param-name>
>  <param-value>utf-8</param-value>
></init-param>
>
>You will run into problems if Cocoon serializes ISO-8859-1 and your servlet
>container uses UTF-8. The same issue can show up even in httpd servers when
>pages are saved on disk using ISO-8859-1 and your httpd server is set to
>use UTF-8.
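
The effect of a mismatched form encoding can be reproduced outside the container. This is only a sketch: Python's urllib.parse stands in for the browser and the servlet container, and the parameter value is invented. A browser submits a form value from a UTF-8 page, and the receiving side decodes it once with the matching charset and once with ISO-8859-1.

```python
from urllib.parse import quote, unquote

# What a browser sends for "mañana" when the page is UTF-8.
submitted = quote("mañana", encoding="utf-8")

# Receiving side configured with the same encoding: the value survives.
decoded_ok = unquote(submitted, encoding="utf-8")

# Receiving side left at ISO-8859-1: each UTF-8 byte becomes its own character.
decoded_wrong = unquote(submitted, encoding="iso-8859-1")
```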
>
>  
>
>>>Thanks for your help!
>>>      
>>>
>
>Me too.
>
>  
>
>>Those i18n ignorant English men! ;-)
>>    
>>
>
>lol. Not English man, but still, I am! :-DD
>
>Best Regards,
>
>Antonio Gallardo
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
>For additional commands, e-mail: users-help@cocoon.apache.org
>
>
>  
>


