cocoon-users mailing list archives

From Upayavira
Subject Re: [i18n] Replacing entities with encoded chars
Date Sun, 18 Apr 2004 05:32:02 GMT
Joerg Heinicke wrote:

> On 17.04.2004 22:33, Upayavira wrote:
>> A few I18N questions:
>> 1) I have some Polish text that uses character entities, such as 
>> "się" How can I translate this into a single or double byte 
>> character in either ISO-8859-1, ISO-8859-2 or UTF-8?
> They are not translated. While you have the entity representation in 
> the XML files, you have characters in Java. Only the serializer 
> decides whether it writes them out as characters or as character 
> entities. In general this can't be influenced, but a particular 
> serializer might have configuration options for it. In the end (i.e. 
> in the browser or wherever) it should work for both the entity and 
> the character, as they represent the same "thing".
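
The point that the entity and the character become the same thing once parsed, and that only the output step picks a representation, can be shown outside Cocoon. The following is a minimal sketch using Python's standard library as an analogy; it is not Cocoon's serializer, and the `<p>` element is just a placeholder:

```python
import xml.etree.ElementTree as ET

# The same Polish word, once written with a numeric character reference
# and once with the literal character.
with_entity = ET.fromstring("<p>si&#281;</p>")
with_char = ET.fromstring("<p>się</p>")

# After parsing, both documents hold the identical in-memory string.
same_after_parsing = with_entity.text == with_char.text

# On output, the target encoding decides the representation: UTF-8 can
# carry the character directly, while ISO-8859-1 has no byte for it, so
# this serializer falls back to a character reference.
as_utf8 = ET.tostring(with_char, encoding="utf-8")
as_latin1 = ET.tostring(with_char, encoding="iso-8859-1")
```

Either output parses back to the same text, which is why the browser does not care which form it receives.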

Ah. But what if I want to convert the entities into characters as a 
one-off offline step (e.g. in a text editor or a Perl script)?
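
A one-off conversion of this kind could look roughly like the sketch below. It is written in Python rather than Perl, handles only numeric character references (not named entities declared in a DTD), and the names `expand_char_refs` and `convert_file` are made up for illustration. References to characters that XML needs escaped are deliberately left alone so the file stays well-formed:

```python
import re

# Characters that must stay escaped in XML text and attribute values.
KEEP_ESCAPED = {"<", ">", "&", '"', "'"}

# Matches decimal (&#281;) and hexadecimal (&#x119;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def expand_char_refs(text: str) -> str:
    """Replace numeric character references with the characters they name."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            char = chr(code_point)
        except (ValueError, OverflowError):
            return match.group(0)  # not a valid code point: leave untouched
        return match.group(0) if char in KEEP_ESCAPED else char
    return CHAR_REF.sub(replace, text)


def convert_file(src: str, dst: str, src_encoding: str = "iso-8859-1") -> None:
    """Read src in src_encoding, expand references, write dst as UTF-8."""
    with open(src, encoding=src_encoding) as f:
        converted = expand_char_refs(f.read())
    with open(dst, "w", encoding="utf-8") as f:
        f.write(converted)
```

If the file carries an XML declaration with an `encoding` attribute, that attribute would also need to be changed to match the new output encoding; the sketch does not do this.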

>> 2) I can set the encoding of a page in the serialiser configuration. 
>> How do I deal with the situation where the best encoding depends upon 
>> the language, which means that the encoding should be chosen based 
>> upon the encoding of a source file?
> That's not possible. As written above, you have more or less 
> encoding-neutral characters in Java (obviously not completely, as 
> somewhere in memory they are also just bytes). But at least they 
> are independent of the encoding of the original file. You do not know 
> which encoding the XML file was in. You have to choose the 
> serializer's encoding based only on the possible character range. If 
> the characters are spread across several ISO character sets, it is 
> better to use UTF-8 in general. Another option would be to use a 
> selector based on the user's locale, which chooses the serializer 
> (with a specific encoding).

So it sounds like UTF-8 is a good encoding to use. If I have multiple 
languages, it is best to aim for UTF-8 as the source encoding and 
serialize to that.

But if I have these characters as characters rather than entities, I 
could encode the source as ISO-8859-1 and serialize as UTF-8, and the 
necessary translation would happen?
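
The translation asked about here amounts to decoding with the source encoding and encoding again with the target. A small Python sketch of the idea follows, again only as an analogy to what a parser and serializer do. One caveat it demonstrates: the Polish "ę" (U+0119) is not part of ISO-8859-1, so a source file holding it as a literal character would have to be in ISO-8859-2 or UTF-8:

```python
word = "się"

# The in-memory string is encoding-neutral; bytes appear only on output.
latin2_bytes = word.encode("iso-8859-2")
utf8_bytes = word.encode("utf-8")

# Reading with the correct source encoding and writing with another one
# is the whole "translation".
transcoded = latin2_bytes.decode("iso-8859-2").encode("utf-8")

# ISO-8859-1 has no byte for "ę", so a strict encode fails.
try:
    word.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False
```

Text that does fit in ISO-8859-1 (most Western European languages) would go through the same decode and encode steps without trouble.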

> Those i18n-ignorant Englishmen! ;-)

Yup. But at least this one's willing to learn (at last!)

Regards, Upayavira
