forrest-dev mailing list archives

From Sjur Nørstebø Moshagen <>
Subject Re: Forrest and UTF-8
Date Tue, 11 May 2004 13:52:26 GMT
On 11 May 2004 at 16:20, Upayavira wrote:

> Sjur Nørstebø Moshagen wrote:
>> - Xalan, when serializing to HTML, will render all characters defined 
>> as entities in the HTML spec as entities (= ASCII) (the defined 
>> entities cover most of the non-ASCII part of the 8859-series, as well 
>> as other characters),
>> UTF-8 should be no problem for most of the browsers, even old ones. 
>> AND UTF-8 solves a lot of _other_ encoding problems in a multilingual 
>> world, of which many are just as problematic for old browsers as 
>> UTF-8.
>> I first perceived the Xalan behaviour as buggy, generating 
>> unnecessarily large files in a UTF-8 setting (entities use more space 
>> than a multibyte UTF-8 character), but considering backwards 
>> compatibility the behaviour is actually not so bad.
>> To sum up:
>> +1 - UTF-8 should be default, with alternative encodings available as 
>> an option.
> But, if Xalan does as you say, does the encoding make much difference?

Yes, it does. First of all, Xalan *only* converts to entities those 
characters that are listed as entities in the HTML spec, which means 
that most of Unicode will *not* be covered. Secondly, even if Xalan 
converted everything to entities (e.g. numeric ones), that would not be 
very desirable, since each character would then occupy several bytes 
more than if served as a UTF-8 multibyte sequence. Take the Polish ł 
(l with stroke, no entity defined): as UTF-8 it takes two bytes 
(0xC5 0x82), whereas the numeric entity takes 6 bytes (&#322;) in 
decimal notation and 8 bytes (&#x0142;) in hexadecimal notation. If 
you have a document with many such characters, its size can increase 
considerably. The raw HTML source also becomes close to unreadable.
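
The byte counts above can be checked with a small sketch. This is 
plain Python, not anything from Forrest or Xalan; the function name 
encoded_sizes is made up for illustration, and Python's built-in 
HTML 4 entity table is assumed to stand in for "the entities listed 
in the HTML spec":

```python
from html.entities import codepoint2name  # HTML 4 named entities


def encoded_sizes(ch: str) -> dict:
    """Compare the size of one character as UTF-8 vs. as an entity.

    Hypothetical helper for illustration only; it does not reproduce
    Xalan's serializer, just the arithmetic from the message.
    """
    code_point = ord(ch)
    return {
        # Name of the HTML 4 entity, or None if the spec defines none.
        "named_entity": codepoint2name.get(code_point),
        # Number of bytes when the character is served as raw UTF-8.
        "utf8": len(ch.encode("utf-8")),
        # Entities are pure ASCII, so characters == bytes.
        "decimal_entity": len(f"&#{code_point};"),
        "hex_entity": len(f"&#x{code_point:04X};"),
    }


l_stroke = encoded_sizes("ł")  # U+0142, no named entity in HTML 4
e_acute = encoded_sizes("é")   # U+00E9, has the named entity "eacute"
```

For ł this gives 2 bytes as UTF-8 against 6 and 8 bytes for the 
decimal and hexadecimal entities, and no named entity, so a serializer 
that only knows the named entities would have to emit it some other 
way.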

A further argument is that the conversion takes extra processing time, 
which is wasted work if you want UTF-8 from the source text all the way 
through to the browser.

All in all, for some of us entities are OK as a temporary 
backwards-compatibility measure, but pure UTF-8 will in many cases be 
the best solution, and should be the general goal.


PS. The complete set of defined entities can be found here:
