forrest-dev mailing list archives

From Dave Brondsema <d...@brondsema.net>
Subject Re: Forrest and UTF-8
Date Tue, 11 May 2004 13:58:07 GMT
On Tue, 11 May 2004, Sjur Nørstebø Moshagen wrote:

> On 11 May 2004 at 16:20, Upayavira wrote:
>
> > Sjur Nørstebø Moshagen wrote:
> >
> <snip/>
> >> - Xalan, when serializing to HTML, will render all characters defined
> >> as entities in the HTML spec as entities (= ASCII) (the defined
> >> entities cover most of the non-ASCII part of the 8859-series, as well
> >> as other characters),
> >>
> >> UTF-8 should be no problem for most of the browsers, even old ones.
> >> AND UTF-8 solves a lot of _other_ encoding problems in a multilingual
> >> world, of which many are just as problematic for old browsers as
> >> UTF-8.
> >>
> >> I first perceived the Xalan behaviour as buggy, generating
> >> unnecessarily large files in a UTF-8 setting (entities use more space
> >> than a multibyte UTF-8 character), but considering backwards
> >> compatibility the behaviour is actually not so bad.
> >>
> >> To sum up:
> >> +1 - UTF-8 should be default, with alternative encodings available as
> >> an option.
> >
> > But, if Xalan does as you say, does the encoding make much difference?
>
> Yes, it does. First of all, Xalan *only* converts those characters to
> entities that are listed in the HTML specs, which means that most of
> Unicode will *not* be covered. Secondly, even if Xalan converted
> everything to entities (e.g. numeric ones), that would not be very
> desirable, since each character would then occupy several bytes more
> than if served as a UTF-8 multibyte sequence: take the Polish ł (l with
> stroke, no entity defined), which as UTF-8 takes two bytes (0xC5 0x82),
> whereas the numeric entity takes 6 bytes (&#322;) in decimal notation,
> and 8 bytes (&#x0142;) in hexadecimal notation. If you have a document
> with many
> such characters, the size of the document can increase considerably.
> Also the readability of the raw HTML code is close to zero.
>
> A further argument is that the conversion takes extra processing time,
> and is unnecessary if you want UTF-8 from source text all the way
> through to the browser. That is, unnecessary work.
>
> All in all, for some of us entities are OK as a temporary solution and
> a backwards compatibility measure, but pure UTF-8 will in many cases
> be the best, and should be the general goal.
>
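
The byte counts quoted above are easy to check. The following is only a
minimal sketch in Python (not anything Forrest or Xalan runs); the variable
names are made up for illustration, and the named-entity lookup uses
Python's HTML 4 entity table as a stand-in for "entities listed in the HTML
specs", which is an assumption about what Xalan consults.

```python
from html.entities import codepoint2name  # HTML 4 named entities

ch = "\u0142"  # Polish l with stroke (ł)

# Size as a raw UTF-8 multibyte sequence.
utf8_bytes = ch.encode("utf-8")

# The same character as decimal and hexadecimal numeric references.
decimal_ref = "&#{};".format(ord(ch))
hex_ref = "&#x{:04X};".format(ord(ch))

sizes = {
    "utf-8": len(utf8_bytes),
    "decimal": len(decimal_ref.encode("ascii")),
    "hex": len(hex_ref.encode("ascii")),
}

# Whether a named entity exists in the HTML 4 table (assumed stand-in).
has_named_entity = {
    "\u00f8": ord("\u00f8") in codepoint2name,  # ø
    ch: ord(ch) in codepoint2name,              # ł
}

print(utf8_bytes.hex(), decimal_ref, hex_ref, sizes, has_named_entity)
```

In this sketch ø has a named entity while ł does not, so a serializer that
only emits named entities would leave ł to the output encoding.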

How about we make it configurable with UTF-8 the default?

-- 
Dave Brondsema : dave@brondsema.net
http://www.brondsema.net : personal
http://www.splike.com : programming
http://csx.calvin.edu : student org
