cocoon-dev mailing list archives

From Paul Russell <>
Subject Re: External Entities & bundles
Date Thu, 18 May 2000 11:39:59 GMT
On Thu, May 18, 2000 at 06:19:33AM -0500, Mike Engelhart wrote:
> on 5/18/00 3:13 AM, Paul Russell at wrote:
> No, I haven't gone that far yet.  Is that what you're doing? It seems like a
> lot of extra overhead for something that should be working to begin with.
> I mean, parsing every single word on a page to look for ampersands seems
> like a lot of extra work for a busy site??

What I've been doing is at the other end (the output side) and
is basically this in reverse: taking Unicode characters and
escaping them into entities. In my case it isn't much of an
overhead, as we already have to think down to the character level
(and I ignore anything with a code of <128, because those don't
need escaping anyway). If you were doing it the other way
around, again it's not too bad, because you can do an indexOf("&")
(which is fast) on the incoming resources to find the entities.
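
A minimal sketch of the output-side escaping described above, in modern Java. The class and method names are hypothetical, and emitting numeric character references (rather than named entities) is an assumption made to keep the example short:

```java
final class EntityEscaper {

    /**
     * Replaces every character with a code of 128 or above by a numeric
     * character reference; characters below 128 are copied unchanged.
     */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (codePoint < 128) {
                out.append((char) codePoint);      // plain ASCII needs no escaping
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += Character.charCount(codePoint);   // step over surrogate pairs as one unit
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("sch\u00F6n"));
    }
}
```

Note that this deliberately leaves '&' and '<' alone; escaping markup characters is assumed to be the serializer's job.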

The thing is that this *shouldn't* be 'working' as you put it.
If you put something into a resource bundle, it is text, *not*
XML. The fact that you happen to bring it into an XML event
stream using XSP is academic.

All XML event streams in Java are Unicode (as they should be).
This means that as soon as the XML file is parsed into a SAX
stream, the entities are replaced by their Unicode character
alter-egos. So when you bring the ResourceBundle in (which is
already encoded in some way - *not* using character entities),
you are bringing in the '&','o','u','m','l' and ';' Unicode
characters, *not* an entity itself. Therefore, when the XML
stream is serialized to HTML, it becomes "&amp;ouml;".
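
The effect can be reproduced outside Cocoon with the standard JAXP classes. This is only an illustration under assumptions: it uses a DOM text node and the default Transformer serializer rather than an XSP/SAX pipeline, and the class name and the <p> element are hypothetical:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class TextIsNotMarkup {

    /** Puts a plain string into a document as text and serializes the result. */
    static String serialize(String bundleValue) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element p = doc.createElement("p");
            // The value enters the tree as character data, so its '&' is just a character.
            p.appendChild(doc.createTextNode(bundleValue));
            doc.appendChild(p);

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        // The literal '&' in the text is written out as "&amp;".
        System.out.println(serialize("sch&ouml;n"));
    }
}
```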

The only way 'around' this (although it is the correct
behaviour) is to *somehow* get the character you're after
into Unicode. There are two ways of doing that: firstly, by putting
the character into whatever encoding the ResourceBundle is
using, or secondly, by adding another layer of encoding (such
as HTML character entities) over the top and then running
a decoding algorithm on the incoming resources.
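
A rough sketch of the second option, decoding entities in the incoming resource strings before they reach the XML stream. The class name and the tiny entity table are hypothetical; a real table would have to cover whichever entities the bundles actually use:

```java
import java.util.Map;

final class EntityDecoder {

    // Hypothetical, deliberately small table of named entities.
    private static final Map<String, Character> ENTITIES = Map.of(
            "ouml", '\u00F6',
            "auml", '\u00E4',
            "uuml", '\u00FC',
            "amp", '&');

    /** Replaces known named entities in a resource string by their characters. */
    static String decode(String text) {
        int amp = text.indexOf('&');
        if (amp < 0) {
            return text;                            // fast path: nothing to decode
        }
        StringBuilder out = new StringBuilder(text.length());
        int pos = 0;                                // start of the text not yet copied
        while (amp >= 0) {
            int semi = text.indexOf(';', amp + 1);
            if (semi < 0) {
                break;                              // no terminator left, so no more entities
            }
            Character ch = ENTITIES.get(text.substring(amp + 1, semi));
            if (ch != null) {
                out.append(text, pos, amp).append(ch.charValue());
                pos = semi + 1;
                amp = text.indexOf('&', pos);
            } else {
                amp = text.indexOf('&', amp + 1);   // unknown name: leave it untouched
            }
        }
        return out.append(text, pos, text.length()).toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("sch&ouml;n"));
    }
}
```

For the first option no decoding step is needed: a .properties file loaded through the standard Java loaders accepts \uXXXX escapes, so the value can be written as sch\u00F6n in the bundle itself.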

Paul Russell                               <>
Technical Director,         
Luminas Ltd.
