cocoon-users mailing list archives

From Andrew Stevens <>
Subject RE: Parsing HTML entities
Date Fri, 31 Aug 2007 15:04:59 GMT

> From:
> Date: Fri, 31 Aug 2007 14:06:59 +0000
>
> Tobia Conforto writes:
>> I have a data source from which I get SAX text nodes into my pipeline
>> that contain escaped HTML entities and <br> tags. In Java syntax:
>> "Lorem ipsum &mdash; dolor sit amet. <br> Consectetuer"
>> or, in XML syntax:
>> Lorem ipsum &amp;mdash; dolor sit amet. &lt;br&gt; Consectetuer
>> As you can see, the entities and <br> tags are escaped and part of the
>> text node.
>> I cannot change this data source component, therefore I need a
>> transformer to examine every text node in the stream, split it at the
>> fake "<br>" tags, substitute them with real <br/> elements, and
>> replace every escaped entity with the relevant Unicode character.
> That's one of the rare cases where I consider disable-output-escaping="yes"
> a valid approach [1]. I don't know if there is
> something comparable directly on the Java side.
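
For reference, that approach amounts to something like the following (an untested sketch, not taken from the original mail):

```xml
<!-- Sketch only: copy each text node through with output escaping
     disabled, so "&amp;mdash;" in the input would be serialized
     as "&mdash;" rather than re-escaped -->
<xsl:template match="text()">
  <xsl:value-of select="." disable-output-escaping="yes"/>
</xsl:template>
```

Bear in mind that disable-output-escaping is a serialization-time feature; in a SAX pipeline where no serializer is involved at that stage it may simply be ignored, which is part of why it's fragile here.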

Unless I'm mistaken, doing that on his example would result in an invalid
document as there's no matching </br> element...?  It would be okay if it
can be guaranteed that the included text is nice well-formed XHTML, but if
it's plain old HTML then it sounds to me more like a job for the jtidy or
neko-based HTML transformers.
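
In sitemap terms that's roughly the following (untested sketch; the match pattern is made up, and the transformer type name depends on which block you have installed and how it's declared in your sitemap):

```xml
<!-- Sketch: run the generated XML through the JTidy/Neko-based HTML
     transformer to turn the escaped tag soup into well-formed markup -->
<map:match pattern="cleaned/**">
  <map:generate src="cocoon:/raw/{1}"/>
  <map:transform type="html"/>
  <map:serialize type="xml"/>
</map:match>
```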

We have something similar in our application; I arrange the early part of the 
pipeline so that the escaped HTML appears within a unique element, e.g.

<some_escaped_html>Lorem ipsum &lt;br&gt; dolor</some_escaped_html>
pass it through the html transformer, and follow that with a small xsl
transformation to strip out the some_escaped_html elements (and the html &
body elements that JTidy inserts), plus the usual "passthrough" templates
for all other nodes.
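
That xsl step would look something like this (an untested sketch, using the standard identity transform for the passthrough part):

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Drop the wrapper element, plus the html and body elements
       JTidy inserts, but keep processing their children -->
  <xsl:template match="some_escaped_html | html | body">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Passthrough (identity) templates for all other nodes -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```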

Net result, the same SAX stream but with the HTML unescaped and cleaned
up so it's well-formed again.

