cocoon-users mailing list archives

From Tobia Conforto
Subject Re: Parsing HTML entities
Date Mon, 17 Sep 2007 11:33:52 GMT
Andrew Stevens wrote:
> Tobia Conforto writes:
> > I cannot change this data source component, therefore I need a
> > transformer to examine every text node in the stream, split it at the
> > fake "<br>" tags, substitute them with <xhtml:br/> elements, and
> > replace every escaped HTML entity with the relevant Unicode character.
> We have something similar in our application; I arrange the early part
> of the pipeline so that the escaped HTML appears within a unique
> element e.g.
>   <some_escaped_html>Lorem ipsum &lt;br&gt; dolor</some_escaped_html>
> pass it through the html transformer
>   <map:transform type="html">
>     <map:parameter name="tags" value="some_escaped_html"/>
>   </map:transform>
> and follow that by a small xsl transformation to strip out the
> some_escaped_html elements and the html & body elements that JTidy
> inserts.
> Net result, the same SAX stream but with the HTML unescaped and
> cleaned up so it's well-formed again.

Thank you.
After extensive testing, it turns out this is the best method.

It works for any kind of malformed HTML and is efficient enough,
provided I put <some_escaped_html> tags only where they are needed.
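
For reference, the "small xsl transformation" mentioned above could look roughly like the sketch below. It is not the stylesheet either of us actually uses, only an illustration under a few assumptions: the marker element is named some_escaped_html and has no namespace, and the html and body elements that JTidy inserts end up nested inside it. They are matched by local name because I am not certain which namespace the html transformer puts them in.

```xml
<?xml version="1.0"?>
<!-- Hypothetical cleanup stylesheet: a sketch, not the original one. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything else through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Drop the marker element itself but keep its (now unescaped) content. -->
  <xsl:template match="some_escaped_html">
    <xsl:apply-templates select="node()"/>
  </xsl:template>

  <!-- Drop html/body wrappers, but only those inside the marker element
       (assumed nesting), so a real html/body elsewhere in the stream
       is left alone. Their children are kept. -->
  <xsl:template
      match="some_escaped_html//*[local-name()='html' or local-name()='body']">
    <xsl:apply-templates select="node()"/>
  </xsl:template>

</xsl:stylesheet>
```

It would go in the sitemap as an ordinary <map:transform src="..."/> step right after the html transformer. If the wrappers turn out to sit somewhere else in the stream, the last match pattern needs adjusting accordingly.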

