cocoon-users mailing list archives

From Tobia Conforto
Subject Parsing HTML entities
Date Fri, 31 Aug 2007 13:24:58 GMT

My pipeline receives SAX text nodes from a data source, and those nodes
contain escaped HTML entities and <br> tags.  As a Java string literal:

"Lorem ipsum &mdash; dolor sit amet. <br> Consectetuer"

or, in XML syntax:

Lorem ipsum &amp;mdash; dolor sit amet. &lt;br&gt; Consectetuer

As you can see, the entities and <br> tags are escaped and part of the
text node.

I cannot change this data source component, therefore I need a
transformer to examine every text node in the stream, split it at the
fake "<br>" tags, substitute them with <xhtml:br/> elements, and
replace every escaped entity with the relevant Unicode character.
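
To make the splitting step concrete, here is a minimal sketch of the kind of
filter I have in mind.  It is built on the JDK's XMLFilterImpl rather than on
a Cocoon base class, so the class name BrSplittingFilter is hypothetical and
it would still need wrapping as a real transformer.  It assumes that each
text node arrives in a single characters() call (SAX does not guarantee
this), that the fake tag is always spelled exactly "<br>", and that the xhtml
prefix is already declared upstream.  Entity decoding is left out here.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical SAX filter: turns literal "<br>" text into xhtml:br elements. */
class BrSplittingFilter extends XMLFilterImpl {
    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    private static final String BR = "<br>";

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Assumption: the whole text node is delivered in this one call.
        String text = new String(ch, start, length);
        int from = 0;
        int at;
        while ((at = text.indexOf(BR, from)) >= 0) {
            emitText(text, from, at);
            // Replace the fake tag with a real, empty element.
            super.startElement(XHTML_NS, "br", "xhtml:br", new AttributesImpl());
            super.endElement(XHTML_NS, "br", "xhtml:br");
            from = at + BR.length();
        }
        emitText(text, from, text.length());
    }

    /** Forwards text[from, to) downstream, skipping empty chunks. */
    private void emitText(String text, int from, int to) throws SAXException {
        if (to > from) {
            char[] chunk = text.substring(from, to).toCharArray();
            super.characters(chunk, 0, chunk.length);
        }
    }
}
```

A production version would probably have to buffer consecutive characters()
events so that a "<br>" split across two calls is still recognised.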

I tried doing it with the Parser transformer, but it's too slow.

I tried using the HTML transformer, but I couldn't get it to work.

My question is: what do you suggest I use on the Java side?

Is there anything like PHP's html_entity_decode() available somewhere
in a library that Cocoon is already using, that can parse and convert
HTML 4.0 entities in a single pass over the string?
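
For reference, this is roughly the behaviour I am after, written as a small
hand-rolled sketch.  The class HtmlEntityDecoder, its decode() method and the
MAX_BODY bound are hypothetical names of mine, not an existing library API,
and the table holds only a handful of entities for illustration; a real
version would need the full HTML 4.0 list.  Unknown or malformed references
are passed through unchanged.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical decoder: one left-to-right scan, tiny illustrative table. */
class HtmlEntityDecoder {
    // Assumed upper bound on the characters between '&' and ';'.
    private static final int MAX_BODY = 10;
    private static final Map<String, Character> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", '&');
        NAMED.put("lt", '<');
        NAMED.put("gt", '>');
        NAMED.put("quot", '"');
        NAMED.put("nbsp", '\u00A0');
        NAMED.put("mdash", '\u2014');
    }

    static String decode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            String decoded = null;
            int semi = -1;
            if (c == '&') {
                // Look for the closing ';' within a short, fixed window.
                int limit = Math.min(s.length(), i + MAX_BODY + 2);
                for (int j = i + 1; j < limit; j++) {
                    if (s.charAt(j) == ';') { semi = j; break; }
                }
                if (semi > i + 1) {
                    decoded = lookup(s.substring(i + 1, semi));
                }
            }
            if (decoded != null) {
                out.append(decoded);
                i = semi + 1;
            } else {
                out.append(c);  // ordinary character or unrecognised reference
                i++;
            }
        }
        return out.toString();
    }

    /** Resolves "name", "#123" or "#x7B"; returns null if unrecognised. */
    private static String lookup(String body) {
        if (body.charAt(0) == '#') {
            try {
                boolean hex = body.length() > 1
                        && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
                int cp = hex ? Integer.parseInt(body.substring(2), 16)
                             : Integer.parseInt(body.substring(1));
                return Character.isValidCodePoint(cp)
                        ? new String(Character.toChars(cp)) : null;
            } catch (NumberFormatException e) {
                return null;
            }
        }
        Character ch = NAMED.get(body);
        return ch == null ? null : String.valueOf(ch);
    }
}
```

With this sketch, HtmlEntityDecoder.decode("Lorem ipsum &mdash; dolor")
would return the same text with U+2014 in place of the entity.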

