From: Andrew Stevens
Subject: RE: Parsing HTML entities
Date: Fri, 31 Aug 2007 16:58:18 +0100
Reply-To: users@cocoon.apache.org

Oh, for crying out loud.
Even after switching to plain text, Hotmail still strips out my included XML :-(

Let's try again - replace the square brackets below with the appropriate
less-than and greater-than symbols.

> From: joerg.heinicke@gmx.de
> Date: Fri, 31 Aug 2007 14:06:59 +0000
>
> Tobia Conforto <tobia.conforto <at> linux.it> writes:
>
>> I have a data source from which I get SAX text nodes into my pipeline
>> that contain escaped HTML entities and tags. In Java syntax:
>>
>>   "Lorem ipsum &mdash; dolor sit amet. <br> Consectetuer"
>>
>> or, in XML syntax:
>>
>>   Lorem ipsum &amp;mdash; dolor sit amet. &lt;br&gt; Consectetuer
>>
>> As you can see, the entities and <br> tags are escaped and part of the
>> text node.
>>
>> I cannot change this data source component, therefore I need a
>> transformer to examine every text node in the stream, split it at the
>> fake "<br>" tags, substitute them with <xhtml:br/> elements, and
>> replace every escaped entity with the relevant Unicode character.
>
> That's one of the rare cases where I consider
> <xsl:text disable-output-escaping="yes"> a valid approach [1]. I don't
> know if there is something comparable directly on the Java side.

Unless I'm mistaken, doing that on his example would result in a document
that isn't well-formed, as there's no matching [/br] closing tag...? It
would be okay if it can be guaranteed that the included text is well-formed
XHTML, but if it's plain old HTML then it sounds to me more like a job for
the JTidy- or Neko-based HTML transformers.

We have something similar in our application; I arrange the early part of
the pipeline so that the escaped HTML appears within a unique element, e.g.

  [some_escaped_html]Lorem ipsum &lt;br&gt; dolor[/some_escaped_html]

then pass it through the html transformer

  [map:transform type="html"]
    [map:parameter name="tags" value="some_escaped_html"/]
  [/map:transform]

and follow that with a small XSL transformation to strip out the
some_escaped_html elements (and the html & body elements that JTidy inserts)

  [xsl:template match="some_escaped_html"]
    [xsl:apply-templates select="html/body/*"/]
  [/xsl:template]

plus the usual "passthrough" templates for all other nodes. Net result: the
same SAX stream, but with the HTML unescaped and cleaned up so it's
well-formed again.

Andrew.
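P.S. For anyone who wants to try the final XSL step outside a sitemap, here is a rough sketch that runs it from plain Java with the JDK's built-in XSLT processor. It is only an illustration under a few assumptions: the class name StripEscapedHtmlWrapper, the strip() method and the sample input are made up, the input stands in for what the html transformer might emit (the real output depends on your JTidy configuration), and real angle brackets are used instead of square ones. The "passthrough" part is the standard XSLT identity template.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: not part of Cocoon.
public class StripEscapedHtmlWrapper {

    // Identity ("passthrough") template for all nodes, plus one template that
    // replaces the wrapper element with the element children of html/body.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:template match=\"@*|node()\">"
      + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match=\"some_escaped_html\">"
      + "<xsl:apply-templates select=\"html/body/*\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the serialized result.
    public static String strip(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT step failed", e);
        }
    }

    public static void main(String[] args) {
        // Assumed shape of the stream after the html transformer has run.
        String input =
            "<doc><some_escaped_html><html><head/><body>"
          + "<p>Lorem ipsum<br/>dolor</p>"
          + "</body></html></some_escaped_html></doc>";
        System.out.println(strip(input));
    }
}
```

Note that html/body/* selects element children only, so any text sitting directly under body would be dropped; html/body/node() would keep it.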