commons-user mailing list archives

From "Adrian Sutton" <adrian.sut...@ephox.com>
Subject RE: [digester] reading embedded HTML (or other mixed text)
Date Sun, 23 May 2004 22:44:00 GMT
Sounds like you may want to run the HTML section through JTidy
(http://jtidy.sourceforge.net) to convert it to XHTML first.  Then
Digester should be able to at least parse it.

Regards,

Adrian Sutton. 

-----Original Message-----
From: Simon Kitching [mailto:simon@ecnetwork.co.nz] 
Sent: Monday, 24 May 2004 8:39 AM
To: Jakarta Commons Users List
Subject: Re: [digester] reading embedded HTML (or other mixed text)

On Fri, 2004-05-21 at 12:34, Bill Keese wrote:
> Is there any way to tell digester to read in the entire content of an
> element (including text and sub-elements) as a single String? For
> example, if I persist e-mail to XML, I'd like to use digester to read
> the e-mail address list, etc., but the HTML content of the mail should
> be read verbatim.
> 

Hi Bill,

HTML is generally not well-formed XML (for example, tags such as <br>
may be left unclosed). Digester uses a standard XML parser to parse the
input, so it cannot process an input document that is not well-formed
XML.

As Jose has said in a separate reply, you could wrap your HTML in a
CDATA section in the input document. The XML parser will then see the
contents of that CDATA section as just a text string, and so will
Digester.
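
To illustrate the CDATA approach, here is a minimal sketch that uses only
the JDK's built-in DOM parser rather than Digester itself. The class name
and the element names (mail, to, body) are invented for this example. It
shows that the parser hands back the CDATA contents as plain text, even
though the HTML inside is not well-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; element names are assumptions for illustration.
class CdataDemo {

    // Parses the XML and returns the text content of the first <body> element.
    static String readBody(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("body").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        // The HTML has an unclosed <br> and mismatched tag case, which
        // would break an XML parser if it were not inside CDATA.
        String xml = "<mail><to>bill@example.com</to>"
                + "<body><![CDATA[<p>Hi<br>there</P>]]></body></mail>";
        System.out.println(readBody(xml));
    }
}
```

One caveat with this approach: the wrapped HTML must not itself contain
the sequence ]]>, since that ends the CDATA section.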

Alternatively, you could use XHTML, which most browsers support. The
content is then well-formed XML, so you could use NodeCreateRule to
receive the element and its children as a DOM node.
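
If the goal is still a single String, a DOM node can be serialized back
to markup. The sketch below is not Digester code: it uses only the JDK's
DOM parser and identity Transformer, and the class, method and element
names are invented for this example. It assumes you already hold a DOM
node for the XHTML element, however it was obtained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; element names are assumptions for illustration.
class XhtmlBodyDemo {

    // Parses the XML and returns the first <body> element as a DOM node.
    static Node findBody(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("body").item(0);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    // Serializes a DOM node (including its sub-elements) to a String.
    static String toMarkup(Node node) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize node", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<mail><to>bill@example.com</to>"
                + "<body><p>Hi <b>there</b></p></body></mail>";
        System.out.println(toMarkup(findBody(xml)));
    }
}
```

Note that the serialized markup may differ from the input in minor ways
(attribute quoting, empty-element syntax), so it is equivalent rather
than strictly verbatim.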

Regards,

Simon 


---------------------------------------------------------------------
To unsubscribe, e-mail: commons-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-user-help@jakarta.apache.org



