commons-user mailing list archives

From Simon Kitching <skitch...@apache.org>
Subject Re: Ignoring Specific Tags with Digester
Date Thu, 27 Jul 2006 21:26:18 GMT
On Thu, 2006-07-27 at 09:59 -0400, rjn wrote:
> Hi Everyone,
> 
> I'm trying to write a Syndication Feed parser using Digester, however
> I'm running into a stumbling block.  Many feeds have HTML in the
> entries such as <a>, <br>, etc.   Digester tries to parse these as XML
> tags, thus leading to blanks in the data I pull out.  I was wondering
> if there was a way to set Digester to ignore specific tags (in this
> case, the HTML tags)?

No. Digester uses a standard XML parser to parse its input. That means
the input *must* be well-formed XML. If the input you have to handle
isn't well-formed, then you can't use an XML parser to parse it.
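
To illustrate, here is a minimal sketch that uses the JDK's built-in JAXP
parser rather than Digester itself. The class name WellFormedCheck, the
method isWellFormed and the sample <description> strings are made up for
this example. It only shows that a standard XML parser rejects an
unclosed HTML-style tag such as <br> outright instead of skipping it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of Digester or any feed library.
final class WellFormedCheck {

    // Returns true if the JDK's XML parser accepts the text, false if it rejects it.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print them.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored */ }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser reports malformed markup as a SAXException.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up feed fragments: HTML-style <br> versus XML-style <br/>.
        String htmlStyle = "<description>First line<br>second line</description>";
        String xmlStyle = "<description>First line<br/>second line</description>";
        System.out.println("unclosed <br>: " + isWellFormed(htmlStyle));
        System.out.println("closed <br/>:  " + isWellFormed(xmlStyle));
    }
}
```

The first fragment should be rejected and the second accepted. Since
Digester sits on top of the same kind of parser, it never gets the
chance to "ignore" a tag that makes the input malformed.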

Perhaps you can use the NekoHTML parser to convert the input to
well-formed XML first?
  http://java-source.net/open-source/html-parsers/nekohtml

Regards,

Simon

