commons-user mailing list archives

From "Jared Graber" <jgra...@zedak.com>
Subject RE: Ignoring Specific Tags with Digester
Date Thu, 27 Jul 2006 22:00:50 GMT
AFAIK,

Digester seems to be doing exactly what it is supposed to do.  It is
treating XHTML tags just like XML tags (which they are).  If you want to
keep the XHTML information I figure you have a few options:

1.  Write a custom parser (seems like overkill).
2.  XML-escape the XHTML tags prior to feeding the stream into Digester and
deal with them later if necessary.
3.  Use more complex data structures so that your objects can represent
paragraph breaks, anchors, etc.
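
To make option 3 a little more concrete, here is a rough sketch of what such
a structure might look like. All of the names below (EntryContent, Node, Text,
Anchor, LineBreak) are made up for illustration and are not part of Digester;
how you would wire Digester rules to populate it is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of an entry body that keeps its inline markup. */
public class EntryContent {

    /** One piece of the body: plain text, a link, or a line break. */
    public interface Node {
        String toPlainText();
    }

    public static final class Text implements Node {
        private final String value;
        public Text(String value) { this.value = value; }
        public String toPlainText() { return value; }
    }

    public static final class Anchor implements Node {
        private final String href;
        private final String label;
        public Anchor(String href, String label) {
            this.href = href;
            this.label = label;
        }
        public String toPlainText() { return label + " (" + href + ")"; }
    }

    public static final class LineBreak implements Node {
        public String toPlainText() { return "\n"; }
    }

    // Nodes are kept in document order so the body can be rebuilt later.
    private final List<Node> nodes = new ArrayList<Node>();

    public void add(Node node) {
        nodes.add(node);
    }

    /** Flattens the body to text, rendering links as "label (href)". */
    public String toPlainText() {
        StringBuilder sb = new StringBuilder();
        for (Node n : nodes) {
            sb.append(n.toPlainText());
        }
        return sb.toString();
    }
}
```

The point is only that each tag gets its own object instead of vanishing from
the text; the exact classes would depend on which tags the feeds contain.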

Those are just a few options that come to mind.

If using these tags causes the feed to not be valid XML, then option 2 is
probably your best bet.
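
A minimal sketch of option 2, assuming the feed is available as a String and
that the tags to neutralize are a small known set. The class name
XhtmlTagEscaper and the tag list are my own assumptions, not anything Digester
provides. It rewrites matching tags as escaped text so that an XML parser
should see them as character data rather than as elements.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical pre-processing step run before handing a feed to Digester. */
public class XhtmlTagEscaper {

    // Assumed list of inline HTML tags; adjust it to what the feeds contain.
    private static final Pattern HTML_TAG = Pattern.compile(
            "</?(?:a|br|p|b|i|em|strong)\\b[^>]*>",
            Pattern.CASE_INSENSITIVE);

    /** Returns the input with every matching tag XML-escaped. */
    public static String escapeTags(String xml) {
        Matcher m = HTML_TAG.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Escape '&' first so the entities added next are not re-escaped.
            String escaped = m.group()
                    .replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
            m.appendReplacement(out, Matcher.quoteReplacement(escaped));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String feed = "<summary>Line one<br/>see "
                + "<a href=\"http://example.com\">this</a></summary>";
        System.out.println(escapeTags(feed));
    }
}
```

For that sample input, the <summary> element is left alone while <br/> and
the <a> tags come out as &lt;br/&gt; and &lt;a href="http://example.com"&gt;
... &lt;/a&gt;. A regex like this is a blunt tool: it will also rewrite
matching tags in places you may not intend (inside CDATA sections, for
example), so treat it as a starting point only.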

If I'm totally off base, let me know.

-Jared

-----Original Message-----
From: Paul J DeCoursey [mailto:paul@decoursey.net] 
Sent: Thursday, July 27, 2006 5:31 PM
To: Jakarta Commons Users List
Subject: Re: Ignoring Specific Tags with Digester

Simon Kitching wrote:
> On Thu, 2006-07-27 at 09:59 -0400, rjn wrote:
>   
>> Hi Everyone,
>>
>> I'm trying to write a Syndication Feed parser using Digester, however
>> I'm running into a stumbling block.  Many feeds have HTML in the
>> entries such as <a>, <br>, etc.   Digester tries to parse these as XML
>> tags, thus leading to blanks in the data I pull out.  I was wondering
>> if there was a way to set Digester to ignore specific tags (in this
>> case, the HTML tags)?
>>     
>
> No. Digester uses a standard xml parser to parse its input. That means
> the input *must* be valid xml. If the input you have to handle isn't
> valid xml, then you can't use an xml parser to parse it.
>
> Perhaps you can use the NekoHTML parser to convert the input to valid
> XML??
>   http://java-source.net/open-source/html-parsers/nekohtml
>
> Regards,
>
> Simon
>
>   
I don't think that was the question. I'm guessing the XML is valid; it's
just not dealing with the XHTML part of it correctly. I'm not familiar
enough with Digester to know the solution, however.

pd





---------------------------------------------------------------------
To unsubscribe, e-mail: commons-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-user-help@jakarta.apache.org



