commons-user mailing list archives

From Paul Libbrecht <p...@activemath.org>
Subject Re: [Digester] HTML entity decoding?
Date Wed, 15 Apr 2009 22:24:05 GMT
Hello Otis,

For the second form, you'll need to hook in a DTD: put a DOCTYPE  
declaration in the header of your document that points to a DTD  
defining these entities. I am no expert in Digester, but I believe this  
is the only way to do it, at least according to the XML specs.
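
A minimal sketch of the idea, using the plain JAXP parser rather than Digester itself. It assumes an internal DTD subset that declares only the `uuml` entity, instead of a DOCTYPE pointing to a full external DTD; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class EntityDeclarationSketch {

    /** Parses the XML string and returns the text content of its root element. */
    static String rootText(String xml) {
        try {
            // Default JAXP factory: non-validating, entity references are expanded.
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement()
                          .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Internal DTD subset declaring only the one entity we need (assumption:
        // a real document would more likely reference an external DTD).
        String xml =
                "<!DOCTYPE name [\n"
              + "  <!ENTITY uuml \"&#252;\">\n"
              + "]>\n"
              + "<name>Gr&uuml;ber</name>";
        String text = rootText(xml);
        // Compare against the escaped form to sidestep console-encoding issues.
        System.out.println(text.equals("Gr\u00fcber"));
    }
}
```

Once the entity is declared, the parser itself does the expansion, so the code receiving the text (Digester rules included, presumably) should see the decoded character.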

Here's a document pointing to such a DTD:
   http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_xhtml_character_entities

Note that opening the file with a validating parser will certainly  
grumble about all sorts of undeclared elements. This is OK: it is,  
indeed, a validation error, but it does not prevent parsing, and you  
still get the entity expansion.

Note that with the first form, which contains an *escaped* entity,  
there's nothing the parser will do for you! You'd have to feed that  
text manually ("re-entrantly") into a parser that handles entities  
properly.
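
As a rough illustration of handling the first form by hand: the sketch below does not re-run the text through an XML parser, it swaps in a simpler technique, a lookup table of entity names applied with a regular expression after the text has been extracted. The class name and the small entity table are assumptions; a real table would need to cover every HTML entity you expect.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ManualEntityDecoder {

    // Hypothetical, deliberately tiny table; extend it as needed.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("uuml", "\u00fc");
        ENTITIES.put("ouml", "\u00f6");
        ENTITIES.put("auml", "\u00e4");
        ENTITIES.put("amp", "&");
    }

    // Matches named references such as &uuml;
    private static final Pattern REFERENCE =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Replaces known entity references; unknown ones are left untouched. */
    static String decode(String text) {
        Matcher m = REFERENCE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            if (replacement == null) {
                replacement = m.group(); // keep the reference as written
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The CDATA content from the question arrives as literal "Gr&uuml;ber".
        System.out.println(decode("Gr&uuml;ber").equals("Gr\u00fcber"));
    }
}
```

This would be applied to the string after parsing, for example in the setter that receives the element text.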

paul

PS: I would count myself lucky that the XML parsing did not blow up in  
the second case, as it would with a normal XML parser: a missing entity  
declaration means unparseable XML, while a missing element declaration  
is a much less dangerous thing.
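
To make the difference concrete, here is a small sketch with the plain JAXP parser (not Digester, whose own error handling may differ). The class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class UndeclaredEntityCheck {

    /** Returns true if a non-validating parser accepts the document. */
    static boolean isParseable(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and stays silent otherwise.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. reference to an undeclared entity
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Undeclared entity: a fatal well-formedness error.
        System.out.println(isParseable("<name>Gr&uuml;ber</name>"));
        // Declared entity, undeclared element: only a validity issue,
        // so a non-validating parse goes through.
        System.out.println(isParseable(
                "<!DOCTYPE name [<!ENTITY uuml \"&#252;\">]><name>Gr&uuml;ber</name>"));
    }
}
```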

On 16 Apr 2009, at 00:06, Otis Gospodnetic wrote:

>
> Hello,
>
> I'm using Digester 2.0 and trying to process XML that
> may include HTML entities and trying to get Digester to decode them
> when parsing.
>
> For example, my XML contains:
>  <name><![CDATA[Gr&uuml;ber]]></name>
>
> Currently, Digester parses this as:  Gr&uuml;ber
>
> But what I am really after is "Grüber", so I am looking for a way to  
> get this &uuml; entity decoded by Digester.
> How do I tell Digester to decode HTML entities?
>
> Also, if I don't use CDATA, like this:
>  <name>Gr&uuml;ber</name>
>
> Digester gives me: Grber

