maven-doxia-dev mailing list archives

From Lukas Theussl <>
Subject Re: entities: text or rawText?
Date Wed, 13 May 2009 11:44:59 GMT

For reference: the XhtmlBaseParser in Doxia 1.1.1 emits entities as text, unless
they are not recognized (i.e. have not been declared), in which case they are
emitted as unknown events.
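
To make that behavior concrete, here is a minimal sketch in plain Java. It is
not the Doxia source: RecordingSink, handleEntity() and the entity names are
hypothetical stand-ins, and the "unknown" event is reduced to a bare name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EntityEventSketch {

    /** Hypothetical stand-in for a sink; it only records the events it receives. */
    public static class RecordingSink {
        public final List<String> events = new ArrayList<>();

        public void text(String s) {
            events.add("text:" + s);
        }

        public void unknown(String name) {
            events.add("unknown:" + name);
        }
    }

    /**
     * Emits a declared entity as a text event carrying its replacement text,
     * and an undeclared one as an unknown event carrying its name.
     */
    public static void handleEntity(String name, Map<String, String> declared, RecordingSink sink) {
        String replacement = declared.get(name);
        if (replacement != null) {
            sink.text(replacement);
        } else {
            sink.unknown(name);
        }
    }

    public static void main(String[] args) {
        // Corresponds to a declaration like <!ENTITY version "1.0">
        Map<String, String> declared = new LinkedHashMap<>();
        declared.put("version", "1.0");

        RecordingSink sink = new RecordingSink();
        handleEntity("version", declared, sink); // declared
        handleEntity("foo", declared, sink);     // never declared (made-up name)

        System.out.println(sink.events); // prints [text:1.0, unknown:foo]
    }
}
```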


Vincent Siveton wrote:
> Hi Lukas,
> 2009/5/4 Lukas Theussl <>:
>> Vincent,
>> I'm trying to understand some of the issues we have with entities in the
>> XmlParser. Is there a special reason why entities are emitted as rawText and
>> not text?
> The text used by XhtmlBaseParser#handleEntity() could contain
> predefined entities [1] and numeric code entities (e.g. &AElig; is
> turned into &#198; by XmlPullParser).
> XhtmlBaseSink#text() escapes chars and XhtmlBaseSink#rawText() does not.
> So using rawText() makes sure that text with entities is not escaped.
>> I think they should be emitted as text:
>> First, custom entities can be used to simply define some replacement text
>> inside documents (eg <!ENTITY version "1.0">).
>> Second, the resulting events should be consumable by all sinks, not just
>> x(ht)ml based ones. Consider for instance the text "&amp;&AElig;" (where
>> AElig is defined as <!ENTITY AElig  "&#198;">). Currently it is emitted by
>> the XhtmlBaseParser as one text event "&" and one rawText event "&#198;".
>> This means that eg the Latex Sink will produce wrong output (the AElig
>> should be converted to "\AE" in latex).
>> IMO the resolved entity should be emitted in a format-independent way, eg as
>> one (unicode?) character, just like &amp; is emitted as one character above.
>> The consuming sink then has to transform that into a format-specific
>> representation.
> It could be another implementation.
> XhtmlBaseParser#handleEntity() could unescape xml and call only sink.text()
> Cheers,
> Vincent
> [1]
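
A rough sketch of the alternative discussed in the quoted thread, where the
parser resolves the entity to a single Unicode character and calls only text(),
leaving the format-specific escaping to each sink. Everything here is assumed
for illustration: TextSink, XhtmlLikeSink, LatexLikeSink and resolveNumeric()
are hypothetical names, not Doxia classes, and each sink handles only the few
characters needed for the "&amp;&AElig;" example.

```java
public class FormatIndependentEntities {

    /** Hypothetical minimal sink that only receives format-independent text. */
    public interface TextSink {
        void text(String s);
    }

    /** Turns a numeric character reference such as "&#198;" or "&#xC6;" into its character. */
    public static String resolveNumeric(String ref) {
        String body = ref.substring(2, ref.length() - 1); // drop the leading "&#" and the trailing ";"
        int codePoint = (body.startsWith("x") || body.startsWith("X"))
                ? Integer.parseInt(body.substring(1), 16)
                : Integer.parseInt(body, 10);
        return new String(Character.toChars(codePoint));
    }

    /** Escapes the XML markup characters and passes everything else through. */
    public static class XhtmlLikeSink implements TextSink {
        public final StringBuilder out = new StringBuilder();

        @Override
        public void text(String s) {
            for (char c : s.toCharArray()) {
                switch (c) {
                    case '&': out.append("&amp;"); break;
                    case '<': out.append("&lt;"); break;
                    case '>': out.append("&gt;"); break;
                    default: out.append(c);
                }
            }
        }
    }

    /** Maps characters to LaTeX commands; only '&' and U+00C6 are covered in this sketch. */
    public static class LatexLikeSink implements TextSink {
        public final StringBuilder out = new StringBuilder();

        @Override
        public void text(String s) {
            for (char c : s.toCharArray()) {
                switch (c) {
                    case '&': out.append("\\&"); break;
                    case '\u00C6': out.append("\\AE{}"); break; // empty braces terminate the command
                    default: out.append(c);
                }
            }
        }
    }

    /** Events a parser could emit for "&amp;&AElig;" under this proposal: two text events. */
    public static void emit(TextSink sink) {
        sink.text("&");                      // from &amp;
        sink.text(resolveNumeric("&#198;")); // from &AElig;, resolved to one character
    }

    public static void main(String[] args) {
        XhtmlLikeSink xhtml = new XhtmlLikeSink();
        LatexLikeSink latex = new LatexLikeSink();
        emit(xhtml);
        emit(latex);
        System.out.println(xhtml.out);
        System.out.println(latex.out); // prints \&\AE{}
    }
}
```

With both sinks fed the same two text events, the XHTML-like sink escapes the
ampersand itself and the LaTeX-like sink produces its own command for the
character, so no rawText event is needed in this sketch.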
