commons-user mailing list archives

From Paul Libbrecht <p...@activemath.org>
Subject Re: [Digester] HTML entity decoding?
Date Wed, 22 Apr 2009 08:08:16 GMT

On 22 Apr 2009, at 06:06, Otis Gospodnetic wrote:
> I'm no XML guru, so some of this stuff is fuzzy.  Please see my  
> comments/questions below.

I'm happy to help ;-)

> XML files I'm trying to parse do have "links" to DTDs in the  
> "header" (sometimes with a full http://... URL, and sometimes with  
> just a local file name), but there are no actual DTD files there.   
> Is the first step, then, making sure that the referenced DTD files  
> really exist at locations pointed to in the "header" of the XML?

The short answer is yes.
The long answer is yes, unless you manage to configure XML catalogs
(I think that, in the case of Xerces, something such as
XMLCatalogResolver is used), which associate "public-ids" with local
files. That's best for performance but takes a while to configure.

I suppose this is going to live in something that is not a
command-line tool, so DTDs should be cached. At worst, make sure the
parser property for that is set.
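
To make the "public-id to local file" idea concrete, here is a minimal sketch of a hand-rolled resolver. It is not a real XML catalog, only the same idea in miniature. The class name LocalDtdResolver, the public ID and the example.invalid URL are all made up for illustration, and the DTD text is kept in memory rather than in a file to keep the sketch self-contained.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical mini-catalog: maps public IDs to DTD text held locally. */
class LocalDtdResolver implements EntityResolver {
    private final Map<String, String> dtdByPublicId = new HashMap<>();

    /** Registers the DTD text to serve for a given public ID. */
    LocalDtdResolver register(String publicId, String dtdText) {
        dtdByPublicId.put(publicId, dtdText);
        return this;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String dtd = dtdByPublicId.get(publicId);
        if (dtd == null) {
            return null; // unknown ID: let the parser open systemId itself
        }
        // Serve the local copy; the parser never touches the network.
        InputSource source = new InputSource(new StringReader(dtd));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    /** Parses xml with this resolver and returns the root element's text. */
    String rootText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(this);
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

The same object could presumably be handed to dig.setEntityResolver(...) in place of a no-op resolver, since Digester accepts any org.xml.sax.EntityResolver.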

>> Here's a text pointing to such a DTD:
>> http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_xhtml_character_entities
>
> So does this mean i would have to ensure that the DTD files contain  
> things like:
> <!ENTITY nbsp   "&#160;" >
> <!ENTITY iexcl  "&#161;" >
> <!ENTITY uuml   "&#252;" >
> ...
> and so on?
> And if my DTD had this, are you saying Digester would decode my:
> <name><![CDATA[Gr&uuml;ber]]></name>

No, but it would decode
<name>Gr&uuml;ber</name>
(the CDATA form is precisely an escape: it is equivalent to
   <name>Gr&amp;uuml;ber</name>, which is not what you want!)

> to Grüber?  Or to &#252; ?

Both of the above are equivalent for XML-compliant parsers. A method
reading that XML would only ever receive Grüber.
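
Here is a small sketch of that difference using the JAXP DOM parser that ships with the JDK. The class name EntityVsCdata is made up, and the entity is declared in an internal DTD subset only to keep the example self-contained; your files would get it from the external DTD instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

/** Hypothetical demo: what the application sees for each way of writing u-umlaut. */
class EntityVsCdata {
    // Internal subset declaring the one entity used below.
    private static final String DOCTYPE =
            "<!DOCTYPE name [<!ENTITY uuml \"&#252;\">]>";

    /** Wraps body in a name element, parses it, and returns the element text. */
    static String rootText(String body) {
        String xml = DOCTYPE + "<name>" + body + "</name>";
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // Entity reference and character reference: both expanded by the parser.
        System.out.println(rootText("Gr&uuml;ber"));
        System.out.println(rootText("Gr&#252;ber"));
        // CDATA and &amp; forms: the text "&uuml;" reaches the application literally.
        System.out.println(rootText("<![CDATA[Gr&uuml;ber]]>"));
        System.out.println(rootText("Gr&amp;uuml;ber"));
    }
}
```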


> My end goal is to index this data with Lucene/Solr, so I need it to  
> be "Grüber" before I send it to Lucene/Solr.
> In other words, if I end up with &#252, this is still no good for  
> me, as I still wouldn't have Grüber.

You could also insert the DTDs inside the solr document.

>> Note that opening the file with a validating parser will certainly  
>> grumble about
>> all sorts of undeclared elements, this is ok, it does not prevent  
>> parsing but
>> is, indeed, a validation error.
>
> Uh, I'm lost here.  Which file are you referring to?  DTD or the XML  
> file?  Sounds like XML.  And why would I get complaints about  
> undeclared elements?

The DTD has the double function of declaring elements and attributes
as well as entities.
DTD validation will fail if your DTD defines only entities but not
the relevant elements.
XML parsing will fail if you use entities that you have not defined.
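
A sketch of those two failure modes, again with the JDK's JAXP parser. The names DtdRolesDemo and Collector are invented for this example. Validation errors are collected by an ErrorHandler and do not stop the parse; a reference to an entity that is declared nowhere is a fatal well-formedness error.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

/** Hypothetical demo of validation errors versus well-formedness errors. */
class DtdRolesDemo {
    /** Records recoverable (validation) errors; rethrows fatal ones. */
    static final class Collector implements ErrorHandler {
        final List<String> validationErrors = new ArrayList<>();

        @Override public void warning(SAXParseException e) { }

        @Override public void error(SAXParseException e) {
            validationErrors.add(e.getMessage());
        }

        @Override public void fatalError(SAXParseException e) throws SAXParseException {
            throw e;
        }
    }

    /** Returns the root text, or empty if the document is not well-formed. */
    static Optional<String> rootText(String xml, boolean validating, Collector collector) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(collector);
            return Optional.of(builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent());
        } catch (SAXParseException e) {
            return Optional.empty(); // fatal error, e.g. an undefined entity
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With a DTD that declares only the entity, rootText(xml, true, collector) still returns the expanded text while the collector fills up with complaints about the undeclared name element; with validation off, the collector stays empty.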

>> However you get the entity-expansion.
>
> How?  If I make the XML parser validating?

If you use a conforming parser.

> This is what I do to my Digester instance as soon as I create it:
>        dig.setValidating(false);

This prevents validation failures (such as undeclared attributes or
elements) from stopping the processing, so it is good.

>        dig.setEntityResolver(new NoOpEntityResolver());
> And that NoOpEntityResolver is my custom class that implements the  
> resolveEntity method:

I believe that is definitely the problem! ;-)
Please note that most DTD files that people refer to are easy to get  
publicly and are often bundled with software.

What kind of files are these that you are reading with Digester?
Do you have samples?
You seem to lack control over the DTDs, much as is usually the case
with HTML files. In that case I would consider the NekoHTML tools.

>> Note that using the first form, which contains an *escaped* entity,  
>> there's
>> nothing to do! You'd have to match them manually ("re-entrantly")  
>> into a parser
>> that parses entities properly.
>
> Uh, what does this mean? :)
> Are you saying "&uuml;" is the "escaped" form of the entity?  (what  
> would be the unescaped form of it?)

I was saying that <![CDATA[Gr&uuml;ber]]> or Gr&amp;uuml;ber is the
escaped form, which you can only fix by applying regexps (which might
break other things).
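
For the record, a rough sketch of what such a regexp fix could look like, assuming Java 9 or later. The class name EscapedEntityDecoder and its three-entry table are made up; a real table would have to cover every entity your files use. References it does not know are left untouched.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical post-processing step for text that arrived with escaped entities. */
class EscapedEntityDecoder {
    // Illustrative subset only; extend to whatever the source DTD declares.
    private static final Map<String, String> NAMED =
            Map.of("nbsp", "\u00a0", "iexcl", "\u00a1", "uuml", "\u00fc");

    // Matches &#xHEX; or &#DEC; or &name;
    private static final Pattern REFERENCE =
            Pattern.compile("&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String text) {
        Matcher m = REFERENCE.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = resolve(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String resolve(String name, String original) {
        if (name.startsWith("#x")) {
            return codePoint(name.substring(2), 16, original);
        }
        if (name.startsWith("#")) {
            return codePoint(name.substring(1), 10, original);
        }
        return NAMED.getOrDefault(name, original); // unknown names stay as-is
    }

    private static String codePoint(String digits, int radix, String original) {
        try {
            int cp = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : original;
        } catch (NumberFormatException e) {
            return original; // too large to be a code point
        }
    }
}
```

The "might break other things" part is real: if some text was meant to contain the literal characters "&uuml;", this step will rewrite it too.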

> And what do you mean by there is nothing to do?  (I was hoping the  
> parser would do the work and convert "&uuml;" to "ü")
> I don't understand the last sentence.... so I'm not even sure how to  
> ask any questions about it.... but it sounds like you are saying  
> that some parsers may simply do what I need, just not Digester?  I'm  
> not sure what you mean by manual matching?

Digester is not a parser; it uses whichever parser JAXP makes
available.
By default in JDK >= 1.5, this is a copy of Xerces (under com.sun
packages).
If you have other parsers on the classpath, these may be picked up
instead (I think an entry under META-INF/services can be used for that).
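
If you want to check which parser JAXP actually picks up in your setup, a quick sketch like the following prints the factory class. The class name WhichParser is made up; I am deliberately not stating an expected output, since it depends on your JDK and classpath.

```java
import javax.xml.parsers.SAXParserFactory;

/** Hypothetical helper: reports which SAX parser factory JAXP selected. */
class WhichParser {
    static String saxFactoryClass() {
        // newInstance() runs the standard JAXP lookup (system property,
        // service providers on the classpath, then the JDK default).
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println(saxFactoryClass());
    }
}
```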

Xerces does a good job so it's definitely possible to work with it.  
E.g. DTD caching can be configured for it as well as catalogs.

Digester is there to provide the interface between XML parsing and
Java objects.
If you're just producing XML outside, there may be alternatives, indeed.

paul