commons-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: [Digester] HTML entity decoding?
Date Wed, 22 Apr 2009 04:06:45 GMT

Hi Paul,

I'm no XML guru, so some of this stuff is fuzzy.  Please see my comments/questions below.



----- Original Message ----
> From: Paul Libbrecht <paul@activemath.org>
> To: Commons Users List <user@commons.apache.org>
> Sent: Wednesday, April 15, 2009 6:24:05 PM
> Subject: Re: [Digester] HTML entity decoding?
> 
> Hello Otis,
> 
> For the second form you'll need to hook up a DTD: a DTD declaration in 
> your header pointing to a DTD which defines these entities. I am no expert in 
> Digester, but I believe that this is the only way to do so, at least according to 
> the XML specs.

XML files I'm trying to parse do have "links" to DTDs in the "header" (sometimes with a full
http://... URL, and sometimes with just a local file name), but there are no actual DTD files
there.  Is the first step, then, making sure that the referenced DTD files really exist at
locations pointed to in the "header" of the XML?

> Here's a text pointing to such a DTD:
>   
> http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_xhtml_character_entities

So does this mean I would have to ensure that the DTD files contain things like:

<!ENTITY nbsp   "&#160;" >
<!ENTITY iexcl  "&#161;" >
<!ENTITY uuml   "&#252;" >
...
and so on?
And if my DTD had this, are you saying Digester would decode my:
<name><![CDATA[Gr&uuml;ber]]></name>

to Grüber?  Or to &#252; ?

My end goal is to index this data with Lucene/Solr, so I need it to be "Grüber" before I
send it to Lucene/Solr.  In other words, if I end up with &#252;, this is still no good
for me, as I still wouldn't have Grüber.
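
To make the question concrete, here is a minimal sketch of the experiment I have in mind. It uses the JDK's built-in DOM parser rather than Digester, and the names (EntityExpansionDemo, rootText, plain, cdata) are made up for illustration. The entity is declared in an internal DTD subset so that no DTD file is needed. As far as I understand the XML rules, the reference in ordinary element content should come back as the actual character, while the one inside CDATA should be left untouched:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityExpansionDemo {
    // Internal DTD subset declaring a single entity (illustrative only).
    static final String DTD = "<!DOCTYPE name [ <!ENTITY uuml \"&#252;\"> ]>";

    // Hypothetical helper: parses the XML with a non-validating parser
    // and returns the text content of the root element.
    static String rootText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Entity reference in ordinary element content: the parser expands it.
    static String plain() {
        return rootText(DTD + "<name>Gr&uuml;ber</name>");
    }

    // Same reference inside CDATA: the parser treats it as literal text.
    static String cdata() {
        return rootText(DTD + "<name><![CDATA[Gr&uuml;ber]]></name>");
    }
}
```

If Digester's underlying parser behaves the same way, the CDATA wrapper would be what keeps &uuml; from being decoded, regardless of what the DTD contains.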

> Note that opening the file with a validating parser will certainly grumble about 
> all sorts of undeclared elements, this is ok, it does not prevent parsing but 
> is, indeed, a validation error.

Uh, I'm lost here.  Which file are you referring to?  DTD or the XML file?  Sounds like XML.
 And why would I get complaints about undeclared elements?
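
My best guess at what is meant, sketched with the JDK's DOM parser and a hypothetical ValidatingParseDemo class: if the DTD declares only entities, a validating parser reports the elements as undeclared, but those are recoverable errors handed to the ErrorHandler, and the text still comes through with the entity expanded.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class ValidatingParseDemo {
    // Messages of recoverable validation errors seen during the parse.
    final List<String> validationErrors = new ArrayList<>();
    // Text content of the root element after parsing.
    String text;

    static ValidatingParseDemo parse(String xml) {
        ValidatingParseDemo result = new ValidatingParseDemo();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) {
                    // Validation errors are recorded, and parsing continues.
                    result.validationErrors.add(e.getMessage());
                }
                public void fatalError(SAXParseException e) throws SAXParseException {
                    // Well-formedness problems still abort the parse.
                    throw e;
                }
            });
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            result.text = doc.getDocumentElement().getTextContent();
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```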

> However you get the entity-expansion.

How?  If I make the XML parser validating?  This is what I do to my Digester instance as soon
as I create it:
        dig.setValidating(false);
        dig.setEntityResolver(new NoOpEntityResolver());

And that NoOpEntityResolver is my custom class that implements the resolveEntity method:

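import java.io.IOException;
import java.io.Reader;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class NoOpEntityResolver implements EntityResolver {
    // Hands the parser an empty stream for the two known (missing) DTDs,
    // so that it does not fail while trying to load them.
    public InputSource resolveEntity(String publicId, String systemId) {
        // Constant-first equals() avoids a NullPointerException if systemId is null.
        if ("file:///tmp/dtd/foo-1.2.dtd".equals(systemId)
                || "http://example.com/dtd/foo-1.2.dtd".equals(systemId)) {
            BlankReader reader = new BlankReader();
            // return a special input source
            return new InputSource(reader);
        } else {
            // use the default behaviour
            return null;
        }
    }

    // A Reader that is always at end of stream.
    class BlankReader extends Reader {
        @Override
        public void close() throws IOException {}

        @Override
        public int read(char[] arg0, int arg1, int arg2) throws IOException {
            return -1;
        }
    }
}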

Could this be a problem?  I had to add this class to stop Digester from breaking, if I recall
correctly, because those .dtd files don't actually exist.
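
For comparison, here is a sketch of a resolver that serves a small in-memory DTD instead of an empty stream, so that the parser does see entity declarations. The class name EntityDeclaringResolver, the helper parseName, and the three-entity table are assumptions for illustration. A real setup would need the full entity set, and I have only reasoned about this against the JDK's parser, not Digester itself.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class EntityDeclaringResolver implements EntityResolver {
    // Illustrative subset; a real DTD would declare every entity the documents use.
    static final String ENTITY_DTD =
            "<!ENTITY nbsp  \"&#160;\">\n"
          + "<!ENTITY iexcl \"&#161;\">\n"
          + "<!ENTITY uuml  \"&#252;\">\n";

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Assumption: every referenced .dtd can be replaced by the entity table above.
        if (systemId != null && systemId.endsWith(".dtd")) {
            return new InputSource(new StringReader(ENTITY_DTD));
        }
        // Anything else falls back to the parser's default behaviour.
        return null;
    }

    // Hypothetical helper: parses the XML with this resolver installed
    // and returns the text content of the root element.
    static String parseName(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(new EntityDeclaringResolver());
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With Digester, the equivalent would presumably be dig.setEntityResolver(new EntityDeclaringResolver()) in place of the no-op resolver.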

> Note that using the first form, which contains an *escaped* entity, there's 
> nothing to do! You'd have to match them manually ("re-entrantly") into a parser 
> that parses entities properly.

Uh, what does this mean? :)
Are you saying "&uuml;" is the "escaped" form of the entity?  (what would be the unescaped
form of it?)
And what do you mean by there is nothing to do?  (I was hoping the parser would do the work
and convert "&uuml;" to "ü")
I don't understand the last sentence.... so I'm not even sure how to ask any questions about
it.... but it sounds like you are saying that some parsers may simply do what I need, just
not Digester?  I'm not sure what you mean by manual matching?
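
If "matching them manually" means decoding the entity references myself after Digester hands back the CDATA text, then something like the following might be what is meant. This is only a sketch: HtmlEntityDecoder and its three-entry table are hypothetical, and a real version would need the full HTML entity list (Commons Lang's StringEscapeUtils.unescapeHtml looks like it covers the same ground).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlEntityDecoder {
    // Matches decimal references such as &#252; and named ones such as &uuml;
    private static final Pattern ENTITY =
            Pattern.compile("&(#\\d{1,7}|[A-Za-z][A-Za-z0-9]*);");

    // Illustrative subset of the HTML named entities.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("nbsp", "\u00a0");
        NAMED.put("iexcl", "\u00a1");
        NAMED.put("uuml", "\u00fc");
    }

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement = m.group(); // unknown references are left as-is
            if (body.startsWith("#")) {
                int codePoint = Integer.parseInt(body.substring(1));
                if (Character.isValidCodePoint(codePoint)) {
                    replacement = new String(Character.toChars(codePoint));
                }
            } else if (NAMED.containsKey(body)) {
                replacement = NAMED.get(body);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The decoded string could then be passed on to Lucene/Solr in place of the raw value that Digester returns.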

Any further help would be greatly appreciated.

Thanks,
Otis

> paul
> 
> PS: I would count myself lucky that the XML parsing did not blow up in the second 
> case, as it would with a normal XML parser: a missing entity declaration means 
> unparseable XML, while a missing element declaration is a much less dangerous thing.
> 
> On 16 Apr 09, at 00:06, Otis Gospodnetic wrote:
> 
> > 
> > Hello,
> > 
> > I'm using Digester 2.0 and trying to process XML that
> > may include HTML entities and trying to get Digester to decode them
> > when parsing.
> > 
> > For example, my XML contains:
> >  <name><![CDATA[Gr&uuml;ber]]></name>
> > 
> > Currently, Digester parses this as:  Gr&uuml;ber
> > 
> > But what I am really after is "Grüber", so I am looking for a way to get this 
> > &uuml; entity decoded by Digester.
> > How do I tell Digester to decode HTML entities?
> > 
> > Also, if I don't use CDATA, like this:
> >  <name>Gr&uuml;ber</name>
> > 
> > Digester gives me: Grber



