commons-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: [Digester] HTML entity decoding?
Date Wed, 22 Apr 2009 21:32:49 GMT

Hi,

Thanks Paul.  I'm getting closer, but still not there.  More inlined comments/questions.



----- Original Message ----
> From: Paul Libbrecht <paul@activemath.org>
> 
> Le 22-avr.-09 à 06:06, Otis Gospodnetic a écrit :
> > XML files I'm trying to parse do have "links" to DTDs in the "header" 
> (sometimes with a full http://... URL, and sometimes with just a local file 
> name), but there are no actual DTD files there.  Is the first step, then, making 
> sure that the referenced DTD files really exist at locations pointed to in the 
> "header" of the XML?
> 
> The short answer is yes.
> The long answer is yes except if you manage to configure xml catalogs (I think 
> that, in the case of Xerces, something such as the XmlResolver is used) which 
> associate "public-ids" to local files. That's best for performance but long to 
> configure.

OK, I get this, although I don't know yet how to tell Digester to do this.
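
One way this might look, as a rough sketch rather than a confirmed recipe: an org.xml.sax.EntityResolver that hands the parser a local copy of the DTD whenever a known system ID is requested. The sketch uses plain JAXP so that it runs on its own; the same resolver object could presumably be passed to dig.setEntityResolver(...). The class name LocalDtdResolverDemo, the example.invalid URL, and the in-memory DTD text (a stand-in for a file on disk) are all made up for illustration.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class LocalDtdResolverDemo {

    /** Builds a resolver that serves locally held DTD text for known system IDs. */
    public static EntityResolver localResolver(Map<String, String> dtdBySystemId) {
        return (publicId, systemId) -> {
            String dtd = dtdBySystemId.get(systemId);
            if (dtd == null) {
                // null tells the parser to fall back to opening systemId itself
                return null;
            }
            InputSource source = new InputSource(new StringReader(dtd));
            source.setSystemId(systemId);
            return source;
        };
    }

    /** Parses xml with the given resolver and returns all character data. */
    public static String textOf(String xml, EntityResolver resolver) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(resolver);
        StringBuilder text = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical system ID; nothing is fetched from the network
        String systemId = "http://example.invalid/dtd/entities.dtd";
        // Stand-in for the contents of a local DTD file
        String dtd = "<!ENTITY uuml \"&#252;\">";
        String xml = "<!DOCTYPE name SYSTEM \"" + systemId + "\">"
                + "<name>Gr&uuml;ber</name>";
        System.out.println(textOf(xml, localResolver(Map.of(systemId, dtd))));
    }
}
```

The map is keyed on absolute system IDs because the parser resolves relative ones against the document's base URI before calling the resolver, so a real mapping would probably need to match on the file name or on the public ID instead.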

> I suppose this is going to be living in something that is not command-line so DTDs 
> should be cached. At worst, make sure the property for such in the parser is set.

Actually, I do run this XML parsing tool from the command-line.

> >> Here's a text pointing to such a DTD:
> >> 
> http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_xhtml_character_entities
> > 
> > So does this mean i would have to ensure that the DTD files contain things 
> like:
> > 
> > 
> > <!ENTITY uuml   "ü" >
> > ...
> > and so on?
> > And if my DTD had this, are you saying Digester would decode my:
> >  <name><![CDATA[Gr&uuml;ber]]></name>
> 
> no
> but
> Grüber
> (the other form is exactly an escape which is equivalent to
>   Gr&uuml;ber not what you want!)
> 
> > to Grüber?  Or to ü ?
> 
> (both of the above are equivalent in XML-compliant parsers. A method reading 
> that XML would only receive Grüber.)

Hm, still not sure what would get me Grüber, what exactly I'd need to do to make Digester
or the underlying parser go from:
  <name><![CDATA[Gr&uuml;ber]]></name>
to
   Grüber

> > My end goal is to index this data with Lucene/Solr, so I need it to be 
> "Grüber" before I send it to Lucene/Solr.
> > In other words, if I end up with ü, this is still no good for me, as I 
> still wouldn't have Grüber.
> 
> You could also insert the DTDs inside the solr document.

Uh, this would be very very complicated.  I just need to parse out that Grüber and store/index
it as such.

> >> Note that opening the file with a validating parser will certainly grumble 
> about
> >> all sorts of undeclared elements, this is ok, it does not prevent parsing but
> >> is, indeed, a validation error.
> > 
> > Uh, I'm lost here.  Which file are you referring to?  DTD or the XML file?  
> Sounds like XML.  And why would I get complaints about undeclared elements?
> 
> the DTD has the double function of declaring elements and attributes as well as 
> entities.
> DTD validation will fail if you have just defined entities in your DTD but not 
> the relevant elements.
> XML parsing will fail if you use entities that you have not defined.

OK.

> >> However you get the entity-expansion.
> > 
> > How?  If I make the XML parser validating?
> 
> if you use a conforming parser.

So, it sounds like the following may be the recipe:
- make sure the referenced DTD files really exist and that the parser can get to them
- make sure the DTD files include entities used in the XML document
- turn XML validation on (?)
- run XML parser

... and now, because the parser has access to the DTD files and those DTD files declare the
entities, the XML parser turns, say, &uuml; into ü.

Is this correct?
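
A small self-contained check of the recipe above, offered as a sketch under assumptions: it uses the JDK's default JAXP SAX parser directly instead of Digester, and it declares the entity in an internal DTD subset so that no external file is needed. Validation is left off, which suggests that entity expansion depends only on the declaration being visible to the parser, not on validation. The class name EntityExpansionDemo is made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EntityExpansionDemo {

    /** Parses xml with a non-validating SAX parser and returns all character data. */
    public static String textOf(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false); // validation is not needed for entity expansion
        StringBuilder text = new StringBuilder();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }
                });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Internal DTD subset declaring the one entity this document uses
        String xml = "<!DOCTYPE name [ <!ENTITY uuml \"&#252;\"> ]>"
                + "<name>Gr&uuml;ber</name>";
        System.out.println(textOf(xml)); // prints "Grüber"
    }
}
```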

> > This is what I do to my Digester instance as soon as I create it:
> >        dig.setValidating(false);
> 
> this is to prevent validation failures (such as undeclared attributes or 
> elements) from stopping processing; it is good.

It is good to turn validation off?

> >        dig.setEntityResolver(new NoOpEntityResolver());
> > And that NoOpEntityResolver is my custom class that implements the 
> resolveEntity method:
> 
> I believe that is definitely the problem! ;-)

OK.  Perhaps the errors I was seeing were there because the DTD files were missing.

> Please note that most DTD files that people refer to are easy to get publicly 
> and are often bundled with software.

OK, so I really need to find those DTDs.

> What kind of files are these that you are reading with Digester?
> Do you have samples?


> You seem to be lacking control of the DTDs in the same fancy way HTML files are 
> done. I would consider NekoHtml tools then.

Yes, I don't have DTDs, but who knows, I may be able to find them.  Why do you suggest NekoHtml?

> >> Note that using the first form, which contains an *escaped* entity, there's
> >> nothing to do! You'd have to match them manually ("re-entrantly") into a 
> parser
> >> that parses entities properly.
> > 
> > Uh, what does this mean? :)
> > Are you saying "ü" is the "escaped" form of the entity?  (what would be 
> the unescaped form of it?)
> 
> I was saying <![CDATA[Gr&uuml;ber]]>  or Gr&uuml;ber is the escaped form for 
> which you can only fix by applying regexps (which might break other things).

Hm, then, if I understand you correctly, you are saying I will *not* be able to get Grüber
from <![CDATA[Gr&uuml;ber]]> because <![CDATA[Gr&uuml;ber]]> is the escaped
form (of Grüber?) and the XML parser will not be able to "unescape" it to get me Grüber
even if I follow the above steps?
And you are saying the only way for me to fix this is to manually replace &uuml; with
ü ==> s/&uuml;/ü/ type of thing?
If that's what you are saying, then this is a completely manual process, nothing that the XML
parser can help me with?  That doesn't sound right, so I'm probably misunderstanding you.

> > And what do you mean by there is nothing to do?  (I was hoping the parser 
> would do the work and convert "ü" to "ü")
> > I don't understand the last sentence.... so I'm not even sure how to ask any 
> questions about it.... but it sounds like you are saying that some parsers may 
> simply do what I need, just not Digester?  I'm not sure what you mean by manual 
> matching?
> 
> Digester is not a parser, it uses the JAXP-available parsers.
> By default in JDK >= 1.5, this is a xerces copy (under com.sun packages).
> If you have other parsers in the classpath these may be rather taken (something 
> in META-INF can be used I think).
> 
> Xerces does a good job so it's definitely possible to work with it. E.g. DTD 
> caching can be configured for it as well as catalogs.
> 
> Digester is there to make the interface between xml-parsing and java objects.
> If you're just producing XML outside, there may be alternatives, indeed.

Right.  I'm trying to stick with Digester because the XML I'm parsing would be a pain to parse
with straight Xerces.
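
On the point above about which parser is actually in use: JAXP picks the factory through, among other things, the javax.xml.parsers.SAXParserFactory system property and a META-INF/services/javax.xml.parsers.SAXParserFactory entry on the classpath. A quick way to see what got selected is to print the factory class, as in this sketch (the class name WhichParser is made up for illustration):

```java
import javax.xml.parsers.SAXParserFactory;

public class WhichParser {

    /** Returns the class name of the SAXParserFactory that JAXP selected. */
    public static String factoryClassName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // The output depends on the JDK and on what is on the classpath
        System.out.println(factoryClassName());
    }
}
```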

Thanks,
Otis

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
For additional commands, e-mail: user-help@commons.apache.org

