commons-user mailing list archives

From Simon Kitching
Subject Re: [digester] noob failing to pull in raw XML
Date Mon, 13 Sep 2004 21:43:26 GMT
On Tue, 2004-09-14 at 05:43, Wade Chandler wrote:
> Peter Pimley wrote:
> > 
> > Hello everybody.
> > 
> > I'd like to use digester to parse an XML file.  What makes my situation 
> > unusual is that sometimes I want to be able to pull in raw XML without 
> > trying to interpret it.  My documents are of the form:
> > 
> > <entries>
> >  <entry>(raw XML data)</entry>
> >  <entry>(even more data)</entry>
> >  etc....
> > 
> > 
> > All I know about the raw XML data is that it is guaranteed -not- to 
> > contain an </entry> tag.  Other than that, your guess is as good as 
> > mine, as it comes from the users of my system.  It might not even be 
> > valid XML.  So, I just want to read it in as completely raw data up 
> > until the end tag.
> > 
> > My first attempt (I've never used digester before) was to add something 
> > like:
> > 
> > digester.addCallMethod("entries/entry", "doStuff");
> > 
> > ... but this didn't work.  Typically, the raw XML starts with some start 
> > tag of its own, so the <entry> tag has an empty body.  The String passed 
> > into "doStuff" has zero length.
> > 
> > Is there a way to tell digester to ignore all XML tags from a certain 
> > node downwards?
> > 
> > Thanks in advance,
> > Peter Pimley, Semantico
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail:
> > For additional commands, e-mail:
> > 
> > 
> > 
> Use a CDATA section in your XML.

Yes, if the content of the entry tag may not be valid XML, then a CDATA
section is your only option. Digester is simply a handler for SAX events
generated by an XML parser; the parser won't process the input if it
isn't valid XML and there is nothing that can be done in Digester about
that.
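
To illustrate the point about the parser rejecting such input, here is a
minimal sketch that uses only the JDK's JAXP SAX parser (not Digester
itself). The class name WellFormedCheck and the method parses() are
hypothetical, made up for this example. A parser of this kind reports a
fatal error as soon as it meets a mismatched tag, so a handler sitting
behind it never gets a usable event stream.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a stock SAX parser accepts a document.
class WellFormedCheck {
    static boolean parses(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal errors, e.g. a mismatched end tag
            return false;
        } catch (Exception e) {
            // parser configuration or I/O problems are not what we are testing
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // entry content that happens to be well-formed XML
        System.out.println(parses("<entries><entry><b>ok</b></entry></entries>"));
        // entry content with an unclosed <b> tag
        System.out.println(parses("<entries><entry><b>unclosed</entry></entries>"));
    }
}
```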

And there is no XML parser I know of that can be told to "stop parsing
and just return raw text" when a certain tag is encountered. You might
possibly be able to do this by creating the parser yourself using an
InputSource you have created, then creating a custom Digester rule which
reads directly from that input source when fired, until the </entry>
string is found. The effect would be that the parser would essentially
never see the content of the <entry> tag. Tricky, though, particularly
if the parser uses "read-ahead" caching or similar.

Using CDATA tags is much cleaner if you can arrange for the input to
have them; content is just "text" contained in the enclosing element, so
can be handled using the usual digester rules.
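
As a rough sketch of why CDATA works, the following uses a plain JAXP SAX
handler from the JDK rather than Digester, so it runs without extra jars.
The class name CdataEntries and the method readEntries() are hypothetical.
The parser delivers the CDATA content through the ordinary characters()
callback, which is the same event a Digester body-text rule such as the
addCallMethod one above should be fed from.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text body of every <entry> element.
class CdataEntries {
    static List<String> readEntries(String xml) {
        final List<String> entries = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder body;  // non-null only while inside <entry>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("entry".equals(qName)) body = new StringBuilder();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // CDATA content arrives here as plain text, markup characters included
                if (body != null) body.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("entry".equals(qName)) { entries.add(body.toString()); body = null; }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return entries;
    }

    public static void main(String[] args) {
        String xml = "<entries>"
            + "<entry><![CDATA[<b>unclosed & raw]]></entry>"
            + "<entry>plain text</entry>"
            + "</entries>";
        System.out.println(readEntries(xml));
    }
}
```

One caveat with this approach: the raw data must not itself contain the
sequence "]]>", since that ends the CDATA section.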

If the input *is* valid XML then you may wish to look at the
NodeCreateRule digester rule class. It builds a DOM document fragment
from the input. This may or may not be what you want. There isn't a
factory method on the Digester class for NodeCreateRule because it's not
a commonly-used rule; you need to create it directly:

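  // NodeCreateRule is in org.apache.commons.digester; its constructor
  // is declared to throw ParserConfigurationException
  NodeCreateRule ncr = new NodeCreateRule();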
  digester.addRule("entries/entry", ncr);

  // and tell some parent object about the DOM object created 
  // from the entry node
  digester.addSetNext("entries/entry", "addDOMTree");
