abdera-dev mailing list archives

From Dan Diephouse <dan.diepho...@mulesource.com>
Subject Re: Understanding Incremental Parsing [was Re: failing parser test]
Date Tue, 09 Oct 2007 21:38:38 GMT
James M Snell wrote:
> The incremental parser model assures that only the objects we actually
> need will be loaded into memory.  A better way to put it would be
> parse-on-demand.  Think of it as a hybrid between the SAX and DOM
> approaches.  The main advantage of this approach is that it uses
> significantly less memory than DOM.  
For cases where you're reading only the first part of the document I can 
see how this would result in less memory use and quicker access times. But 
for someone who needs to access most of the document - e.g. scan through 
the entries in the feed - the whole document still has to be 
scanned/parsed, so that shouldn't result in any difference in 
memory/time over the normal DOM approach. That is, an OMElementImpl 
will still be created at some point for each and every element, and 
each OMElement will still have attributes, child elements, etc. associated 
with it.

For instance - 
http://www.ibm.com/developerworks/webservices/library/ws-java2/. I think 
the Axiom numbers have probably improved to JDOM/DOM4j levels since 
then, but it still shows that, given equivalent documents which are 
eventually read/loaded fully into memory, Axiom will have memory 
characteristics of the same order of magnitude as anything else out there.

Or am I missing something here? Abdera doesn't just skip over elements 
which aren't accessed sequentially, does it? Or are you saying that the 
benefit only applies when you don't need to access the whole document - 
i.e. you just read the feed metadata and not the entries?
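To make concrete the "read the feed metadata and not the entries" case: a plain JDK StAX sketch (not Abdera's API - class and method names here are made up for illustration) can pull the feed-level title and stop before any entry is ever turned into an object, which is where a pull/incremental model genuinely saves memory over DOM:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration (plain StAX, not Abdera): read only the
// feed-level metadata and stop before the first <entry>, so the entries
// are never parsed into objects at all.
public class FeedMetadataReader {

    public static String feedTitle(String atomXml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(atomXml));
        try {
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = r.getLocalName();
                    if ("entry".equals(name)) {
                        // Reached the entries without a feed title; stop here.
                        return null;
                    }
                    if ("title".equals(name)) {
                        // Feed-level title found; the rest of the document,
                        // including every entry, is never parsed.
                        return r.getElementText();
                    }
                }
            }
            return null;
        } finally {
            r.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String feed =
            "<feed xmlns='http://www.w3.org/2005/Atom'>" +
            "<title>Example Feed</title>" +
            "<entry><title>Entry 1</title></entry>" +
            "</feed>";
        System.out.println(feedTitle(feed)); // prints "Example Feed"
    }
}
```

If the caller only ever touches the metadata, the entry elements are skipped by the cursor and no tree is built for them - which is the one scenario where the incremental model clearly beats DOM.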
> Another advantage is that it means
> we can introduce filters into the parsing process so that unwanted
> elements are ignored completely (that's the ParseFilter stuff you see in
> the core).  To illustrate the difference, a while back we used ROME
> (which uses JDOM) to parse Tim Bray's Atom feed and output just titles
> and links to System.out.  We used Abdera with a parse filter to do the
> exact same test.  The JDOM approach used over 6 MB of memory; the Abdera
> approach used around 700 KB of memory.  The Abdera approach was
> significantly faster as well.
Were you skipping all the elements except for the titles? If so, a 
fairer comparison would've implemented a StAX/SAX filter for JDOM as well. 
Also, I'm not sure which parser you used for JDOM, but Woodstox is 1.5-10x 
faster than the standard SAX parsers IIRC, so that may have been a factor.
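For reference, the kind of streaming filter being discussed - keep only entry titles and link hrefs, drop everything else before any tree is built - can be sketched with plain JDK StAX (again, this is not Abdera's ParseFilter API; the class and method names are made up for illustration):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch (plain StAX): walk the feed once and collect only
// entry titles and link hrefs. Everything else streams past the cursor
// and is never materialized as objects.
public class TitleLinkFilter {

    public static List<String> titlesAndLinks(String atomXml) throws Exception {
        List<String> out = new ArrayList<>();
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(atomXml));
        boolean inEntry = false;
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = r.getLocalName();
                if ("entry".equals(name)) {
                    inEntry = true;
                } else if (inEntry && "title".equals(name)) {
                    out.add(r.getElementText());        // entry title text
                } else if (inEntry && "link".equals(name)) {
                    String href = r.getAttributeValue(null, "href");
                    if (href != null) out.add(href);    // entry link href
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "entry".equals(r.getLocalName())) {
                inEntry = false;
            }
        }
        r.close();
        return out;
    }
}
```

A filter like this and an unfiltered JDOM build are doing very different amounts of work, which is the point above: the memory gap in the ROME-vs-Abdera test measures the filtering as much as the parser.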

- Dan

Dan Diephouse
http://mulesource.com | http://netzooid.com/blog
