abdera-dev mailing list archives

From James M Snell <jasn...@gmail.com>
Subject Re: Understanding Incremental Parsing [was Re: failing parser test]
Date Tue, 09 Oct 2007 21:58:38 GMT

Dan Diephouse wrote:
> James M Snell wrote:
>> The incremental parser model assures that only the objects we actually
>> need will be loaded into memory.  A better way to put it would be
>> parse-on-demand.  Think of it as a hybrid between the SAX and DOM
>> approaches.  The main advantage of this approach is that it uses
>> significantly less memory than DOM.  
> For times when you're reading only the first part of the document I can
> see how this would result in less memory and quicker access times. But
> for someone who needs to access most of the document - i.e. scan through
> the entries in the feed - the whole document will still need to be
> scanned/parsed, so that shouldn't result in any difference in
> memory/time over the normal DOM approach. That is, an OMElementImpl
> will still be created at some point for each and every element. And
> each OMElement will still have attributes, child elements, etc.
> associated with it.
> For instance -
> http://www.ibm.com/developerworks/webservices/library/ws-java2/. I think
> the Axiom numbers have probably improved to more JDOM/DOM4j levels since
> then, but still it shows that given equivalent documents which are
> eventually read/loaded into memory, it will have the same order of
> magnitude memory characteristics as anything else out there.

True, but because of the way Axiom is implemented, we still see a
significant memory and speed improvement even when the full document is
parsed.  I'd encourage you to run some of the numbers yourself.

> Or am I missing something here? Abdera doesn't just skip over elements
> which aren't accessed sequentially does it? Or are you saying that the
> benefit is just when you don't need to access the whole document? i.e.
> just read the feed metadata and not the entries?

Abdera only consumes the stream when it's absolutely necessary to do so.
Elements are not skipped over unless there is a ParseFilter in place
telling it to do so.

If I have a Feed with 100 entries, and all I do is feed.getTitle(), the
100 entries will never be parsed.  Because Atom requires that the
entries come after the rest of the feed level elements, I can read all
of the feed metadata without ever having to parse the individual entries.
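
To make the feed.getTitle() case concrete, here is a minimal sketch of the
same pull-on-demand idea using the JDK's StAX API directly. It is not
Abdera or Axiom code; the class name LazyTitle, the readTitle helper and
the sample feed are made up for illustration. The only point is that the
reader stops asking for events once the feed-level title has been read, so
the entries are never turned into events or objects.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration of parse-on-demand; not Abdera's implementation.
class LazyTitle {

    // Made-up sample feed: feed-level title first, entries afterwards.
    static final String FEED =
        "<feed xmlns=\"http://www.w3.org/2005/Atom\">"
      + "<title>Example Feed</title>"
      + "<entry><title>One</title></entry>"
      + "<entry><title>Two</title></entry>"
      + "</feed>";

    // Number of start tags the last readTitle() call had to look at.
    static int startTagsSeen;

    // Pulls events only until the first <title> is found, then stops.
    static String readTitle(String xml) {
        startTagsSeen = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        startTagsSeen++;
                        if ("title".equals(reader.getLocalName())) {
                            return reader.getElementText();
                        }
                    }
                }
                return null; // no title element in the document
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readTitle(FEED)); // prints "Example Feed"
        System.out.println(startTagsSeen);   // prints 2 (feed, title)
    }
}
```

Only two start tags are ever seen here; neither entry is visited.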

When I call feed.getEntries(), Abdera returns a special List
implementation that uses an internal iterator.  That iterator will
incrementally parse the stream, so if I do for (Entry entry :
feed.getEntries()), each iteration will incrementally parse the stream;
however, if I do for (int n = 0; n < feed.getEntries().size(); n++), the
call to size() will result in the entire stream being consumed in order
to respond with the correct number of entries.
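
The difference between the two loops can be sketched with a small
stand-alone list that parses lazily. This is an assumption-laden
illustration built on the JDK's StAX reader, not Abdera's actual List
implementation: the class LazyEntries, its parsedSoFar() counter and the
sample feed are all invented, and it keeps only entry titles (assuming
every entry has one) to stay short.

```java
import java.io.StringReader;
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration; not Abdera's actual List implementation.
class LazyEntries extends AbstractList<String> {

    // Made-up sample feed with three entries.
    static final String FEED =
        "<feed xmlns=\"http://www.w3.org/2005/Atom\">"
      + "<title>Example Feed</title>"
      + "<entry><title>One</title></entry>"
      + "<entry><title>Two</title></entry>"
      + "<entry><title>Three</title></entry>"
      + "</feed>";

    private final XMLStreamReader reader;
    private final List<String> parsed = new ArrayList<>();

    LazyEntries(String xml) {
        this.reader = open(xml);
    }

    private static XMLStreamReader open(String xml) {
        try {
            return XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Advances the stream just far enough to read one more entry title.
    private boolean parseNext() {
        try {
            boolean inEntry = false;
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) continue;
                String name = reader.getLocalName();
                if ("entry".equals(name)) {
                    inEntry = true;
                } else if (inEntry && "title".equals(name)) {
                    parsed.add(reader.getElementText());
                    return true;
                }
            }
            return false; // end of document reached
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // How many entries have been pulled off the stream so far.
    int parsedSoFar() { return parsed.size(); }

    @Override public String get(int index) {
        while (parsed.size() <= index && parseNext()) { /* parse up to index */ }
        return parsed.get(index);
    }

    // Answering size() requires consuming the rest of the stream.
    @Override public int size() {
        while (parseNext()) { /* drain */ }
        return parsed.size();
    }

    // The iterator parses one entry at a time, only when asked.
    @Override public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int next = 0;
            @Override public boolean hasNext() {
                return next < parsed.size() || parseNext();
            }
            @Override public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                return parsed.get(next++);
            }
        };
    }

    public static void main(String[] args) {
        LazyEntries entries = new LazyEntries(FEED);
        for (String title : entries) {
            // prints "One (parsed so far: 1)"
            System.out.println(title + " (parsed so far: " + entries.parsedSoFar() + ")");
            break;
        }
        // prints "3 (parsed so far: 3)"
        System.out.println(entries.size() + " (parsed so far: " + entries.parsedSoFar() + ")");
    }
}
```

In this sketch, breaking out of the for-each loop after the first entry
leaves the other two unparsed, while a single call to size() drains the
stream.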

>> Another advantage is that it means
>> we can introduce filters into the parsing process so that unwanted
>> elements are ignored completely (that's the ParseFilter stuff you see in
>> the core).  To illustrate the difference, a while back we used ROME
>> (which uses JDOM) to parse Tim Bray's Atom feed and output just titles
>> and links to System.out.  We used Abdera with a parse filter to do the
>> exact same test.  The JDOM approach used over 6MB of memory; the Abdera
>> approach used right around 700 KB of memory.  The Abdera approach was
>> significantly faster as well.
> Were you skipping all the elements except for the titles? If so, a
> fairer comparison would have implemented a StAX/SAX filter for JDOM as well.
> Also, not sure what parser you used for JDOM, but Woodstox is 1.5-10x
> faster than the standard SAX parsers IIRC so that may have been a factor.

The test was based on the interfaces that ROME exposed at the time.
From what I recall, there was not a way for us to plug in any kind of
parse filter.  We could have just missed it, however.
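
For anyone who wants to try the titles-and-links idea without either
library, the rough shape of a whitelist filter can be sketched on top of
the JDK's StAX reader. This is neither Abdera's ParseFilter API nor the
code used in the original test; the class TitlesAndLinks, the WANTED set
and the sample feed are invented for illustration. Unwanted subtrees are
still tokenized, but no objects are built for them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical whitelist filter sketch; not Abdera's ParseFilter.
class TitlesAndLinks {

    // Elements we descend into or read. Matching is by local name only,
    // which ignores namespaces; that is a simplification.
    static final Set<String> WANTED =
        new HashSet<>(Arrays.asList("feed", "entry", "title", "link"));

    // Made-up sample feed; the <content> subtree should be ignored.
    static final String FEED =
        "<feed xmlns=\"http://www.w3.org/2005/Atom\">"
      + "<title>Example Feed</title>"
      + "<entry><title>One</title>"
      + "<link href=\"http://example.org/1\"/>"
      + "<content type=\"xhtml\"><div xmlns=\"http://www.w3.org/1999/xhtml\">"
      + "<link href=\"http://example.org/ignored\"/>lots of markup</div></content>"
      + "</entry></feed>";

    // Number of subtrees the last extract() call skipped.
    static int skippedSubtrees;

    static List<String> extract(String xml) {
        List<String> out = new ArrayList<>();
        skippedSubtrees = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) continue;
                String name = reader.getLocalName();
                if (!WANTED.contains(name)) {
                    skipSubtree(reader); // nothing inside is inspected
                    skippedSubtrees++;
                } else if ("title".equals(name)) {
                    out.add("title: " + reader.getElementText());
                } else if ("link".equals(name)) {
                    out.add("link: " + reader.getAttributeValue(null, "href"));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out;
    }

    // Consumes events up to the end tag matching the current start tag.
    private static void skipSubtree(XMLStreamReader reader) throws XMLStreamException {
        int depth = 1;
        while (depth > 0) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) depth++;
            else if (event == XMLStreamConstants.END_ELEMENT) depth--;
        }
    }

    public static void main(String[] args) {
        for (String line : extract(FEED)) {
            System.out.println(line);
        }
    }
}
```

With the sample feed this yields the feed title, the entry title and the
entry link; the link nested inside content never shows up.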

- James

> - Dan
