cocoon-dev mailing list archives

From John Milan <jmi...@DataChannel.com>
Subject RE: [Moving on] SAX vs. DOM part II
Date Mon, 24 Jan 2000 19:09:57 GMT


Hello all,

Sorry I'm so late to this thread; I don't usually check email over the
weekend.

I'd like to start by forwarding the first part of a discussion Stefano and I
had. Stefano made several good points of his own, but I have not included
them here because I know the weight his opinions carry. My position has
changed somewhat since then, but I'd like to start here just to get people's
input (i.e. "you're crazy" :).

> -----Original Message-----
> From: Clark C. Evans [mailto:clark.evans@manhattanproject.com]
> Sent: Sunday, December 26, 1999 12:58 PM
> To: cocoon-dev@xml.apache.org
> Subject: Present: An internal processor architecture for Cocoon2 ?
>
>
> Stefano & Company,
> 
> About 6-9 months ago, like many of you, I was 
> waving the SAX banner and exclaiming that Cocoon 
> should be built on top of SAX and not DOM.
>
> ...
>
> I've since changed my mind.  SAX is the opposite
> extreme, and has other problems associated with
> it, namely:
>
>  (a) the programmer must explicitly manage 
>      the state of their processor.
>
>  (b) the programmer must explicitly manage
>      the storage of intermediate items
>      needed by their processor.
>
> In DOM world, neither of these are problems;
> which is why Cocoon is great to program.  
> Unfortunately, these two factors extract 
> a price... memory and cpu usage.
> 
> Instead of running from one extreme to another,
> the Cocoon group needs to devise its own interface.
> A balance between the DOM and SAX extremes,
> one that can not only act "just like DOM", or
> "just like SAX", but can take on various
> shades of intermediate behavior.
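
To make (a) and (b) concrete before I respond, here is roughly what explicit
state management looks like in a SAX handler. This is only an illustration I
am adding for discussion; the <title> element and the class name are made up:

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

import java.util.ArrayList;
import java.util.List;

public class TitleHandler extends DefaultHandler {
    // (a) explicit state: are we currently inside a <title> element?
    private boolean inTitle = false;
    // (b) explicit intermediate storage: characters() may fire several times
    // per element, so the text has to be buffered by hand.
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> titles = new ArrayList<String>();

    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if ("title".equals(qName)) {
            inTitle = true;
            buffer.setLength(0);
        }
    }

    public void characters(char[] ch, int start, int length) {
        if (inTitle) {
            buffer.append(ch, start, length);
        }
    }

    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(buffer.toString());
            inTitle = false;
        }
    }

    public List<String> getTitles() {
        return titles;
    }
}

With a DOM in hand, the same extraction is essentially a single call to
getElementsByTagName("title"); the tree carries the state for you.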

I mostly agree with the quoted points up to here; we believe this isn't an
either/or situation. However, I disagree with the suggestion that the DOM
represents an extreme.

In reading the Cocoon2 web page, I got the impression that you felt the DOM
dictates storage principles. A few quotes from this page:

1.  "This is mainly due to the fact that most (if not all!) DOM
implementations require the document to reside in memory."

2.  "even the most advanced and lightweight DOM implementation require at
least three to five times (and sometimes much more than this) more memory
than original document size."

I believe this is a major reason why you consider the DOM approach extreme.
But, nowhere in the DOM spec does it make any assertions on these issues. As
a colleague (Tyson Chihaya) and I were discussing this, he came up with the
proper analogy:

The DOM is an interface definition that does NOT require you to load the
entire document into memory, just as ODBC is an interface definition that
does NOT require you to load the entire database into memory.

Now, it may not be *easy* to create a DOM implementation that works this way,
but that is precisely what we are bringing to the table. Our implementation
enabled us to select records from a 10 gigabyte database, produce a minimal
DOM structure, and transform that DOM via an XSL engine in not much more time
than it would take to retrieve the results via a simple select statement.
Accordingly, I don't believe a DOM approach necessarily has a significant
impact on memory or speed.
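
I can't post our implementation here, so treat the following only as a rough
sketch of the principle (the JAXP classes, the table, the column names, and
the stylesheet are all stand-ins I've chosen for illustration): a DOM-based
pipeline can pull rows one at a time, wrap each in a minimal DOM fragment,
transform it, and throw it away, so the 10 gigabytes never sit in memory.

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class RowByRowTransform {
    public static void main(String[] args) throws Exception {
        // Assumed stylesheet and JDBC URL, purely for illustration.
        Transformer xsl = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("row.xsl"));
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();

        Connection con = DriverManager.getConnection(args[0]);
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SELECT id, name FROM orders");
        while (rs.next()) {
            // Build only the minimal DOM structure for this one row.
            Document doc = builder.newDocument();
            Element row = doc.createElement("row");
            row.setAttribute("id", rs.getString("id"));
            row.appendChild(doc.createTextNode(rs.getString("name")));
            doc.appendChild(row);

            // Transform and emit immediately; the fragment then becomes garbage.
            xsl.transform(new DOMSource(doc), new StreamResult(System.out));
        }
        rs.close();
        st.close();
        con.close();
    }
}

This is not how our virtual DOM actually works internally, but it shows why
"DOM" and "whole document in memory" are not the same statement.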

In fact, I'd like to take the main points listed on the Cocoon2 page one by
one:

> incremental operation - the response is created during document production.
> Client's perceived performance is dramatically improved since clients can
> start receiving data as soon as it is created, not after all processing
> stages have been performed. In those cases where incremental operation is
> not possible (for example, element sorting), internal buffers store the
> events until the operation can be performed. However, even in these cases
> performance can be increased with the use of tuned memory structures.

This might be easier with a SAX implementation in some cases, but in other
cases, as you have to acknowledge here, SAX actually makes it more difficult
by introducing "internal buffers." Just as the DOM doesn't dictate storage,
it also doesn't dictate synchronous operation. It is entirely possible to do
incremental operation behind a DOM interface. Then you get the best of both
worlds: incremental operation on requests where it can be done, and buffering
only where it can't (the sorting case). In fact, I believe sorting will be
the least of the problems as you start tackling more 'real world' issues that
might require coordinating among several services.
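
To sketch what I mean by incremental operation behind the DOM interface (this
is my illustration, with attribute and escaping handling omitted): a
serializer that pulls children one at a time through getFirstChild() and
getNextSibling() and flushes as it goes never needs the whole document at
once. If the Node it is handed is backed by a lazy implementation, the client
starts receiving bytes before the document has been fully produced.

import java.io.IOException;
import java.io.Writer;
import org.w3c.dom.Node;

public class IncrementalSerializer {
    public static void serialize(Node node, Writer out) throws IOException {
        switch (node.getNodeType()) {
            case Node.TEXT_NODE:
                out.write(node.getNodeValue());
                break;
            case Node.ELEMENT_NODE:
                out.write("<" + node.getNodeName() + ">");
                for (Node child = node.getFirstChild(); child != null;
                     child = child.getNextSibling()) {
                    serialize(child, out);   // children are pulled one at a time
                }
                out.write("</" + node.getNodeName() + ">");
                break;
            default:
                for (Node child = node.getFirstChild(); child != null;
                     child = child.getNextSibling()) {
                    serialize(child, out);
                }
        }
        out.flush();   // push output downstream as soon as it exists
    }
}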

> lowered memory consumption - since most of the server processing required
> in Cocoon is incremental, an incremental model allows XML production events
> to be transformed directly into output events and character written on
> streams, thus avoiding the need to store them in memory.

I think this assertion can only be made for simple, static XML documents. I
don't believe this statement is true with respect to databases and
transactional processing. But again, the DOM could just as easily provide
information incrementally as logic dictates.

> easier scalability - reduce memory needs allow more concurrent operation
> to be possible, thus allowing the publishing system to scale as the
> load increases. 

Our virtual DOM does not hinder scalability. Far from it; it might actually
help with data re-use as a page is being rendered. For instance, if a header
and a footer requested the same data, the SAX model would require two
separate event streams, while the virtual DOM could provide the very same
node. Perhaps the "internal buffers" proposed earlier could do the same
thing, but would they also have the API accessibility that the DOM provides?
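
As a small sketch of that re-use argument (again my illustration, with a
stand-in loadFragment() in place of the real data access): a node cache keyed
by query can hand the header and the footer the exact same DOM fragment
instead of producing the data twice.

import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class FragmentCache {
    private final Map<String, Node> cache = new HashMap<String, Node>();

    public synchronized Node get(String query) {
        Node node = cache.get(query);
        if (node == null) {
            node = loadFragment(query);
            cache.put(query, node);
        }
        return node;   // header and footer receive the same node instance
    }

    private Node loadFragment(String query) {
        // Stand-in for the real data access: in practice this would build a
        // fragment from the database, e.g. with the row-by-row approach above.
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element e = doc.createElement("data");
            e.appendChild(doc.createTextNode(query));
            return e;
        } catch (javax.xml.parsers.ParserConfigurationException ex) {
            throw new RuntimeException(ex);
        }
    }
}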

> more optimizable code model - modern virtual machines are based on
> the idea of hot spots, code fragments that are used often and, if
> optimized, increase the process execution by far. This new event model
> allows easier detection of hot spots since it's a method driven operation,
> rather than a memory driven one. Hot methods can be identified earlier
> and their optimization performed better. 

This could very well be true of static documents, so I can't dispute it
there. With dynamic documents, however, the "hot spots" will most likely come
during the information acquisition phase (i.e. issuing the SQL and retrieving
the data). As a result, the bottlenecks won't necessarily be in the code
itself but in the surrounding infrastructure; perhaps even infrastructure
that, once identified as a bottleneck, can simply have more hardware thrown
at it, such as a slow disk on the database server.

> reduced garbage collection - even the most advanced and lightweight
> DOM implementation require at least three to five times (and sometimes
> much more than this) more memory than original document size. This does
> not only reduce the scalability of the operation, but also impact
> overall performance by increasing the number of memory garbage that
> must be collected after the response in sent to the client. Even if
> modern virtual machines reduced the overhead of garbage collection,
> less garbage will always have performance and scalability impacts. 

As I mentioned before, the DOM does not dictate memory consumption, and we
believe we have the proper answer for this.

Finally, the summary paragraph contains this little blurb:

"...even if this event based model impacts not only the general architecture
of the publishing system but also its internal processing components such as
XSLT processing and PDF formatting. These components will require
substantial work and maybe design reconsideration to be able to follow a
pure event-based model. The Cocoon Project will work closely with the other
component projects to be able to influence their operation in this
direction."

I absolutely believe in an iterative approach to building code: if something
better comes along, it's ok to toss the old stuff. However, the DOM has ably
demonstrated its efficacy for XSLT processing and PDF formatting, and it
seems to me there is no inherent DOM problem at this level of processing. In
fact, this is a bit like throwing out a proven technology for one that
*should* work, but cannot yet demonstrate a substantial benefit, nor
guarantee as good an integration. Given that, asking people who have already
written an effective solution to an industry standard to rewrite it for you
seems like a lot to ask.

As a whole, I think it would be much better to take an additive approach:
maintain the current DOM interfaces and provide additional SAX capabilities.
If those prove to be great and people buy into them, then consider dropping
the DOM. But as it stands, DOM has much more industry support and many more
people working on bettering it than SAX does. If you've looked at the DOM
Level 2 recommendation, it has some interesting things that address some of
your concerns, such as iterators and filters (sketched below). I fully
believe the Level 3 recommendation will have even more, such as graph models
for better information abstraction as opposed to the current parent/child
hierarchy model. I would propose that Apache get on the DOM committee (as we
have), make your views known, and help influence the evolution of the DOM
rather than throwing it out. Isn't that more of an open source approach?
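
To give a flavor of the Level 2 Traversal module I just mentioned: a
NodeIterator plus a NodeFilter gives you a pull-style, filtered walk over a
DOM tree, which goes a fair way toward event-like processing without
abandoning the interface. (The "price" element below is just an example of
mine, and the DOM implementation has to support the optional Traversal
feature.)

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;

public class PriceWalker {
    public static void walk(Document doc) {
        // Only hand back elements named "price"; skip everything else.
        NodeFilter priceOnly = new NodeFilter() {
            public short acceptNode(Node n) {
                return "price".equals(n.getNodeName())
                        ? FILTER_ACCEPT : FILTER_SKIP;
            }
        };
        NodeIterator it = ((DocumentTraversal) doc).createNodeIterator(
                doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, priceOnly, true);
        for (Node n = it.nextNode(); n != null; n = it.nextNode()) {
            Node text = n.getFirstChild();
            System.out.println(text != null ? text.getNodeValue() : "");
        }
        it.detach();
    }
}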


John Milan
Architect
