cocoon-dev mailing list archives

From "Scott Boag/CAM/Lotus" <Scott_B...@lotus.com>
Subject RE: [Moving on] SAX vs. DOM part II
Date Tue, 25 Jan 2000 04:43:18 GMT

> The DOM is an interface definition that does NOT require you to load the
> entire document into memory just as ODBC is an interface definition that
> does NOT require you to load the entire database into memory.

No, but the DOM *is* an object model, and the interface requires certain
features, such as the ability to get a child count, to be a full DOM.
Witness Xalan's DTM, which I would call a pseudo-DOM.  Because it is meant
to be used incrementally, you can't get a child count, which is a major
limitation to people wanting to do counter loops and the like.
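
For example, plain org.w3c.dom code like the following (the helper name is
made up, just an illustration) is exactly the kind of counter loop I mean --
it forces the implementation to know the full child list before the loop
even starts:

    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class ChildCountLoop {
        // Legal against any conforming DOM, but an incremental tree can't
        // answer getLength() without reading and buffering all of this
        // element's children first.
        static void printChildNames(Element elem) {
            NodeList kids = elem.getChildNodes();
            int count = kids.getLength();
            for (int i = 0; i < count; i++) {
                System.out.println(kids.item(i).getNodeName());
            }
        }
    }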

> The DOM is an interface definition that does NOT require you to load the
> entire document into memory just as ODBC is an interface definition that
> does NOT require you to load the entire database into memory.

Notice that the base ODBC definition does not allow you to get row count.

> Now, it may not be *easy* to create a DOM with such an implementation, but
> that is precisely what we are bringing to the table.

I would be very interested in working with you guys on Xalan's Document
Table Model (DTM), or something similar to replace it.

> Our implementation
> enabled us to select records from a 10 gigabyte database, produce minimal
> DOM structure, and transform that DOM via an XSL engine in not much more
> time than it would take to retrieve results via a simple select statement.
> Accordingly, I don't believe a DOM approach necessarily has significant
> impact on memory or speed considerations.

Returning node lists of a fixed structure is a different thing than
providing a full DOM tree that fulfills the standard interfaces.

> This might be easier with a SAX implementation in some cases, but in other
> cases, as you have to mention here, SAX actually makes it more difficult by
> introducing "internal buffers."

An internal tree pretty much always needs to be made for an XSLT processor.
The issue is whether, in the primary, performance-critical case, the XSLT
engine should be required to use generic DOM interfaces.  The answer is:
this is very problematic.

If you look at the design of Xalan, I think you'll see that we are pretty
much on the same page re the use of the DOM.  Xalan has always tried hard
to be as DOM-neutral as possible.  But, the fact remains, it is problematic.
It is one thing to work with a special known DOM implementation... quite
another to be able to work with any DOM.  And if you can't work with raw
DOM interfaces, you're not really working with a DOM... you're working with
a proprietary tree structure.

> I
> don't believe this statement is true with respect to databases and
> transactional processing. But again, the DOM could just as easily provide
> information incrementally as logic dictates.

I agree with you, and have designed the XLocator interfaces in Xalan for
exactly this purpose.  But Cocoon is a document processor, not a database
and transactional processor and report generator.

> For instance, if a
> header and footer requested the same data, the SAX model would require two
> separate events, while the virtual DOM could provide the same node.

For the result tree??  I don't think this is true, as the DOM has no
reference model: each node has to point back up to its parent (OK, you
might be able to fake it with some really ugly tricks).
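
(Concretely -- and this is just a quick illustration with made-up element
names, not Xalan code -- appending a node that is already in a DOM tree
*moves* it rather than sharing it:)

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    public class OneParentOnly {
        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                                                 .newDocumentBuilder().newDocument();
            Element page = (Element) doc.appendChild(doc.createElement("page"));
            Element header = (Element) page.appendChild(doc.createElement("header"));
            Element footer = (Element) page.appendChild(doc.createElement("footer"));

            Element data = doc.createElement("data");
            header.appendChild(data);
            footer.appendChild(data);   // "data" is removed from header and moved here

            System.out.println(header.getChildNodes().getLength());   // prints 0
            System.out.println(footer.getChildNodes().getLength());   // prints 1
        }
    }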

(I want to keep on responding to all your good points, but it's getting
late... unfortunately, this is one of my favorite subjects.)

> In fact, as a whole, I think it would be much better to take an additive
> approach. That is, maintain the current DOM interfaces and provide
> additional SAX capabilities.

The more I think about the issue in general, and your note, the more I
agree with this.  I like this approach far more than forcing one or the
other.  But it will complicate the architecture, and one should avoid
translation of one model to the other.  Xalan currently does both, for all
the reasons you name, so this is easy for me to say.  I still strongly
maintain that for high-performance servers, SAX pipes are the only way to
go.
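
(By "SAX pipes" I mean chaining filter stages so events flow straight from
the parser through each stage to the serializer, with no tree in between.
A minimal sketch -- the stage class and the serializer wiring are made up
for illustration:)

    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.XMLFilterImpl;

    public class SaxPipeSketch {
        // A do-nothing stage: events pass straight through.  A real stage
        // would override startElement()/characters()/etc. to rewrite the
        // event stream as it flows by, without ever building a tree.
        static class PassThroughStage extends XMLFilterImpl {
        }

        public static void main(String[] args) throws Exception {
            XMLReader parser = SAXParserFactory.newInstance()
                                               .newSAXParser().getXMLReader();
            PassThroughStage stage = new PassThroughStage();
            stage.setParent(parser);                 // parser -> stage
            // stage.setContentHandler(serializer);  // stage -> serializer (next stage)
            stage.parse(new InputSource(args[0]));
        }
    }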

-scott




                                                                                         
                          
John Milan <jmilan@DataChannel.com>
01/24/00 02:09 PM

To:       cocoon-dev@xml.apache.org
cc:       "Clark C. Evans" <clark.evans@manhattanproject.com>,
          Ted Leung <twleung@sauria.com>,
          Scott Boag <Scott_Boag/CAM/Lotus@lotus.com>
          (bcc: Scott Boag/CAM/Lotus)
Subject:  RE: [Moving on] SAX vs. DOM part II

Please respond to cocoon-dev


Hello all,

Sorry I'm so late to this thread; I don't usually check email over the
weekend.

I'd like to start out by forwarding the first part of a discussion Stefano
and I had. Stefano had several good points to make, but I have not
included them here because I know the weight his opinions carry. My
position has changed somewhat, but I'd like to start here just to get
people's input (i.e. you're crazy :).

> -----Original Message-----
> From: Clark C. Evans [mailto:clark.evans@manhattanproject.com]
> Sent: Sunday, December 26, 1999 12:58 PM
> To: cocoon-dev@xml.apache.org
> Subject: Present: An internal processor architecture for Cocoon2 ?
>
>
> Stefano & Company,
>
> About 6-9 months ago, like many of you, I was
> waving the SAX banner and exclaiming that Cocoon
> should be built on top of SAX and not DOM.
>
> ...
>
> I've since changed my mind.  SAX is the opposite
> extreme, and has other problems associated with
> it, namely:
>
>  (a) the programmer must explicitly manage
>      the state of their processor.
>
>  (b) the programmer must explicitly manage
>      the storage of intermediate items
>      needed by their processor.
>
> In DOM world, neither of these are problems;
> which is why Cocoon is great to program.
> Unfortunately, these two factors extract
> a price... memory and cpu usage.
>
> Instead of running from one extreme to another,
> the Cocoon group needs to devise its own interface.
> A balance between the DOM and SAX extremes,
> one that can not only act "just like DOM", or
> "just like SAX", but can take on various
> shades of intermediate behavior.

I mostly agree with this sentiment up to this point. We believe this isn't
an either/or situation. However, I disagree with the sentiment that the DOM
represents an extreme.

In reading the Cocoon2 web page, I got the impression that you felt the DOM
dictates storage principles. A few quotes from this page:

1.  "This is mainly due to the fact that most (if not all!) DOM
implementations require the document to reside in memory."

2.  "even the most advanced and lightweight DOM implementation require at
least three to five times (and sometimes much more than this) more memory
than original document size."

I believe this is a major reason why you consider the DOM approach extreme.
But nowhere does the DOM spec make any assertions about these issues.  As
a colleague (Tyson Chihaya) and I were discussing this, he came up with the
proper analogy:

The DOM is an interface definition that does NOT require you to load the
entire document into memory just as ODBC is an interface definition that
does NOT require you to load the entire database into memory.

Now, it may not be *easy* to create a DOM with such an implementation, but
that is precisely what we are bringing to the table. Our implementation
enabled us to select records from a 10 gigabyte database, produce minimal
DOM structure, and transform that DOM via an XSL engine in not much more
time than it would take to retrieve results via a simple select statement.
Accordingly, I don't believe a DOM approach necessarily has significant
impact on memory or speed considerations.
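
To give a flavor of what I mean by "minimal DOM structure" -- this is only
a sketch with made-up names, not our actual implementation -- the element
the XSL engine sees is not populated from the database until something
actually asks for it:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    public class VirtualRowsSketch {
        private final Connection conn;
        private final Document doc;
        private Element rows;   // built on first access, not at parse time

        public VirtualRowsSketch(Connection conn, Document doc) {
            this.conn = conn;
            this.doc = doc;
        }

        // Nothing in the DOM interfaces forces the whole database into the
        // tree up front; the rows are fetched only when first requested.
        public synchronized Element getRows(String query) throws Exception {
            if (rows == null) {
                rows = doc.createElement("rows");
                Statement st = conn.createStatement();
                try {
                    ResultSet rs = st.executeQuery(query);
                    while (rs.next()) {
                        Element row = doc.createElement("row");
                        row.appendChild(doc.createTextNode(rs.getString(1)));
                        rows.appendChild(row);
                    }
                } finally {
                    st.close();
                }
            }
            return rows;
        }
    }

A real virtual DOM would of course push this laziness behind the Node
interfaces themselves, so the XSL engine never sees the difference.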

In fact, I'd like to take the main points listed on the Cocoon2 page one by
one:

> incremental operation - the response is created during document production.
> Client's perceived performance is dramatically improved since clients can
> start receiving data as soon as it is created, not after all processing
> stages have been performed. In those cases where incremental operation is
> not possible (for example, element sorting), internal buffers store the
> events until the operation can be performed. However, even in these cases
> performance can be increased with the use of tuned memory structures.

This might be easier with a SAX implementation in some cases, but in other
cases, as you have to mention here, SAX actually makes it more difficult by
introducing "internal buffers." Just as the DOM doesn't dictate storage, it
also doesn't dictate synchronous operation. It is very possible to do
incremental operation with a DOM interface. Then you get the best of both
worlds: incremental operation on requests where it can be done, and a
fallback (the internal buffers) for cases like sorting where it can't. In
fact, I believe sorting will be the least of the problems as you start
tackling more 'real world' issues that might require coordinating among
several services.
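
To make that concrete, here is the kind of thing I mean (a rough sketch
only; it uses an identity transform purely as a serializer, and the method
name is made up): completed pieces of the tree are handed to the output
stream as soon as they exist, instead of after the whole document is built.

    import java.io.OutputStream;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class IncrementalDomOutput {
        // Serialize each completed child of the (virtual) root as soon as
        // it is available, so the client starts receiving data before the
        // rest of the tree has even been produced.
        static void writeChildrenAsTheyComplete(Node root, OutputStream out)
                throws Exception {
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            NodeList kids = root.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                serializer.transform(new DOMSource(kids.item(i)), new StreamResult(out));
                out.flush();   // the piece just produced reaches the client now
            }
        }
    }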

> lowered memory consumption - since most of the server processing required
> in Cocoon is incremental, an incremental model allows XML production
> events to be transformed directly into output events and characters
> written on streams, thus avoiding the need to store them in memory.

I think this assertion can only be made for simple, static XML documents. I
don't believe this statement is true with respect to databases and
transactional processing. But again, the DOM could just as easily provide
information incrementally as logic dictates.

> easier scalability - reduced memory needs allow more concurrent operation
> to be possible, thus allowing the publishing system to scale as the
> load increases.

Our virtual DOM does not hinder scalability. Far from it, as it actually
might help with data re-use as a page is being rendered. For instance, if a
header and footer requested the same data, the SAX model would require two
separate events, while the virtual DOM could provide the same node. Perhaps
the "internal buffers" proposed earlier could do the same thing, but would
they also have the API accessibility that the DOM provides?

> more optimizable code model - modern virtual machines are based on
> the idea of hot spots, code fragments that are used often and, if
> optimized, increase the process execution by far. This new event model
> allows easier detection of hot spots since it's a method driven
> operation, rather than a memory driven one. Hot methods can be identified
> earlier and their optimization performed better.

This could very well be true with static documents, so I can't dispute it.
However, with dynamic documents, the "hot spots" will most likely come
during the information acquisition phase (i.e. issuing the SQL and
retrieving the data). As a result, the bottlenecks won't necessarily be the
code itself, but the surrounding infrastructure-- maybe even infrastructure
that, if it's identified as a bottleneck, more hardware could be thrown at,
such as a slow disk on the database server.

> reduced garbage collection - even the most advanced and lightweight
> DOM implementation require at least three to five times (and sometimes
> much more than this) more memory than original document size. This does
> not only reduce the scalability of the operation, but also impact
> overall performance by increasing the number of memory garbage that
> must be collected after the response is sent to the client. Even if
> modern virtual machines reduced the overhead of garbage collection,
> less garbage will always have performance and scalability impacts.

As I mentioned before, the DOM does not dictate memory consumption, and we
believe we have the proper answer for this.

Finally, the summary paragraph contains this little blurb:

"...even if this event based model impacts not only the general
architecture
of the publishing system but also its internal processing components such
as
XSLT processing and PDF formatting. These components will require
substantial work and maybe design reconsideration to be able to follow a
pure event-based model. The Cocoon Project will work closely with the other
component projects to be able to influence their operation in this
direction."

I absolutely believe in an iterative approach to building code-- that if
something better comes along, it's OK to toss the old stuff. However, the
DOM has ably demonstrated its efficacy for XSLT processing and PDF
formatting. It seems to me there is no inherent DOM problem at this level
of processing. In fact, this is a bit like throwing out a proven technology
for one that *should* work, but cannot demonstrate substantial benefit, nor
guarantee as good an integration. Given that, asking people who have
already written an effective solution to an industry standard to rewrite
things for you seems like a lot to ask.

In fact, as a whole, I think it would be much better to take an additive
approach. That is, maintain the current DOM interfaces and provide
additional SAX capabilities. If they prove to be great, and people buy into
it, then consider dropping the DOM. But as it stands, the DOM has much more
industry support, and has many more people working on bettering it than SAX
does. In fact, if you've looked at the DOM Level II recommendation, it has
some interesting things to address some of your concerns, such as iterators
and filters. I fully believe the Level III rec will have even more, such as
incorporating graph models for better information abstraction as opposed to
the current parent/child hierarchy model. I would propose that Apache get
on the DOM committee (like us), make your views known, and help influence
the evolution of the DOM, rather than throwing it out (isn't that more of
an open source approach?)
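
(For example, here is the sort of thing the Level II traversal module
already gives you -- assuming, of course, an implementation that supports
the optional Traversal feature:)

    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.traversal.DocumentTraversal;
    import org.w3c.dom.traversal.NodeFilter;
    import org.w3c.dom.traversal.NodeIterator;

    public class Level2IteratorExample {
        // Walk only the element nodes of a document.  How and when the
        // nodes behind the iterator are materialized is entirely up to
        // the implementation.
        static void printElementNames(Document doc) {
            NodeIterator it = ((DocumentTraversal) doc).createNodeIterator(
                    doc.getDocumentElement(),
                    NodeFilter.SHOW_ELEMENT,
                    null,     // no custom NodeFilter
                    true);    // expand entity references
            for (Node n = it.nextNode(); n != null; n = it.nextNode()) {
                System.out.println(n.getNodeName());
            }
        }
    }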


John Milan
Architect




