cocoon-dev mailing list archives

From "Hunsberger, Peter" <Peter.Hunsber...@stjude.org>
Subject RE: [RT] the quest for the perfect template language
Date Mon, 14 Apr 2003 15:30:36 GMT
Stefano Mazzocchi <stefano@apache.org> wrote:

> 
> on 4/10/03 8:41 PM Hunsberger, Peter wrote:
> 
> > Once more Cocoon hits the bleeding edge: lazy evaluation
> 
> We are getting used to it :-)
> 
> > Since lazy evaluation is almost as much of a research topic 
> > as it is  anything else, this is as much a question as it is a 
> > proposal....  I think the issue can be restated as follows (RT mixed in
> > with RT, sorry):
> > 
> > Push vs. pull is the old space vs. time complexity trade 
> > off dressed up as XML parsing.
> 
> True.
> 
> > There is no single solution, only careful attention to design trade 
> > offs can find the answer for any given application.  However, in 
> > general, the emerging answer for XML parsers appears to be lazy 
> > evaluation: treat the tree as though it is fully parsed, 
> > but only do the work as needed. It's a combination of push and pull as 
> > demanded by the application.
> 
> hmmmm
> 
> >>This is where pull parsing would really rock, the problem is
> >>that such pull parsing is, in fact, a small xml database.
> > 
> > Well with lazy evaluation you only index as you hit a node. 
> > You only hit a node if someone gives you a reason to descend a
> > particular branch.
> 
> you are kidding right? if you have a DOM, you are right, of 
> course, but if you have a Gb-long document, how do you know 
> *where* the tree you need to skip is going to end without parsing it?
> 
> Sure, the fact that you are not producing SAX events speeds 
> you up, but this is nothing compared to the speed I would 
> gain if I pre-indexed the stream and I knew where all the tokens were.

Part of the issue depends on how you define "index". You may only know that
there is a start and end of some element containing something you haven't yet
looked at in detail.  You may in fact know that there is a "token" at
location X but not what "element" it is (and probably not what attributes it
has); thus it's not a complete index until you actually have to evaluate
further.
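
To make the "incomplete index" idea concrete, here is a minimal sketch in
Python. The names child_spans and LazyDocument are invented for illustration
and are not taken from XPP or any other parser. It assumes well-formed input
with no comments, CDATA sections, processing instructions, or '>' characters
inside attribute values. The first pass records only where each child of the
root starts and ends; names and attributes are decoded only when a child is
actually requested.

```python
import xml.etree.ElementTree as ET


def child_spans(xml_text):
    """Return (start, end) offsets for each direct child of the root element.

    Only '<', '>' and '/' are inspected; element names and attributes are
    never decoded here. Assumes no comments, CDATA, processing instructions,
    or '>' inside attribute values (illustrative simplification).
    """
    spans = []
    depth = 0
    start = None
    i = 0
    while True:
        lt = xml_text.find("<", i)
        if lt == -1:
            break
        gt = xml_text.find(">", lt)
        if gt == -1:
            break  # truncated tag: stop rather than guess
        closing = xml_text[lt + 1] == "/"
        self_closing = xml_text[gt - 1] == "/"
        if closing:
            depth -= 1
            if depth == 1:
                spans.append((start, gt + 1))
        else:
            if depth == 1:
                start = lt
            if self_closing:
                if depth == 1:
                    spans.append((lt, gt + 1))
            else:
                depth += 1
        i = gt + 1
    return spans


class LazyDocument:
    """Holds the raw text plus a structural index; parses children on demand."""

    def __init__(self, xml_text):
        self.text = xml_text
        self.spans = child_spans(xml_text)
        self.parsed = {}  # child position -> fully parsed Element

    def child(self, n):
        # Full evaluation happens only for the branch someone asks for.
        if n not in self.parsed:
            s, e = self.spans[n]
            self.parsed[n] = ET.fromstring(self.text[s:e])
        return self.parsed[n]
```

Note that the index pass still reads every character to find where each
subtree ends, so this only avoids the cost of detailed evaluation, not the
cost of scanning.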

However, you're right; at some point poorly designed data models can force
consumption of the entire data stream.  This is true whether you're using SAX
or DOM: the only difference at that point is that with DOM it's an explicit
tree modeling the data and hierarchy and with SAX it's a call stack
containing the data within its hierarchy.  One can't pretend that SAX
somehow always gets you around memory consumption problems. 

So, even with lazy evaluation you can code yourself into excess memory or
processor consumption.  Considering it's a generalized approach you really
can't expect anything else; the only way to do better is with explicit
domain knowledge.  Having said that, I should point out that part of the
research in lazy evaluation is to use schema models to gain as much "domain
knowledge" as possible: if you know X can't contain a "foo" there's no point
in looking inside it even when matching "*//foo"...
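
A rough sketch of that kind of schema-driven pruning, in Python. The SCHEMA
table and the helpers can_contain and find_all are hypothetical; a real
implementation would derive the content model from a DTD or XML Schema. For
simplicity the tree here is already parsed, so the example shows only the
pruning decision, not the lazy parsing itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: element name -> names allowed as direct children.
SCHEMA = {
    "doc": {"meta", "body"},
    "meta": {"title", "author"},
    "body": {"section"},
    "section": {"section", "foo", "para"},
    "title": set(),
    "author": set(),
    "para": set(),
    "foo": set(),
}


def can_contain(schema, parent, target, seen=None):
    """True if `target` may occur somewhere beneath `parent`."""
    if parent not in schema:
        return True  # no domain knowledge for this element: we have to look
    seen = set() if seen is None else seen
    if parent in seen:
        return False  # already being explored (recursive content model)
    seen.add(parent)
    children = schema[parent]
    return target in children or any(
        can_contain(schema, c, target, seen) for c in children
    )


def find_all(root, target, schema):
    """Emulate //target, skipping subtrees the schema rules out.

    Returns the matches in document order and the number of nodes visited.
    """
    hits, visited = [], 0
    stack = [root]
    while stack:
        node = stack.pop()
        visited += 1
        if node.tag == target:
            hits.append(node)
        if can_contain(schema, node.tag, target):
            stack.extend(reversed(list(node)))
    return hits, visited
```

With this content model a search for "foo" never descends into "meta",
because nothing reachable from "meta" is allowed to be a "foo".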
 
> > Where
> > this becomes a research topic is what do you do with poorly formed 
> > documents?  AFAIK lazy evaluation only works *easily* if you can 
> > assume that you can split up the input stream using basic XML 
> > semantics and not using detailed evaluation of each and
> > every node...
> 
> I would be interested to know when this is possible.

Apparently the XPP stuff is getting close, but I haven't looked in detail.

> >>And there might be a pretty big overhead in creating a small
> >>database (say, the equivalent of Xalan DTM or even that one) 
> >>in order to facilitate indexing.
> >>
> >>But maybe, that's exactly what Xalan does internally for the
> >>document() function, I really don't know.
> > 
> > Well this brings up a whole 'nother issue: Xalan... (sigh)
> > 
> 
> right
> 

<snip on rest of discussion since no issues seem to be raised/>

> >  
> > More fun to redesign the guts of Cocoon so that it has its own 
> > semi-persistent XML document database and to build our own 
> > parsers and serializers on top of that...  ;-)
> 
> What we really need is an xml database with the ability to 
> provide a virtualized xml view to the cocoon object model with 
> an XQuery implementation on top.
> 
> And this was my hope with xindice and slide as a reference 
> implementation of the JSR 170... oh well.
> 
> Peter, same for you: hold on your RT for a future where we 
> have more time to discuss new things. we have a release to do now.
> 
Most definitely; we're about to hit our first alpha release tomorrow with
first beta in 30 days. There's no way I'll have any time to spend on this
until late summer at the earliest.  In the meantime I do hope to get some more
exposure to parsers such as XPP...

