cocoon-dev mailing list archives

From: Stefano Mazzocchi <>
Subject: Re: [RT] the quest for the perfect template language
Date: Sat, 12 Apr 2003 16:54:29 GMT
On 4/10/03 8:41 PM, Hunsberger, Peter wrote:

> Once more Cocoon hits the bleeding edge: lazy evaluation

We are getting used to it :-)

> Since lazy evaluation is almost as much of a research topic as it is
> anything else, this is as much a question as it is a proposal....  I think
> the issue can be restated as follows (RT mixed in with RT, sorry):
> Push vs. pull is the old space vs. time complexity trade off dressed up as
> XML parsing.  


> There is no single solution, only careful attention to design
> trade offs can find the answer for any given application.  However, in
> general, the emerging answer for XML parsers appears to be lazy evaluation:
> treat the tree as though it is fully parsed, but only do the work as needed.
> It's a combination of push and pull as demanded by the application.


>>This is where pull parsing would really rock, the problem is 
>>that such pull parsing is, in fact, a small xml database.
> Well with lazy evaluation you only index as you hit a node.  You only hit a
> node if someone gives you a reason to descend a particular branch.

You are kidding, right? If you have a DOM, you are right, of course, but
if you have a GB-long document, how do you know *where* the tree you
need to skip is going to end without parsing it?

Sure, the fact that you are not producing SAX events speeds you up, but
this is nothing compared to the speed I would gain if I pre-indexed the
stream and knew where all the tokens were.
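To make the point concrete, here is a minimal sketch (the XML content and class name are invented for illustration) using a StAX-style pull parser: even when you "skip" a subtree you are not interested in, the parser must still consume every event until the matching end tag, because without a pre-built index there is no way to know where the subtree ends.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

// Hypothetical sketch: "skipping" a subtree in a pull parser still
// means pulling every event inside it -- on a GB-long stream this is
// far from free, unlike a seek into a pre-indexed representation.
public class SkipSubtree {
    public static void main(String[] args) throws Exception {
        String xml = "<root><skip><a/><b>text</b></skip><keep>value</keep></root>";
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        int eventsConsumed = 0;
        while (r.hasNext()) {
            int ev = r.next();
            eventsConsumed++;
            if (ev == XMLStreamConstants.START_ELEMENT
                    && r.getLocalName().equals("skip")) {
                // Skip the <skip> subtree: we must still read event by
                // event until its depth returns to zero.
                int depth = 1;
                while (depth > 0) {
                    int e = r.next();
                    eventsConsumed++;
                    if (e == XMLStreamConstants.START_ELEMENT) depth++;
                    else if (e == XMLStreamConstants.END_ELEMENT) depth--;
                }
            } else if (ev == XMLStreamConstants.START_ELEMENT
                    && r.getLocalName().equals("keep")) {
                System.out.println("reached <keep> after " + eventsConsumed
                        + " events");
            }
        }
    }
}
```

The event counter shows that reaching `<keep>` costs events proportional to the size of the skipped subtree, which is exactly the cost a token index would avoid.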

> Where
> this becomes a research topic is what do you do with poorly formed
> documents?  AFAIK lazy evaluation only works *easily* if you can assume that
> you can split up the input stream using basic XML semantics and not using
> detailed evaluation of each and every node...

I would be interested to know when this is possible.

>>And there might be a pretty big overhead in creating a small 
>>database (say, the equivalent of Xalan DTM or even that one) 
>>in order to facilitate indexing.
>>But maybe, that's exactly what Xalan does internally for the 
>>document() function, I really don't know.
> Well this brings up a whole 'nother issue: Xalan... (sigh)


>>Still, my point remains: the underlying amount of work the 
>>system has to do to come out with a simple variable using 
>>document() is incredible compared to the use of a simple 
>>method call of a taglib.
>>document() keeps on looking like a golden hammer antipattern to me.
> Well yes and no, if all you need is a "single" variable then don't use
> document.  However, how many times do you really need just a single
> variable?  (More on that in a moment.)
>>I think it would make perfect architectural sense as an 
>>interface to access a real xml database, but for accessing 
>>something like an xml-ized representation of session content, 
>>well, I'm not sure.
> Let me for a moment quote part of your "RE: [RT]  improving the session
> concept" response:
>>I found out that there are three different kinds of flows, not two as I
>>previously assumed:
>> 1) fully stateless -> pure publishing
>> 2) fully stateful -> strict flows
>> 3) half/half -> everything else!!!
>>I'm pretty sure that 90% of web sites done in a modern technology belong
>>to #3.
> Ok, here goes: session by its very nature is persistent.  Thus, the index
> isn't useful for just a single reference to a single variable. In fact, I
> will argue (based on your above quote), this is true for 90% of the data
> that Cocoon touches:  session, flow and caching are all just different ways
> of dealing with the same problem; you need to be able to reference some
> amount of data across some period of time in such a way as to optimize the
> performance of your application.  Now one can argue that caching extends the
> problem from a single user to multiple users (as opposed to flow and
> session), but so what?  In many cases I want a piece of "my" data combined
> with the shared data, it's really all one big pot with differing rules on
> who can access what parts of it when.
> What does this mean?  Really, Cocoon is playing mostly where we've got some
> amount of data hanging around being persisted in various ways and accessed
> in various ways.  What would make more sense is in fact some kind of
> "transient" storage database that gives a common method for handling all of
> this data across the lifetimes required for all of the data.  What would
> really rock is that the parsers and serializers are wired directly into this
> database.  In other words, if something like Xalan builds us a nice DTM
> indexed data representation why do we throw it all away only to have it
> built again moments later?  We need a smart way of saying when to trash what
> across users, across requests to optimize processing. We also need
> normalized caching across all data in order to optimize memory usage. (Space
> and time optimization.)
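Peter's "transient storage" idea above might be sketched, very roughly, as a single store whose entries carry an explicit lifetime, so a parsed/indexed representation built for one request is not thrown away and rebuilt moments later. Everything here (the `TransientStore` class, the scope names, the string values) is invented for illustration, not an actual Cocoon API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a "transient storage database": one store
// holding data across differing lifetimes, with a single policy for
// deciding when to trash what (the space vs. time trade-off above).
public class TransientStore {
    enum Scope { REQUEST, SESSION, GLOBAL }

    private static class Entry {
        final Object value;
        final Scope scope;
        Entry(Object value, Scope scope) { this.value = value; this.scope = scope; }
    }

    private final Map<String, Entry> store = new ConcurrentHashMap<>();

    void put(String key, Object value, Scope scope) {
        store.put(key, new Entry(value, scope));
    }

    Object get(String key) {
        Entry e = store.get(key);
        return e == null ? null : e.value;
    }

    // Trash everything at or below the given lifetime,
    // e.g. expire(REQUEST) at the end of a request.
    void expire(Scope scope) {
        store.values().removeIf(e -> e.scope.ordinal() <= scope.ordinal());
    }

    public static void main(String[] args) {
        TransientStore ts = new TransientStore();
        ts.put("parsedDoc", "indexed-representation", Scope.SESSION);
        ts.put("tempVar", "per-request", Scope.REQUEST);
        ts.expire(Scope.REQUEST);                // end of request
        System.out.println(ts.get("tempVar"));   // prints "null"
        System.out.println(ts.get("parsedDoc")); // prints "indexed-representation"
    }
}
```

The point of the sketch is only the lifetime policy: session-scoped data (a DTM-like index, say) survives the end of the request that built it.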

yep, I see your point.

>>Still, even the jxpath has the pretty nasty problem of having 
>>to iterate over the whole stream of events to find out which 
>>one to substitute. Another performance problem, especially
>>for namespaced attributes which are very slow to process in 
>>SAX since they are not sent as events.
>>I really don't know, I think that, at this point, we need 
>>numbers to know what's really going on. numbers that compare 
>>an XSLT/document() approach against a jxpath approach.
>>Anybody want to volunteer to benchmark this? ;-)
> More fun to redesign the guts of Cocoon so that it has its own
> semi-persistent XML document database and to build our own parsers and
> serializers on top of that...  ;-)

What we really need is an XML database with the ability to provide a
virtualized XML view of the Cocoon object model, with an XQuery
implementation on top.

And this was my hope with Xindice and Slide as a reference
implementation of JSR 170... oh well.
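As a rough stand-in for what such a virtualized view would feel like, here is a sketch using the JDK's built-in XPath support over an in-memory DOM, playing the role XQuery would play over a real XML database. The `session` document and `VirtualView` class are invented for illustration; a real implementation would query live Cocoon objects, not a string.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import java.io.StringReader;

// Hypothetical sketch: querying an XML-ized view of session content
// with a path expression, instead of a method call on a taglib.
public class VirtualView {
    public static void main(String[] args) throws Exception {
        String session = "<session><user><name>peter</name></user></session>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(session)));
        String name = XPathFactory.newInstance().newXPath()
                .evaluate("/session/user/name", doc);
        System.out.println(name); // prints "peter"
    }
}
```

Whether this costs more than a plain method call on a taglib is exactly the benchmark question raised above.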

Peter, same for you: hold on to your RT for a future where we have more
time to discuss new things. We have a release to do now.

