cocoon-dev mailing list archives

From "Hunsberger, Peter" <Peter.Hunsber...@stjude.org>
Subject RE: [RT] the quest for the perfect template language
Date Thu, 10 Apr 2003 18:41:15 GMT
Stefano Mazzocchi <stefano@apache.org> wrote:

> on 4/9/03 10:40 PM Hunsberger, Peter wrote:
> 
> > Stefano Mazzocchi <stefano@apache.org> asked:
> > 
> > 
> >>So, this list seems full of XSLT lovers, then get your brain cells
> >>working: how do we sort the performance issues of document()?
> >>
> > I realize this isn't what you're asking for, but the following came up
> > on xml-dev the other day:
> > 
> > http://www-106.ibm.com/developerworks/xml/library/x-injava/index.html
> > 
> > It's a comparison of the performance of various parsers.
> > Interestingly enough (when considered with some of the other
> > discussion in this thread) a pull model parser (XPP) comes out on top
> > most of the time.
> 
> damn, you spoiled my future RT about pulling vs. pushing pipelines :-)

Maybe it's still needed...  I've got an RT fermenting on this whole issue;
see below:

> 
> > Isn't the document issue really attacked by treating it 
> > exactly as any other
> > *internal* Cocoon URI reference (via the URI resolver hook)? 
> 
> The problem is that you are pulling data from a stream that 
> gets pushed to you.
>
> This is the same impedance mismatch of JSP/velocity as 
> generators where a parser needs to be placed in between and 
> performance is degraded compared to a native-sax 
> push-oriented generation stage which is directly connected to 
> the pipe.

Once more Cocoon hits the bleeding edge: lazy evaluation

Since lazy evaluation is almost as much of a research topic as it is
anything else, this is as much a question as it is a proposal....  I think
the issue can be restated as follows (RT mixed in with RT, sorry):

Push vs. pull is the old space vs. time complexity trade-off dressed up as
XML parsing.  There is no single solution; only careful attention to design
trade-offs can find the answer for any given application.  However, in
general, the emerging answer for XML parsers appears to be lazy evaluation:
treat the tree as though it is fully parsed, but only do the work as needed.
It's a combination of push and pull, as demanded by the application.
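As a rough illustration of what I mean (plain Java, hypothetical types,
nothing Cocoon-specific): a node can look fully built to its consumer, while
the parsing/indexing work for a branch happens only the first time someone
actually descends it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Sketch of lazy tree evaluation: the node presents a fully-parsed facade,
// but children are only materialized on first access.
class LazyNode {
    private final String name;
    private final Supplier<List<LazyNode>> childSupplier;
    private List<LazyNode> children; // null until someone descends this branch

    LazyNode(String name, Supplier<List<LazyNode>> childSupplier) {
        this.name = name;
        this.childSupplier = childSupplier;
    }

    String getName() { return name; }

    // The "pull" side: parsing work happens only when a consumer asks for it.
    List<LazyNode> getChildren() {
        if (children == null) {
            children = childSupplier.get(); // parse/index this branch on demand
        }
        return children;
    }

    boolean isExpanded() { return children != null; }
}

public class LazyDemo {
    public static void main(String[] args) {
        LazyNode root = new LazyNode("root",
            () -> List.of(new LazyNode("child", ArrayList::new)));
        System.out.println(root.isExpanded());                    // no work done yet
        System.out.println(root.getChildren().get(0).getName());  // forces expansion
        System.out.println(root.isExpanded());
    }
}
```

The push side would sit behind the supplier, feeding events into the branch
as it is demanded.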
 
> If you do something like
> 
>   document(cocoon://whatever#//blah[foo='bar'])
> 
> you have to consume *ALL* the SAX events that are given to 
> you by the underlying URI.

Well, if you write that, you get what you deserve, even with pull parsing...

<snip on slightly skewed example/>
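The cost Stefano describes is easy to see with plain JAXP SAX (no Cocoon
classes assumed): even when the element the expression wants is the very
first thing in the document, the push model still delivers every remaining
event to the handler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: a SAX consumer answering a query over a pushed stream.  The match
// is found immediately, but the parser keeps pushing events regardless.
public class ConsumeAll {
    static class CountingHandler extends DefaultHandler {
        int events = 0;
        String firstMatch = null;

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            events++;
            if (firstMatch == null && "blah".equals(qName)) {
                firstMatch = atts.getValue("foo"); // found what we wanted...
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><blah foo='bar'/><a/><b/><c/><d/></root>";
        CountingHandler h = new CountingHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), h);
        // ...but the push model still delivered every startElement event:
        System.out.println(h.firstMatch + " after " + h.events + " events");
    }
}
```

A pull parser could have stopped after the first element; the push pipeline
cannot, which is exactly the mismatch.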
 
> This is where pull parsing would really rock, the problem is 
> that such pull parsing is, in fact, a small xml database.

Well, with lazy evaluation you only index a node as you hit it.  You only hit
a node if someone gives you a reason to descend a particular branch.  Where
this becomes a research topic is what you do with poorly formed documents:
AFAIK lazy evaluation only works *easily* if you can assume you can split up
the input stream using basic XML semantics, without detailed evaluation of
each and every node...

> And there might be a pretty big overhead in creating a small 
> database (say, the equivalent of Xalan DTM or even that one) 
> in order to facilitate indexing.
> 
> But maybe, that's exactly what Xalan does internally for the 
> document() function, I really don't know.
 
Well this brings up a whole 'nother issue: Xalan... (sigh)

> Still, my point remains: the underlying amount of work the 
> system has to do to come out with a simple variable using 
> document() is incredible compared to the use of a simple 
> method call of a taglib.
> 
> document() keeps on looking like a golden hammer antipattern to me.
 
Well, yes and no: if all you need is a "single" variable, then don't use
document().  However, how many times do you really need just a single
variable?  (More on that in a moment.)

> I think it would make perfect architectural sense as an 
> interface to access a real xml database, but for accessing 
> something like an xml-ized representation of session content, 
> well, I'm not sure.

Let me for a moment quote part of your "RE: [RT]  improving the session
concept" response:

> I found out that there are three different kinds of flows, not two as I
> previously assumed:
> 
>  1) fully stateless -> pure publishing
>  2) fully stateful -> strict flows
>  3) half/half -> everything else!!!
> 
> I'm pretty sure that 90% of web sites done in a modern technology belong
> to #3.

Ok, here goes: session, by its very nature, is persistent.  Thus, the index
isn't useful for just a single reference to a single variable.  In fact, I
will argue (based on your quote above) that this is true for 90% of the data
that Cocoon touches: session, flow and caching are all just different ways
of dealing with the same problem; you need to be able to reference some
amount of data across some period of time in such a way as to optimize the
performance of your application.  Now, one can argue that caching extends the
problem from a single user to multiple users (as opposed to flow and
session), but so what?  In many cases I want a piece of "my" data combined
with the shared data; it's really all one big pot with differing rules on
who can access what parts of it when.

What does this mean?  Really, Cocoon mostly operates where some amount of
data is hanging around, persisted in various ways and accessed in various
ways.  What would make more sense is, in fact, some kind of "transient"
storage database that gives a common method for handling all of this data
across the lifetimes required for it.  What would really rock is wiring the
parsers and serializers directly into this database.  In other words, if
something like Xalan builds us a nice DTM-indexed data representation, why
do we throw it all away only to have it built again moments later?  We need
a smart way of saying when to trash what, across users and across requests,
to optimize processing.  We also need normalized caching across all data in
order to optimize memory usage.  (Space and time optimization.)
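The shape of the idea, sketched in plain Java (all names hypothetical, not
Cocoon's store API): a shared cache that keeps an already-built document
model keyed by URI, with an explicit invalidation hook for the "when to
trash what" policy.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of the "don't throw the parse away" idea: a cache of built document
// models keyed by URI.  The parse function runs once per URI; later lookups
// reuse the model until some policy invalidates it.
class DocumentCache<D> {
    private final Map<String, D> byUri = new ConcurrentHashMap<>();
    private final Function<String, D> parser;
    int parses = 0; // how many times real parsing work was actually done

    DocumentCache(Function<String, D> parser) { this.parser = parser; }

    D get(String uri) {
        return byUri.computeIfAbsent(uri, u -> { parses++; return parser.apply(u); });
    }

    // The policy hook: callers decide when a cached model has gone stale.
    void invalidate(String uri) { byUri.remove(uri); }
}

public class CacheDemo {
    public static void main(String[] args) {
        DocumentCache<String> cache = new DocumentCache<>(u -> "model:" + u);
        cache.get("cocoon://session");
        cache.get("cocoon://session"); // second hit reuses the built model
        System.out.println(cache.parses);
    }
}
```

The hard part, of course, is the invalidation policy across users and
requests; the cache itself is the easy bit.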

> Still I see one big value in this: usability during 
> development. It's nice to divide your problem into different 
> pipelines because you can reuse them and tune them as you go 
> and look at them in your browser directly (or with views).
> 
> this is admittedly very attractive.
> 
> but I'm thinking that a jxpath-transformer alternative could 
> well be better.... even if, at that point, the similarity 
> between the jxpath syntax and xslt forces to do stuff like
> 
>  <img src="{id}/{string('{id}')}"/>
> 
> so that the first {id} represents the value of the 'id' 
> element of the input stream of events, and the second one is 
> escaped and further processed by the jxpath transformer which 
> is pipelined after the xslt one.

Don't forget about (the non-standard) xslt:evaluate possibility...

	<img src="{id}/{evaluate('id')}"/>
> 
> Still, even the jxpath has the pretty nasty problem of having 
> to iterate over the whole stream of events to find out which 
> one to substitute. Another performance problem, especially 
> for namespaced attributes which are very slow to process in 
> SAX since they are not sent as events.
> 
> I really don't know, I think that, at this point, we need 
> numbers to know what's really going on. numbers that compare 
> an XSLT/document() approach against a jxpath approach.
> 
> Anybody wants to volunteer to benchmark this ;-)
>
 
More fun to redesign the guts of Cocoon so that it has its own
semi-persistent XML document database, and to build our own parsers and
serializers on top of that...  ;-)




