cocoon-dev mailing list archives

From Simone Gianni <sim...@semeru.it>
Subject R: Re: [cocoon3] Stax Pipelines
Date Fri, 05 Dec 2008 12:41:45 GMT
Hi all, 
since Stax is an inversion of the call flow, what we have is an inversion of the advantages
and disadvantages we had with SAX. 

I'll try to explain it better. Suppose we have two schemas: one contains "LONG" elements,
with lots of children and stuff inside, the other contains "SHORT" elements, with just an
"id" attribute. Now suppose it is possible to translate from one to the other: for example,
LONG stuff could be stored in the database, with SHORT being a placeholder pointing
to the LONGs in the database.

Now, we want to write two transformers. One is SHORT to LONG, which will perform some selects
on the database and expand each SHORT into a LONG. The other one stores stuff in the database,
and converts LONG to SHORT.

As we all know (the i18n transformer is a good example), in SAX, transforming from LONG to
SHORT is a pain, because we need to keep state between multiple calls. In our example, if
the LONG to SHORT transformer is a SAX based one, we would need to buffer all the LONG content,
then store it in the DB and then emit a single SHORT. That buffering is our state.

Instead, this kind of transformation is quite easy in a Stax transformer, because when we encounter
a LONG we can just pull all the data we need, and perform everything we need to do in a single
method, without having to preserve state across different calls. Such a transformer in
Stax could be nearly stateless/threadsafe from an XML point of view (the database connection
would be state, but that's just for the sake of the example).
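A minimal sketch of the LONG to SHORT case with the standard javax.xml.stream API may make this concrete. The class name, the in-memory STORE list standing in for the database, and the choice to store only the text content of a LONG are all assumptions made for illustration; the sketch also copies only element names and text for everything outside a LONG.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

final class LongToShortStax {
    // Hypothetical stand-in for the database: stored LONG content, list index = id.
    static final List<String> STORE = new ArrayList<>();

    static String transform(String xml) {
        try {
            XMLStreamReader in =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            StringWriter sink = new StringWriter();
            XMLStreamWriter out = XMLOutputFactory.newInstance().createXMLStreamWriter(sink);
            while (in.hasNext()) {
                switch (in.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if ("LONG".equals(in.getLocalName())) {
                            writeShort(in, out); // consumes the whole LONG subtree
                        } else {
                            out.writeStartElement(in.getLocalName());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        out.writeEndElement();
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        out.writeCharacters(in.getText());
                        break;
                    default:
                        break; // other event types are ignored in this sketch
                }
            }
            out.flush();
            return sink.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Pulls every event up to the matching end of LONG in one method call.
    // The only "state" is local variables on the call stack.
    private static void writeShort(XMLStreamReader in, XMLStreamWriter out)
            throws XMLStreamException {
        StringBuilder content = new StringBuilder();
        int depth = 1;
        while (depth > 0) {
            int event = in.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            } else if (event == XMLStreamConstants.CHARACTERS) {
                content.append(in.getText());
            }
        }
        STORE.add(content.toString());
        out.writeEmptyElement("SHORT");
        out.writeAttribute("id", String.valueOf(STORE.size() - 1));
    }
}
```

Calling LongToShortStax.transform on a document containing one LONG element should yield the same document with a single empty SHORT element in its place, and one new entry in STORE.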

However, suppose we are doing the SHORT to LONG translation. In this case, using SAX is by
far simpler than Stax. In fact, when we encounter a SHORT, we can fetch stuff from the DB
and start bombing the next handler in the pipeline with elements as soon as they arrive from
the DB. Doing it in Stax instead would require us to keep state, because we would need to
buffer data from the DB, and serve that data to the subsequent calls from our Stax consumer
until the buffer is empty. Exactly the opposite problem of a SAX pipeline.
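The SAX side of this can be sketched with the standard XMLFilterImpl helper. The class name and the lookup method standing in for the database select are hypothetical; the point is only that the filter pushes all the LONG events downstream from inside one callback, so no state survives between calls.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

final class ShortToLongSax extends XMLFilterImpl {
    // Hypothetical stand-in for a database select: child element names of a LONG.
    static List<String> lookup(String id) {
        return "7".equals(id) ? List.of("title", "body") : List.of();
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (!"SHORT".equals(qName)) {
            super.startElement(uri, local, qName, atts);
            return;
        }
        // Push the expanded content to the next handler right away.
        Attributes none = new AttributesImpl();
        super.startElement("", "LONG", "LONG", none);
        for (String child : lookup(atts.getValue("id"))) {
            super.startElement("", child, child, none);
            super.endElement("", child, child);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if ("SHORT".equals(qName)) {
            super.endElement("", "LONG", "LONG");
        } else {
            super.endElement(uri, local, qName);
        }
    }

    // Runs the filter over a document and records the element events seen downstream.
    static List<String> expand(String xml) {
        List<String> seen = new ArrayList<>();
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            ShortToLongSax filter = new ShortToLongSax();
            filter.setParent(parser);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String u, String l, String q, Attributes a) {
                    seen.add("<" + q + ">");
                }

                @Override
                public void endElement(String u, String l, String q) {
                    seen.add("</" + q + ">");
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }
}
```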

The SAX part of this example is nothing new to Cocoon. We already have an infrastructure
for buffering SAX events when we need to in our transformers, in extremis even building a
DOM out of it (which we could consider the most versatile and expensive form of buffering).
Couldn't we just provide such a buffer for those Stax based transformers when they need it?

This would be an intermediate solution, because there would be an easy way to keep the state
between Stax calls (as it was for SAX, but the opposite way around), it would still be a pure
Stax based pipeline, buffering would be limited to the bare minimum required by the transformer,
and it could be avoided altogether by reimplementing the transformer with more complex state
logic if needed for performance reasons.
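To illustrate what such a buffer could look like on the pull side, here is a deliberately simplified sketch. It does not implement XMLStreamReader: events are plain strings ("SHORT:7", "LONG", "/LONG", ...) and the class name, the event encoding and the fetch method are all assumptions made for the example. The pending queue is exactly the state that has to survive between calls from the consumer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

final class BufferedExpander implements Iterator<String> {
    private final Iterator<String> upstream;
    // Events already fetched from the "database" but not yet pulled by the consumer.
    private final Deque<String> pending = new ArrayDeque<>();

    BufferedExpander(Iterator<String> upstream) {
        this.upstream = upstream;
    }

    @Override
    public boolean hasNext() {
        return !pending.isEmpty() || upstream.hasNext();
    }

    @Override
    public String next() {
        if (!pending.isEmpty()) {
            return pending.poll(); // serve buffered events first
        }
        String event = upstream.next();
        if (!event.startsWith("SHORT:")) {
            return event; // pass everything else through unchanged
        }
        // Expand the placeholder: buffer the whole LONG, hand out one event per call.
        pending.add("LONG");
        pending.addAll(fetch(event.substring("SHORT:".length())));
        pending.add("/LONG");
        return pending.poll();
    }

    // Hypothetical stand-in for a database select.
    static List<String> fetch(String id) {
        return "7".equals(id) ? List.of("title", "body") : List.of();
    }

    // Convenience driver: pulls every event through the expander.
    static List<String> drain(List<String> events) {
        List<String> out = new ArrayList<>();
        new BufferedExpander(events.iterator()).forEachRemaining(out::add);
        return out;
    }
}
```

A shared helper of this kind would let each transformer keep its own logic in a single method and delegate the "serve the rest later" part to the buffer.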

This is not a solution to the SAX<->Stax cooperation problem, but my two cents on the
"Is implementing a Stax based transformer easier or more complicated than a Sax one" discussion
:) 

Simone 



----- Original message -----
From: Sylvain Wallez <sylvain@apache.org>
To: dev@cocoon.apache.org
Sent: Tuesday, 2 December 2008 17:16:25 GMT+0100 Europe/Berlin
Subject: Re: [cocoon3] Stax Pipelines

Reinhard Pötz wrote: 
> I've had Stax pipelines on my radar for a rather long time because I 
> think that Stax can simplify the writing of transformers a lot. 
> I proposed this idea to Alexander Schatten, an assistant professor at 
> the Vienna University of Technology and he then proposed it to his 
> students. 
> 
> A group of four students accepted to work on this as part of their 
> studies. Steven and I are coaching this group from October to January 
> and the goal is to support Stax pipeline components in Cocoon 3. 
> 
> So far the students learned more about Cocoon 3, Sax, Stax and did some 
> performance comparisons. This week we've entered the phase where the 
> students have to work on the actual Stax pipeline implementation. 
> 
> I asked the students to introduce themselves and also to present the 
> current ideas of how to implement Stax pipelines. So Andreas, Killian, 
> Michael and Jakob, the floor is yours! 
> 

I have spent some cycles on this subject and came to the surprising 
conclusion that writing Stax _pipelines_ is actually rather complex. 

A Stax transformer pulls events from the previous component in the
pipeline, which removes the need for the complex state machinery often
needed for SAX (push) transformers by turning it into a simple
function call stack and local variables. This is the main interest of
Stax vs SAX.

But how does a transformer expose its result to the next component in 
the chain so that this next component can also pull events in the Stax 
style? 

When it produces an event, a Stax transformer should put this event 
somewhere so that it can be pulled and processed by the next component. 
But pulling also means the transformer does not suspend its execution 
since it continues pulling events from the previous component. This is 
actually reflected in the Stax API which provides a pull-based 
XMLStreamReader, but only a very SAX-like XMLStreamWriter. 

So a Stax transformer is actually a pull input / push output component. 

To allow the next component in the pipeline to be pull-based as well, there
are 3 solutions (at least this is what I came up with):

Buffering 
--------- 
The XMLStreamWriter that the transformer writes to buffers all events
in a data structure similar to our XMLByteStreamCompiler, which can be
used as an XMLStreamReader by the next component in the chain. The
pipeline object then has to call some execute() method on every
component in the pipeline in sequence, after having provided them with
the proper buffer-based reader and writer.

Execution is single-threaded, which fits well with all the non
threadsafe classes and threadlocals we usually have in web applications,
but it requires buffering and thus somewhat defeats the purpose of
stream-based processing, and can simply make it impossible to process
large documents.

Note however that because it is single-threaded, we can work with two 
buffers (one for input, one for output) that are reused whatever the 
number of components in the pipeline. 
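The two-buffer scheme can be sketched as follows. Events are plain strings and the Component interface with its execute() method is a hypothetical stand-in, not a Cocoon API; the sketch only shows how the pipeline runs the components in sequence and swaps the roles of the two buffers after each one.

```java
import java.util.ArrayList;
import java.util.List;

final class BufferedPipeline {
    // Hypothetical component contract: read every event from in, write results to out.
    interface Component {
        void execute(List<String> in, List<String> out);
    }

    static List<String> run(List<String> source, List<Component> components) {
        List<String> input = new ArrayList<>(source);
        List<String> output = new ArrayList<>();
        for (Component component : components) {
            output.clear();                   // reuse the buffer from two steps ago
            component.execute(input, output); // runs to completion before the next one
            List<String> swap = input;        // this output becomes the next input
            input = output;
            output = swap;
        }
        return input;
    }
}
```

Whatever the number of components, only two lists are ever allocated, but each of them has to hold a complete intermediate document.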

Multithreading 
-------------- 
Each component of the pipeline runs in a separate thread, and writes its 
output into an event queue that is consumed asynchronously by the next 
component in the pipeline. The event queue is presented as an 
XMLStreamReader to the next component. 

This approach requires very little buffering (and we can even have an
upper bound on the event queue size). It also makes good use of the parallel
processing capabilities of multi-core CPUs, although in web apps the
parallelism is also handled by concurrent http requests. This is
typically the approach that would be used with Erlang or Scala actors.

Multithreading has some issues though, since the servlet API more or 
less implies that a single thread processes the request and we may have 
some concurrency issues. Web app developers also take single threading 
as a basic assumption and use threadlocals here and there. 

This approach also prevents the reuse of char[] buffers as is usually
done by XML parsers, since events are processed asynchronously. All char[]
have to be copied, but this is a minor issue.
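As a rough illustration of the bounded event queue, here is a sketch using java.util.concurrent. Events are plain strings, the upper-casing stands in for whatever a real component would do, and the class name and the END sentinel are assumptions of the example; a real implementation would wrap the queue in an XMLStreamReader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class QueuedPipeline {
    // Sentinel marking the end of the stream; compared by identity, never by content.
    private static final String END = new String("<end>");

    static List<String> run(List<String> source, int capacity) {
        // The capacity is the upper bound on buffered events.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);

        // Producer component: runs in its own thread and blocks when the queue is full.
        Thread producer = new Thread(() -> {
            try {
                for (String event : source) {
                    queue.put(event.toUpperCase());
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consumer component: pulls events, blocking while the queue is empty.
        List<String> out = new ArrayList<>();
        try {
            for (String event = queue.take(); event != END; event = queue.take()) {
                out.add(event);
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return out;
    }
}
```

With a capacity of 1 the two threads run almost in lockstep; a larger capacity trades memory for fewer context switches.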

Continuations 
------------- 
When a transformer sends an event to the next component in the chain, 
its execution is suspended and captured in a continuation. The 
continuation of the next pipeline component is resumed until it has 
consumed the event. We then switch back to the current component until 
it produces an event, etc, etc. 

This approach is single-threaded and so avoids the concurrency issues
mentioned above, and also avoids buffering. But there is certainly a
high overhead with the large number of continuation captures/resumes.
This number can be reduced though if we have some level of buffering to
allow processing of several events in one capture/resume cycle.

It also requires all the bytecode of transformers to be instrumented for
continuations, which in itself adds quite some memory and processing
overhead. Torsten also posted on this subject quite long ago [1].


Conclusion 
---------- 
All things considered, I came to the conclusion that a full Stax
pipeline either requires buffering to be reliable (but then we're no
longer streaming), or requires very careful inspection of all components
for multi-threading issues.

So in the end, Stax probably has to be considered as a helper _inside_ a 
component to ease processing : buffer all SAX input, then pull the 
received events to avoid complex state automata. 

Looks like I'm in a "long mail" period and I hope I haven't lost anybody 
here :-) 

So, what do you think? 

Sylvain 

[1] http://vafer.org/blog/20060807003609 

-- 
Sylvain Wallez - http://bluxte.net 

