cocoon-dev mailing list archives

From Andreas Pieber <andreas.pie...@schmutterer-partner.at>
Subject Re: [cocoon3] Stax Pipelines
Date Tue, 02 Dec 2008 19:18:07 GMT
First of all, my name is Andreas and I'm one of the students working on the StAX
implementation for Cocoon. So hello from my colleagues and me.

Secondly, this is my first post ever to the mailing list of an open source
project, and it is such a long post to answer. Thank you Sylvain ;) Nevertheless
I'm going to try my best.

We (when I say we, I mean us students, strongly influenced by Reinhard and
Steven :)) also thought about the problems you describe and came to the same
conclusion. Therefore we're trying another approach: pulling StAX XMLEvents
through the entire pipeline from the end.

In other words, if we have a simple pipe of the following form:

Producer - Transformer - Serializer

the Serializer would have in its start method some code like:

while(parent.hasNext()){
	xmlOutputWriter.add(parent.getNext());
}

This retrieves the next event from the Transformer and writes it into an
XmlOutputWriter. The Transformer in turn calls the getNext method on the
Starter (the Producer in this case), which retrieves the XmlEvents directly
from the XmlInputReader.
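
A minimal sketch of this pull loop using the standard javax.xml.stream event
API. It assumes every component is exposed as an XMLEventReader, which is an
assumption and not necessarily how the prototype is built; the class and method
names (PullSerializerSketch, serialize, roundTrip) are hypothetical. Note that
the standard reader method is nextEvent() rather than getNext().

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;

// Hypothetical serializer: the last component drives the whole pipe by pulling.
class PullSerializerSketch {

    // Pulls one event at a time from the upstream component and writes it out.
    static void serialize(XMLEventReader parent, XMLEventWriter out) throws XMLStreamException {
        while (parent.hasNext()) {
            out.add(parent.nextEvent());
        }
        out.flush();
    }

    // Producer -> Serializer with no transformer in between, for illustration only.
    static String roundTrip(String xml) {
        try {
            XMLEventReader producer =
                    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            StringWriter sink = new StringWriter();
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(sink);
            serialize(producer, writer);
            writer.close();
            return sink.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<doc><item>text</item></doc>"));
    }
}
```

Because the serializer only ever asks for the next event, no component upstream
does any work until it is pulled.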

In this approach the Transformer needs (of course) some kind of buffer, since in
response to a single event from the parent the transformer may produce a lot of
new content. That content is retrieved only one event at a time, each time the
next pipeline component calls getNext, which is why some kind of buffer is
needed.

Of course this buffer and some more helper code have to be provided to avoid
code duplication and to help the developer.
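
To illustrate why the buffer is needed, here is a rough sketch of a pull
transformer built on javax.xml.stream.util.EventReaderDelegate. The
transformation itself (emitting an extra, made-up "separator" element after
every "item" end tag) and all class and method names are hypothetical; the only
point is that one pulled event can turn into several buffered ones.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import javax.xml.stream.util.EventReaderDelegate;

// Hypothetical transformer: pulls from its parent, hands out events one by one.
class BufferingTransformerSketch extends EventReaderDelegate {
    private final Deque<XMLEvent> buffer = new ArrayDeque<>();
    private final XMLEventFactory events = XMLEventFactory.newInstance();

    BufferingTransformerSketch(XMLEventReader parent) {
        super(parent);
    }

    @Override
    public boolean hasNext() {
        return !buffer.isEmpty() || getParent().hasNext();
    }

    @Override
    public XMLEvent nextEvent() throws XMLStreamException {
        if (buffer.isEmpty()) {
            fill();
        }
        return buffer.removeFirst();
    }

    @Override
    public XMLEvent peek() throws XMLStreamException {
        if (buffer.isEmpty() && getParent().hasNext()) {
            fill();
        }
        return buffer.peekFirst();
    }

    @Override
    public Object next() {
        try {
            return nextEvent();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Pulls exactly one event from the parent and buffers everything it expands to.
    private void fill() throws XMLStreamException {
        XMLEvent event = getParent().nextEvent();
        buffer.addLast(event);
        if (event.isEndElement()
                && "item".equals(event.asEndElement().getName().getLocalPart())) {
            buffer.addLast(events.createStartElement("", "", "separator"));
            buffer.addLast(events.createEndElement("", "", "separator"));
        }
    }

    // Helper for trying it out: lists the element tags seen after the transformer.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLEventReader producer =
                    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            XMLEventReader pipe = new BufferingTransformerSketch(producer);
            while (pipe.hasNext()) {
                XMLEvent e = pipe.nextEvent();
                if (e.isStartElement()) {
                    names.add("<" + e.asStartElement().getName().getLocalPart() + ">");
                } else if (e.isEndElement()) {
                    names.add("</" + e.asEndElement().getName().getLocalPart() + ">");
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<doc><item/><item/></doc>"));
    }
}
```

The buffer never holds more than the output of a single input event, so memory
use stays bounded by the largest expansion rather than by the document size.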

One big "problem" with this approach is that the "flow direction of events" is
completely inverted. This means that StAX and SAX components would not be able
to work together "directly". But a push-pull approach would also need a
conversion between StAX and SAX events, and furthermore this problem could be
tackled by writing wrappers or adapters around the SAX components and adding
them to a StAX pipe.
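
As a rough idea of what one half of such an adapter could look like, the sketch
below pulls StAX events and replays them as SAX callbacks on a ContentHandler.
All names are hypothetical, namespace prefix mappings, comments and processing
instructions are left out, and the way back (collecting the SAX component's
output into pullable events) would still need a buffer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical adapter: pull side in, SAX (push) side out.
class StaxToSaxSketch {

    // Pulls every event from the parent and pushes it into the SAX component.
    static void replay(XMLEventReader parent, ContentHandler target)
            throws XMLStreamException, SAXException {
        while (parent.hasNext()) {
            XMLEvent event = parent.nextEvent();
            if (event.isStartDocument()) {
                target.startDocument();
            } else if (event.isEndDocument()) {
                target.endDocument();
            } else if (event.isStartElement()) {
                StartElement start = event.asStartElement();
                AttributesImpl attrs = new AttributesImpl();
                Iterator<?> it = start.getAttributes();
                while (it.hasNext()) {
                    Attribute a = (Attribute) it.next();
                    QName n = a.getName();
                    attrs.addAttribute(n.getNamespaceURI(), n.getLocalPart(),
                            qualified(n), "CDATA", a.getValue());
                }
                QName name = start.getName();
                target.startElement(name.getNamespaceURI(), name.getLocalPart(),
                        qualified(name), attrs);
            } else if (event.isEndElement()) {
                QName name = event.asEndElement().getName();
                target.endElement(name.getNamespaceURI(), name.getLocalPart(), qualified(name));
            } else if (event.isCharacters()) {
                char[] text = event.asCharacters().getData().toCharArray();
                target.characters(text, 0, text.length);
            }
        }
    }

    private static String qualified(QName name) {
        String prefix = name.getPrefix();
        return prefix.isEmpty() ? name.getLocalPart() : prefix + ":" + name.getLocalPart();
    }

    // Helper for trying it out: records the SAX callbacks a component would receive.
    static List<String> trace(String xml) {
        final List<String> calls = new ArrayList<>();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                StringBuilder sb = new StringBuilder("start " + qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    sb.append(' ').append(atts.getQName(i)).append('=').append(atts.getValue(i));
                }
                calls.add(sb.toString());
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                calls.add("end " + qName);
            }
            @Override
            public void characters(char[] ch, int offset, int length) {
                calls.add("text " + new String(ch, offset, length));
            }
        };
        try {
            replay(XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml)),
                    recorder);
        } catch (XMLStreamException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(trace("<doc id=\"1\">hi</doc>"));
    }
}
```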

At the moment we're developing a prototype for such a "pull only pipe" to get 
some experience with it.

I hope I was able to get the gist of our thoughts across. So, what do you think?

Andreas

On Tuesday 02 December 2008 17:16:25 Sylvain Wallez wrote:
> Reinhard Pötz wrote:
> > I've had Stax pipelines on my radar for a rather long time because I
> > think that Stax can simplify the writing of transformers a lot.
> > I proposed this idea to Alexander Schatten, an assistant professor at
> > the Vienna University of Technology and he then proposed it to his
> > students.
> >
> > A group of four students agreed to work on this as part of their
> > studies. Steven and I are coaching this group from October to January
> > and the goal is to support Stax pipeline components in Cocoon 3.
> >
> > So far the students learned more about Cocoon 3, Sax, Stax and did some
> > performance comparisons. This week we've entered the phase where the
> > students have to work on the actual Stax pipeline implementation.
> >
> > I asked the students to introduce themselves and also to present the
> > current ideas of how to implement Stax pipelines. So Andreas, Killian,
> > Michael and Jakob, the floor is yours!
>
> I have spent some cycles on this subject and came to the surprising
> conclusion that writing Stax _pipelines_ is actually rather complex.
>
> A Stax transformer pulls events from the previous component in the
> pipeline, which removes the need for the complex state machinery often
> needed for SAX (push) transformers by transforming it into a simple
> function call stack and local variables. This is the main interest of
> Stax vs SAX.
>
> But how does a transformer expose its result to the next component in
> the chain so that this next component can also pull events in the Stax
> style?
>
> When it produces an event, a Stax transformer should put this event
> somewhere so that it can be pulled and processed by the next component.
> But pulling also means the transformer does not suspend its execution
> since it continues pulling events from the previous component. This is
> actually reflected in the Stax API which provides a pull-based
> XMLStreamReader, but only a very SAX-like XMLStreamWriter.
>
> So a Stax transformer is actually a pull input / push output component.
>
> To allow the next component in the pipeline to be also pull-based, there
> are 3 solutions (at least this is what I came up with):
>
> Buffering
> ---------
> The XMLStreamWriter where the transformer writes to buffers all events
> in a data structure similar to our XMLByteStreamCompiler, that can be
> used as a XMLStreamReader by the next component in the chain. The
> pipeline object then has to call some execute() method on every
> component in the pipeline in sequence, after having provided them with
> the proper buffer-based reader and writer.
>
> Execution is single-threaded, which fits well with all the non
> threadsafe classes and threadlocals we usually have in web applications,
> but requires buffering and thus somewhat defeats the purpose of
> stream-based processing and can make it simply impossible to process
> large documents.
>
> Note however that because it is single-threaded, we can work with two
> buffers (one for input, one for output) that are reused whatever the
> number of components in the pipeline.
>
> Multithreading
> --------------
> Each component of the pipeline runs in a separate thread, and writes its
> output into an event queue that is consumed asynchronously by the next
> component in the pipeline. The event queue is presented as an
> XMLStreamReader to the next component.
>
> This approach requires very little buffering (and we can even have an
> upper bound on the event queue size). It also makes good use of the
> parallel processing capabilities of multi-core CPUs, although in web apps
> the parallelism is also handled by concurrent http requests. This is
> typically the approach that would be used with Erlang or Scala actors.
>
> Multithreading has some issues though, since the servlet API more or
> less implies that a single thread processes the request and we may have
> some concurrency issues. Web app developers also take single threading
> as a basic assumption and use threadlocals here and there.
>
> This approach also prevents the reuse of char[] buffers as is usually
> done by XML parsers since events are processed asynchronously. All char[]
> have to be copied, but this is a minor issue.
>
> Continuations
> -------------
> When a transformer sends an event to the next component in the chain,
> its execution is suspended and captured in a continuation. The
> continuation of the next pipeline component is resumed until it has
> consumed the event. We then switch back to the current component until
> it produces an event, etc, etc.
>
> This approach is single-threaded and so avoids the concurrency issues
> mentioned above, and also avoids buffering. But there is certainly a
> high overhead with the large number of continuation capturing/resuming.
> This number can be reduced though if we have some level of buffering to
> allow processing of several events in one capture/resume cycle.
>
> It also requires all the bytecode of transformers to be instrumented for
> continuations, which in itself adds quite some memory and processing
> overhead. Torsten also posted on this subject quite long ago [1].
>
>
> Conclusion
> ----------
> All things considered, I came to the conclusion that a full Stax
> pipeline either requires buffering to be reliable (but then we're no
> longer streaming), or requires very careful inspection of all components
> for multi-threading issues.
>
> So in the end, Stax probably has to be considered as a helper _inside_ a
> component to ease processing : buffer all SAX input, then pull the
> received events to avoid complex state automata.
>
> Looks like I'm in a "long mail" period and I hope I haven't lost anybody
> here :-)
>
> So, what do you think?
>
> Sylvain
>
> [1] http://vafer.org/blog/20060807003609

-- 
SCHMUTTERER+PARTNER Information Technology GmbH

Hiessbergergasse 1
A-3002 Purkersdorf

T   +43 (0) 69911127344
F   +43 (2231) 61899-99
mail to: andreas.pieber@schmutterer-partner.at
