From: Andreas Pieber
Organization: SCHMUTTERER+PARTNER Information Technology GmbH
To: dev@cocoon.apache.org
Subject: Re: [cocoon3] Stax Pipelines
Date: Tue, 2 Dec 2008 20:18:07 +0100

First of all, my name is Andreas and I'm one of the students working on the
StAX implementation for Cocoon. So hello from my colleagues and me.
Secondly, this is my first post ever to the mailing list of an open source
project, and it is such a long post to answer. Thank you, Sylvain ;)
Nevertheless I'm going to try my best.

We (when I say we, I mean us students, strongly influenced by Reinhard and
Steven :)) also thought about the problems you describe and came to the same
conclusion. Therefore we're trying another approach: pulling StAX-XmlEvents
through the entire pipeline from the end.

In other words, if we have a simple pipe of the following form:

    Producer - Transformer - Serializer

the Serializer would have some code like this in its start method:

    while (parent.hasNext()) {
        xmlOutputWriter.add(parent.getNext());
    }

This retrieves the next event from the Transformer (in this case) and writes
it into an XmlOutputWriter. The Transformer in turn calls the getNext method
on the Starter (in this case), which retrieves the XmlEvents directly from
the XmlInputReader.

In this approach the Transformer (of course) needs some kind of buffer,
since in response to one event from its parent the Transformer may produce a
lot of new content. That content is retrieved only one event at a time, each
time the next pipeline component calls getNext, which is why some kind of
buffer is needed.

Of course this buffer and some more helper code have to be provided to avoid
code duplication and to help the developer.

One big "problem" with this approach is that the "flow direction of events"
is completely inverted. This means that StAX and SAX components would not be
able to work "directly" together. But a push-pull approach also requires a
conversion between StAX and SAX events, and furthermore this problem could
be tackled by writing wrappers or adapters around the SAX components and
adding them to a StAX pipe.

At the moment we're developing a prototype for such a "pull only pipe" to
get some experience with it.
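To make the idea more concrete, here is a minimal Java sketch of such a pull
only pipe. It is only an illustration under assumptions, not the prototype
itself: the names PullComponent, Starter, SeparatorTransformer and
PullOnlyPipe are made up for this example, and the <sep> element the
transformer adds is arbitrary. It uses the standard javax.xml.stream event
API (XMLEventReader, XMLEventWriter, XMLEventFactory) in place of the
XmlInputReader and XmlOutputWriter named above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

final class PullOnlyPipe {

    /** Hypothetical contract: every component is pulled by its successor. */
    interface PullComponent {
        boolean hasNext() throws XMLStreamException;
        XMLEvent getNext() throws XMLStreamException;
    }

    /** Starter: hands out events straight from a StAX event reader. */
    static final class Starter implements PullComponent {
        private final XMLEventReader reader;

        Starter(XMLEventReader reader) { this.reader = reader; }

        public boolean hasNext() { return reader.hasNext(); }

        public XMLEvent getNext() throws XMLStreamException {
            return reader.nextEvent();
        }
    }

    /** Transformer: adds an empty sep element after every closing item tag. */
    static final class SeparatorTransformer implements PullComponent {
        private final PullComponent parent;
        // One pulled event can yield several output events, hence the buffer.
        private final Deque<XMLEvent> buffer = new ArrayDeque<>();
        private final XMLEventFactory events = XMLEventFactory.newInstance();

        SeparatorTransformer(PullComponent parent) { this.parent = parent; }

        public boolean hasNext() throws XMLStreamException {
            return !buffer.isEmpty() || parent.hasNext();
        }

        public XMLEvent getNext() throws XMLStreamException {
            if (buffer.isEmpty()) {
                // Pull from the parent only once the buffer has been drained.
                XMLEvent event = parent.getNext();
                buffer.add(event);
                if (event.isEndElement() && "item".equals(
                        event.asEndElement().getName().getLocalPart())) {
                    buffer.add(events.createStartElement("", "", "sep"));
                    buffer.add(events.createEndElement("", "", "sep"));
                }
            }
            return buffer.poll();
        }
    }

    /** Serializer: drives the whole pipe by pulling from its parent. */
    static String serialize(PullComponent parent) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLEventWriter writer =
                XMLOutputFactory.newInstance().createXMLEventWriter(out);
        while (parent.hasNext()) {
            writer.add(parent.getNext());
        }
        writer.close();
        return out.toString();
    }

    /** Wires Starter -> SeparatorTransformer -> serializer for a string. */
    static String run(String xml) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        return serialize(new SeparatorTransformer(new Starter(reader)));
    }
}
```

Calling PullOnlyPipe.run("<list><item>a</item></list>") should return the
input document with a sep element after the item. Nothing happens until the
serializer loop asks for an event, and the transformer's buffer never holds
more than the events produced from a single parent event.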
I hope I was able to convey the core of our thoughts. So, what do you think?

Andreas

On Tuesday 02 December 2008 17:16:25 Sylvain Wallez wrote:
> Reinhard Pötz wrote:
> > I've had Stax pipelines on my radar for a rather long time because I
> > think that Stax can simplify the writing of transformers a lot.
> > I proposed this idea to Alexander Schatten, an assistant professor at
> > the Vienna University of Technology and he then proposed it to his
> > students.
> >
> > A group of four students agreed to work on this as part of their
> > studies. Steven and I are coaching this group from October to January
> > and the goal is to support Stax pipeline components in Cocoon 3.
> >
> > So far the students learned more about Cocoon 3, Sax, Stax and did some
> > performance comparisons. This week we've entered the phase where the
> > students have to work on the actual Stax pipeline implementation.
> >
> > I asked the students to introduce themselves and also to present the
> > current ideas of how to implement Stax pipelines. So Andreas, Killian,
> > Michael and Jakob, the floor is yours!
>
> I have spent some cycles on this subject and came to the surprising
> conclusion that writing Stax _pipelines_ is actually rather complex.
>
> A Stax transformer pulls events from the previous component in the
> pipeline, which removes the need for the complex state machinery often
> needed for SAX (push) transformers by transforming it into a simple
> function call stack and local variables. This is the main interest of
> Stax vs SAX.
>
> But how does a transformer expose its result to the next component in
> the chain so that this next component can also pull events in the Stax
> style?
>
> When it produces an event, a Stax transformer should put this event
> somewhere so that it can be pulled and processed by the next component.
> But pulling also means the transformer does not suspend its execution
> since it continues pulling events from the previous component.
> This is actually reflected in the Stax API which provides a pull-based
> XMLStreamReader, but only a very SAX-like XMLStreamWriter.
>
> So a Stax transformer is actually a pull input / push output component.
>
> To allow the next component in the pipeline to be also pull-based, there
> are 3 solutions (at least this is what I came up with):
>
> Buffering
> ---------
> The XMLStreamWriter that the transformer writes to buffers all events
> in a data structure similar to our XMLByteStreamCompiler, which can be
> used as an XMLStreamReader by the next component in the chain. The
> pipeline object then has to call some execute() method on every
> component in the pipeline in sequence, after having provided them with
> the proper buffer-based reader and writer.
>
> Execution is single-threaded, which fits well with all the non
> threadsafe classes and threadlocals we usually have in web applications,
> but requires buffering and thus somehow defeats the purpose of
> stream-based processing and can simply make it impossible to process
> large documents.
>
> Note however that because it is single-threaded, we can work with two
> buffers (one for input, one for output) that are reused whatever the
> number of components in the pipeline.
>
> Multithreading
> --------------
> Each component of the pipeline runs in a separate thread, and writes its
> output into an event queue that is consumed asynchronously by the next
> component in the pipeline. The event queue is presented as an
> XMLStreamReader to the next component.
>
> This approach requires very little buffering (and we can even have an
> upper bound on the event queue size). It also makes good use of the
> parallel processing capabilities of multi-core CPUs, although in web
> apps the parallelism is also handled by concurrent http requests. This
> is typically the approach that would be used with Erlang or Scala actors.
>
> Multithreading has some issues though, since the servlet API more or
> less implies that a single thread processes the request and we may have
> some concurrency issues. Web app developers also take single threading
> as a basic assumption and use threadlocals here and there.
>
> This approach also prevents the reuse of char[] buffers as is usually
> done by XML parsers since events are processed asynchronously. All char[]
> have to be copied, but this is a minor issue.
>
> Continuations
> -------------
> When a transformer sends an event to the next component in the chain,
> its execution is suspended and captured in a continuation. The
> continuation of the next pipeline component is resumed until it has
> consumed the event. We then switch back to the current component until
> it produces an event, etc, etc.
>
> This approach is single-threaded and so avoids the concurrency issues
> mentioned above, and also avoids buffering. But there is certainly a
> high overhead with the large number of continuation capturing/resuming.
> This number can be reduced though if we have some level of buffering to
> allow processing of several events in one capture/resume cycle.
>
> It also requires all the bytecode of transformers to be instrumented for
> continuations, which in itself adds quite some memory and processing
> overhead. Torsten also posted on this subject quite a while ago [1].
>
>
> Conclusion
> ----------
> All things considered, I came to the conclusion that a full Stax
> pipeline either requires buffering to be reliable (but we're no longer
> streaming), or requires very careful inspection of all components for
> multi-threading issues.
>
> So in the end, Stax probably has to be considered as a helper _inside_ a
> component to ease processing: buffer all SAX input, then pull the
> received events to avoid complex state automata.
>
> Looks like I'm in a "long mail" period and I hope I haven't lost anybody
> here :-)
>
> So, what do you think?
>
> Sylvain
>
> [1] http://vafer.org/blog/20060807003609

-- 
SCHMUTTERER+PARTNER Information Technology GmbH

Hiessbergergasse 1
A-3002 Purkersdorf

T +43 (0) 69911127344
F +43 (2231) 61899-99

mail to: andreas.pieber@schmutterer-partner.at