Message-ID: <49573D2F.4050307@indoqa.com>
Date: Sun, 28 Dec 2008 09:47:43 +0100
From: Steven Dolg
To: dev@cocoon.apache.org
Subject: Re: [C3] StAX research reveiled!
References: <49520644.30108@gmail.com> <4956B9B2.7050300@apache.org> <495725CD.9050809@indoqa.com> <200812280836.51897.andreas.pieber@schmutterer-partner.at>
In-Reply-To: <200812280836.51897.andreas.pieber@schmutterer-partner.at>

Andreas Pieber wrote:
> On Sunday 28 December 2008 08:07:57 Steven Dolg wrote:
>
>> Sylvain Wallez wrote:
>>
>>> Andreas Pieber wrote:
>>>
>>>> On Saturday 27 December 2008 10:36:07 Sylvain Wallez wrote:
>>>>
>>>>> Michael Seydl wrote:
>>>>>
>>>>>> Hi all!
>>>>>>
>>>>>> One more mail from the student group! Behind this lurid topic hides our evaluation of the latest XML processing technologies regarding their usability in Cocoon3 (especially whether they are suited for use in a streaming pipeline).
>>>>>> As is commonly known, we decided to use StAX as our weapon of choice for the XML processing, but this paper should explain the whys and hows, and especially the path we took to reach our decision, which resulted in us choosing that very API.
>>>>>> Eleven pages might be a bit much to read, but it contains all the necessary links to the APIs we evaluated, as well as our two cents on each of the APIs we looked at. Finally, we also tried to show the difference between the currently used SAX API and the StAX API we are proposing.
>>>>>>
>>>>>> I hope this work sheds some light on our decision making, and that someone dares to read it.
>>>>>>
>>>>>> That's it from me; I wish you all a pleasant and very merry Christmas!
>>>>>>
>>>>>> Regards,
>>>>>> Michael Seydl
>>>>>>
>>>>> Good work and an interesting read, but I don't agree with some of its statements!
>>>>>
>>>>> The big if/else or switch statements mentioned as a drawback of the cursor API (XMLStreamReader) in 1.2.4 also apply to the event API, since it provides abstract events whose type also needs to be inspected to decide what to do.
>>>>>
>>>> Of course, you're right!
>>>>
>>>>> The drawbacks of the stream API compared to the event API are, as you mention, that some methods of XMLStreamReader will throw an exception depending on the current event's type, and that the event is not represented as a data structure that can be passed directly to the next element in the pipeline or stored in an event buffer.
>>>>>
>>>>> The first point (exceptions) should not happen unless the code is buggy and tries to get information that doesn't belong to the context. I have used the cursor API many times and haven't found any usability problems with it.
>>>>>
>>>> You're right here as well, but IMHO it is not necessary to add another source of bugs if it can be avoided...
>>>>
>>> Well, there are so many other sources of bugs... I wouldn't sacrifice efficiency for bad usage of an API. And when dealing with XML, people should know that e.g. calling getAttribute() for a text event is meaningless.
>>>
>>>>> The second point (lack of a data structure) can easily be solved by using an XMLEventAllocator [1] that creates an XMLEvent from the current state of an XMLStreamReader.
>>>>>
>>>> Mhm, but if we use an XMLEventAllocator, why not use the StAX event API directly?
>>>>
>>> Sorry, I wasn't clear: *if* an XMLEvent is needed, then it's easy to get it from a stream.
>>>
>>>>> The event API has the major drawback of always creating a new object for every event (since, as the javadoc says, "events may be cached and referenced after the parse has completed"). This can put a big strain on the memory system and garbage collection in a busy application.
>>>>>
>>>> That's right, but keeping in mind that we want to create a pull pipe, where the serializer pulls each event from the producer through each transformer and writes it to an output stream, we have no option but to create an object for each event.
>>>>
>>>> Think about it in a little more detail. To be able to pull each event, you have to be able to call a method looking like:
>>>>
>>>> Object next();
>>>>
>>>> on the parent of the pipeline component. Doing it the StAX cursor way means increasing the complexity from one method to ten or more, all of which have to be available through the parent...
>>>>
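For illustration of the difference being discussed here, a minimal sketch (not part of the thread itself) of the same consumption loop written against both StAX APIs; it only handles elements and text, and the sample document is a placeholder:

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;
import java.io.StringReader;

public class CursorVsIterator {

    static final String XML = "<doc><p>hello</p></doc>";

    // Cursor API: the reader itself carries the current event; which
    // accessors are valid depends on the current event type.
    static void cursor() throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(XML));
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    System.out.println("start: " + reader.getLocalName());
                    break;
                case XMLStreamConstants.CHARACTERS:
                    System.out.println("text:  " + reader.getText());
                    break;
                default:
                    break;
            }
        }
    }

    // Iterator (event) API: a single nextEvent() call returns a
    // self-contained XMLEvent object that can be inspected, buffered,
    // or passed on to the next component.
    static void iterator() throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(XML));
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                System.out.println("start: " + event.asStartElement().getName().getLocalPart());
            } else if (event.isCharacters()) {
                System.out.println("text:  " + event.asCharacters().getData());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        cursor();
        iterator();
    }
}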
>>> Not necessarily, depending on how the API is designed. Let's give it a try:
>>>
>>> /** A generator can pull events from somewhere and write them to an output */
>>> interface Generator {
>>>     /** Do we still have something to produce? */
>>>     boolean hasNext();
>>>
>>>     /** Do some processing and produce some output */
>>>     void pull(XMLStreamWriter output);
>>> }
>>>
>>> /** A transformer is a generator that has an XML input */
>>> interface Transformer extends Generator {
>>>     void setInput(XMLStreamReader input);
>>> }
>>>
>>> class StaxFIFO implements XMLStreamReader, XMLStreamWriter {
>>>     Generator generator;
>>>
>>>     StaxFIFO(Generator generator) {
>>>         this.generator = generator;
>>>     }
>>>
>>>     // Implement all XMLStreamWriter methods as writing to an
>>>     // internal stream FIFO buffer
>>>
>>>     // Implement all XMLStreamReader methods as reading from an
>>>     // internal stream FIFO buffer, except hasNext() below:
>>>
>>>     boolean hasNext() {
>>>         while (eventBufferIsEmpty() && generator.hasNext()) {
>>>             // Ask the generator to produce some events
>>>             generator.pull(this);
>>>         }
>>>         return !eventBufferIsEmpty();
>>>     }
>>> }
>>>
>>> Building and executing a pipeline is then rather simple:
>>>
>>> class Pipeline {
>>>     Generator generator;
>>>     Transformer[] transformers;
>>>     XMLStreamWriter serializer;
>>>
>>>     void execute() {
>>>         Generator last = generator;
>>>         for (Transformer tr : transformers) {
>>>             tr.setInput(new StaxFIFO(last));
>>>             last = tr;
>>>         }
>>>
>>>         // Pull from the whole chain to the serializer
>>>         while (last.hasNext()) {
>>>             last.pull(serializer);
>>>         }
>>>     }
>>> }
>>>
>>> Every component gets an XMLStreamWriter to write its output to (in a style equivalent to SAX), and transformers get an XMLStreamReader to read their input from.
>>>
>>> The programming model is then very simple: for every call to pull(), read something in, process it and produce the corresponding output (optional, since the end of processing is defined by hasNext()). The buffers used to connect the various components allow pull() to read and process a set of related events, resulting in any number of events being written to the buffer.
>>>
>>>>> So the cursor API is IMO the most efficient when it comes to consuming data, since it doesn't require creating useless event objects.
>>>>>
>>>>> Now in a pipeline context, we will want to transmit events untouched from one component to the next one, using some partial buffering as mentioned in earlier discussions. A FIFO of XMLEvent objects seems to be the natural solution for this, but it would require the use of events at the pipeline API level, with their associated costs mentioned above.
>>>>>
>>>> I'm not sure I get the point here, but we do not want to "transmit" events. They are pulled. Therefore in most cases we simply do not need a buffer, since events can be returned directly.
>>>>
>>> In the previous discussions, it was considered that pulling from the previous component would lead that component to process, in one pull call, a number of related events, to avoid state handling that would make it even more complex than SAX. And processing these related events will certainly mean returning ("transmitting", as I said) several events. So buffering *is* needed in most non-trivial cases.
>>>
>>>>> So what should be used for pipelines? My impression is that we should stick to the most efficient API and build the simple tools needed to buffer events from a StreamReader, taking inspiration from the XMLBytestreamCompiler we already have.
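To make the proposed programming model more concrete, here is a minimal, hypothetical component written against the Generator/Transformer interfaces sketched above (it is not taken from the thread or from the Cocoon 3 code base, and assumes those interfaces are available). It renames <foo> elements to <bar> and copies text and end tags; attributes, namespaces and the remaining event types are ignored for brevity.

import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

class RenamingTransformer implements Transformer {

    private XMLStreamReader input;

    public void setInput(XMLStreamReader input) {
        this.input = input;
    }

    public boolean hasNext() {
        try {
            return input.hasNext();
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }

    // Each pull() consumes one event from the input cursor and writes the
    // corresponding output; a real component could consume several related
    // events per call, as described above.
    public void pull(XMLStreamWriter output) {
        try {
            switch (input.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    String name = input.getLocalName();
                    output.writeStartElement("foo".equals(name) ? "bar" : name);
                    break;
                case XMLStreamConstants.CHARACTERS:
                    output.writeCharacters(input.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    output.writeEndElement();
                    break;
                default:
                    // comments, PIs, document events, ... ignored for brevity
                    break;
            }
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }
}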
>>>>>
>>>> Maybe some events could be avoided using the cursor API, but IMO the performance we could gain is not worth the simplicity we sacrifice...
>>>>
>>> I don't agree that we sacrifice simplicity. With the above, the developer only deals with XMLStreamWriter and XMLStreamReader objects, and never has to implement them.
>>>
>>> We just need an efficient StaxFIFO class, which people shouldn't have to care about since it is completely hidden in the Pipeline object.
>>>
>>> Thoughts?
>>>
>> Well, the approach outlined above will certainly work.
>>
>> Basically you're providing a buffer between every pair of components and filling it as needed.
>> But you need to implement both XMLStreamWriter and XMLStreamReader and optimize them for anything a transformer might possibly do.
>> In order to buffer all the data from the components you will have to create some objects as well - I guess you will end up with something like XMLEvent and maintain a list of them in the StaxFIFO.
>> That's why I think an efficient (as in faster than the event API) implementation of the StaxFIFO is difficult to make.
>>
>> On the other hand, I do think the cursor API is quite a bit harder to use.
>> As stated in the Javadoc of XMLStreamReader, it is the lowest-level API for reading XML data - which usually means more logic in the code using the API, and more knowledge required of the developer reading/writing that code.
>> So I second Andreas' statement that we will sacrifice simplicity for (a small amount of?) performance.
>>
>> The other thing is that - at least the way you suggested it - we would need a special implementation of the Pipeline interface.
>> That is something that compromises the intention behind having a Pipeline API.
>> Right now we can use the new StAX components and simply put them into any of the Pipeline implementations we already have.
>> Sacrificing this is completely out of the question IMO.
>>
> I did a little (and quite dirty) implementation of Sylvain's ideas to be able to see all the advantages and drawbacks of such an approach. I came to the following conclusions:
>
> First of all, we do not need to change the interfaces of the pipeline API. It is possible to do it in quite a similar way to what we did for the StAX event iteration API.
>
> Writing code with the streaming API makes things a little more complicated than working with the XMLEvent object. Instead of working with three methods and an object, you have to handle x methods (but this is my personal opinion).
>
> I'm with Steven that you'll end up with a list of XMLEvent-like objects in the StAXFIFO buffer (my prototype does :) ). And this makes things worse... The tasks which could/would be handled by such an API can be reduced to the following cases: removing XMLEvents (nodes, or whatever...), adding XMLEvents, changing XMLEvents, and simply letting them through. With the approach we are using at the moment, you are only required to buffer events when adding them. In all other cases a buffer is NOT required (thanks to the navigator idea :) ). And that's what makes a StreamReader/StreamWriter StAXFifo-buffering approach worse than the XMLEvent approach. Think about a situation with x transformers and an XML document where 90% of it simply doesn't have to be touched. In that case, for each transformer, for each node not touched, an object has to be created in the buffer, whereas one object is enough in the XMLEvent approach.
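As an illustration of the cases Andreas lists (again a hypothetical sketch, not code from the Cocoon 3 prototype): with the event-iterator approach, removing and passing through events needs no buffer and creates no new objects - the transformer simply skips or hands on the XMLEvent it pulled from upstream. The EventParent interface below is an assumed stand-in for the upstream pipeline component.

import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.events.XMLEvent;

class CommentStrippingTransformer {

    /** Assumed interface of the upstream component we pull from. */
    interface EventParent {
        boolean hasNext();
        XMLEvent nextEvent();
    }

    private final EventParent parent;

    CommentStrippingTransformer(EventParent parent) {
        this.parent = parent;
    }

    XMLEvent nextEvent() {
        XMLEvent event = parent.nextEvent();
        // Removing events: just skip them - no buffer, no new objects.
        // (END_DOCUMENT is always the last event, so this loop terminates.)
        while (event.getEventType() == XMLStreamConstants.COMMENT) {
            event = parent.nextEvent();
        }
        // Letting events through: the very same object is handed on.
        return event;
    }
}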
>
> To sum it up: we add (OK, not too much, but we do) another layer of complexity without increasing performance (contrary to how it may look at first glance). IMHO we'll end up with more created objects than in an event-iterator approach.
>
I am still convinced the iterator API is the way to go. The recommendations from Sun (as cited in the PDF attached to the initial mail) IMO clearly point in that direction:

* If you are programming for a particularly memory-constrained environment, like J2ME, you can make smaller, more efficient code with the cursor API.
* If performance is your highest priority--for example, when creating low-level libraries or infrastructure--the cursor API is more efficient.
* If you want to create XML processing pipelines, use the iterator API.
* If you want to modify the event stream, use the iterator API.
* If you want your application to be able to handle pluggable processing of the event stream, use the iterator API.
* In general, if you do not have a strong preference one way or the other, using the iterator API is recommended because it is more flexible and extensible, thereby "future-proofing" your applications.

We're not operating in a particularly memory-constrained environment. IMO performance is not the *highest* priority (it is important, of course). But we do want to create XML processing pipelines, containing pluggable components that modify the event stream, and we want to use StAX as a more flexible and intuitive *alternative* to SAX.
If you're really all about performance, you should probably stick with SAX anyway.

Well, we can still have a shootout between the cursor and the iterator API at a later time. Actually, that would be really interesting and would demonstrate one of the most important aspects (IMO) of Cocoon 3: fill your pipelines with whatever you want - and if the components are nice and sweet, others might like them as well.

> Andreas
>
>
>> Steven
>>
>>
>>> Sylvain
>>>
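A rough sketch of what the cursor-vs-iterator shootout mentioned above could look like (purely illustrative; a real comparison would need a proper benchmark harness, warm-up and realistic documents - the sample document here is a placeholder):

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

public class StaxShootoutSketch {

    static final String XML = "<root><item>one</item><item>two</item></root>";

    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();

        // Cursor API: walk the document without allocating event objects.
        long start = System.nanoTime();
        XMLStreamReader cursor = factory.createXMLStreamReader(new StringReader(XML));
        int cursorEvents = 0;
        while (cursor.hasNext()) {
            cursor.next();
            cursorEvents++;
        }
        long cursorTime = System.nanoTime() - start;

        // Iterator API: every nextEvent() call returns a new XMLEvent object.
        start = System.nanoTime();
        XMLEventReader iterator = factory.createXMLEventReader(new StringReader(XML));
        int iteratorEvents = 0;
        while (iterator.hasNext()) {
            iterator.nextEvent();
            iteratorEvents++;
        }
        long iteratorTime = System.nanoTime() - start;

        System.out.printf("cursor:   %d events in %d ns%n", cursorEvents, cursorTime);
        System.out.printf("iterator: %d events in %d ns%n", iteratorEvents, iteratorTime);
    }
}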