Message-ID: <49573D2F.4050307@indoqa.com>
Date: Sun, 28 Dec 2008 09:47:43 +0100
From: Steven Dolg
To: dev@cocoon.apache.org
Subject: Re: [C3] StAX research reveiled!
References: <49520644.30108@gmail.com> <4956B9B2.7050300@apache.org> <495725CD.9050809@indoqa.com> <200812280836.51897.andreas.pieber@schmutterer-partner.at>
In-Reply-To: <200812280836.51897.andreas.pieber@schmutterer-partner.at>

Andreas Pieber wrote:
> On Sunday 28 December 2008 08:07:57 Steven Dolg wrote:
>
>> Sylvain Wallez wrote:
>>
>>> Andreas Pieber wrote:
>>>
>>>> On Saturday 27 December 2008 10:36:07 Sylvain Wallez wrote:
>>>>
>>>>> Michael Seydl wrote:
>>>>>
>>>>>> Hi all!
>>>>>>
>>>>>> One more mail from the student group! Behind this lurid topic hides our evaluation of the latest XML processing technologies regarding their usability in Cocoon3 (especially whether they are suited for use in a streaming pipeline).
>>>>>> As is commonly known, we decided to use StAX as our weapon of choice for the XML processing, but this paper should explain the whys and hows, and especially the path we took to reach our decision, which resulted in us choosing that very API.
>>>>>> Eleven pages might be a bit much to read, but it contains all the necessary links to the APIs we evaluated, as well as our two cents on each of the APIs we looked at. Finally, we also tried to show the difference between the currently used SAX API and the StAX API we are proposing.
>>>>>>
>>>>>> I hope this work sheds some light on our decision making, and that someone dares to read it.
>>>>>>
>>>>>> That's it from me; I wish you all a pleasant and very merry Christmas!
>>>>>>
>>>>>> Regards,
>>>>>> Michael Seydl
>>>>>>
>>>>> Good work and an interesting read, but I don't agree with some of its statements!
>>>>>
>>>>> The big if/else or switch statements mentioned as a drawback of the cursor API (XMLStreamReader) in 1.2.4 also apply to the event API, since it provides abstract events whose type also needs to be inspected to decide what to do.
>>>>>
>>>> Of course, you're right!
>>>>
>>>>> The drawbacks of the stream API compared to the event API are, as you mention, that some methods of XMLStreamReader will throw an exception depending on the current event's type, and that the event is not represented as a data structure that can be passed directly to the next element in the pipeline or stored in an event buffer.
>>>>>
>>>>> The first point (exceptions) should not happen unless the code is buggy and tries to get information that doesn't belong to the context. I have used the cursor API many times and haven't found any usability problems with it.
>>>>>
>>>> You're right here as well, but IMHO it is not necessary to add another source of bugs if it can be avoided...
>>>>
>>> Well, there are so many other sources of bugs... I wouldn't sacrifice efficiency for bad usage of an API. And when dealing with XML, people should know that e.g. calling getAttribute() for a text event is meaningless.
>>>
>>>>> The second point (lack of a data structure) can easily be solved by using an XMLEventAllocator [1] that creates an XMLEvent from the current state of an XMLStreamReader.
>>>>>
>>>> Mhm, but if we use an XMLEventAllocator, why not use the StAX event API directly?
>>>>
>>> Sorry, I wasn't clear: *if* an XMLEvent is needed, then it's easy to get it from a stream.
>>>
>>>>> The event API has the major drawback of always creating a new object for every event (since, as the javadoc says, "events may be cached and referenced after the parse has completed"). This can put a big strain on the memory system and garbage collection in a busy application.
>>>>>
>>>> That's right, but keeping in mind that we want to create a pull pipe, where the serializer pulls each event from the producer through each transformer and writes it to an output stream, we have no option but to create an object for each event.
>>>>
>>>> Think about it in a little more detail. To be able to pull each event, you have to be able to call a method looking like:
>>>>
>>>> Object next();
>>>>
>>>> on the parent of the pipeline component. Doing it the StAX cursor way means increasing the complexity from one method to ten or more, all of which have to be available through the parent...
>>>>
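For illustration of the difference being discussed here, a minimal sketch (not part of the thread itself) of the same consumption loop written against both StAX APIs; it only handles elements and text, and the sample document is a placeholder:

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;
import java.io.StringReader;

public class CursorVsIterator {

    static final String XML = "<doc><p>hello</p></doc>";

    // Cursor API: the reader itself carries the current event; which
    // accessors are valid depends on the current event type.
    static void cursor() throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(XML));
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    System.out.println("start: " + reader.getLocalName());
                    break;
                case XMLStreamConstants.CHARACTERS:
                    System.out.println("text:  " + reader.getText());
                    break;
                default:
                    break;
            }
        }
    }

    // Iterator (event) API: a single nextEvent() call returns a
    // self-contained XMLEvent object that can be inspected, buffered,
    // or passed on to the next component.
    static void iterator() throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(XML));
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                System.out.println("start: " + event.asStartElement().getName().getLocalPart());
            } else if (event.isCharacters()) {
                System.out.println("text:  " + event.asCharacters().getData());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        cursor();
        iterator();
    }
}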
>>> Not necessarily, depending on how the API is designed. Let's give it a try:
>>>
>>> /** A generator can pull events from somewhere and write them to an output */
>>> interface Generator {
>>>     /** Do we still have something to produce? */
>>>     boolean hasNext();
>>>
>>>     /** Do some processing and produce some output */
>>>     void pull(XMLStreamWriter output);
>>> }
>>>
>>> /** A transformer is a generator that has an XML input */
>>> interface Transformer extends Generator {
>>>     void setInput(XMLStreamReader input);
>>> }
>>>
>>> class StaxFIFO implements XMLStreamReader, XMLStreamWriter {
>>>     Generator generator;
>>>
>>>     StaxFIFO(Generator generator) {
>>>         this.generator = generator;
>>>     }
>>>
>>>     // Implement all XMLStreamWriter methods as writing to an
>>>     // internal stream FIFO buffer
>>>
>>>     // Implement all XMLStreamReader methods as reading from an
>>>     // internal stream FIFO buffer, except hasNext() below:
>>>
>>>     boolean hasNext() {
>>>         while (eventBufferIsEmpty() && generator.hasNext()) {
>>>             // Ask the generator to produce some events
>>>             generator.pull(this);
>>>         }
>>>         return !eventBufferIsEmpty();
>>>     }
>>> }
>>>
>>> Building and executing a pipeline is then rather simple:
>>>
>>> class Pipeline {
>>>     Generator generator;
>>>     Transformer[] transformers;
>>>     XMLStreamWriter serializer;
>>>
>>>     void execute() {
>>>         Generator last = generator;
>>>         for (Transformer tr : transformers) {
>>>             tr.setInput(new StaxFIFO(last));
>>>             last = tr;
>>>         }
>>>
>>>         // Pull from the whole chain to the serializer
>>>         while (last.hasNext()) {
>>>             last.pull(serializer);
>>>         }
>>>     }
>>> }
>>>
>>> Every component gets an XMLStreamWriter to write its output to (in a style equivalent to SAX), and transformers get an XMLStreamReader to read their input from.
>>>
>>> The programming model is then very simple: for every call to pull(), read something in, process it and produce the corresponding output (optional, since the end of processing is defined by hasNext()). The buffers used to connect the various components allow pull() to read and process a set of related events, resulting in any number of events being written to the buffer.
>>>
>>>>> So the cursor API is IMO the most efficient when it comes to consuming data, since it doesn't require creating useless event objects.
>>>>>
>>>>> Now in a pipeline context, we will want to transmit events untouched from one component to the next one, using some partial buffering as mentioned in earlier discussions. A FIFO of XMLEvent objects seems to be the natural solution for this, but it would require the use of events at the pipeline API level, with their associated costs mentioned above.
>>>>>
>>>> I'm not sure I get the point here, but we do not want to "transmit" events. They are pulled. Therefore in most cases we simply do not need a buffer, since events can be returned directly.
>>>>
>>> In the previous discussions, it was considered that pulling from the previous component would lead that component to process, in one pull call, a number of related events, to avoid state handling that would make it even more complex than SAX. And processing these related events will certainly mean returning ("transmitting", as I said) several events. So buffering *is* needed in most non-trivial cases.
>>>
>>>>> So what should be used for pipelines? My impression is that we should stick to the most efficient API and build the simple tools needed to buffer events from a StreamReader, taking inspiration from the XMLBytestreamCompiler we already have.
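To make the proposed programming model more concrete, here is a minimal, hypothetical component written against the Generator/Transformer interfaces sketched above (it is not taken from the thread or from the Cocoon 3 code base, and assumes those interfaces are available). It renames <foo> elements to <bar> and copies text and end tags; attributes, namespaces and the remaining event types are ignored for brevity.

import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

class RenamingTransformer implements Transformer {

    private XMLStreamReader input;

    public void setInput(XMLStreamReader input) {
        this.input = input;
    }

    public boolean hasNext() {
        try {
            return input.hasNext();
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }

    // Each pull() consumes one event from the input cursor and writes the
    // corresponding output; a real component could consume several related
    // events per call, as described above.
    public void pull(XMLStreamWriter output) {
        try {
            switch (input.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    String name = input.getLocalName();
                    output.writeStartElement("foo".equals(name) ? "bar" : name);
                    break;
                case XMLStreamConstants.CHARACTERS:
                    output.writeCharacters(input.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    output.writeEndElement();
                    break;
                default:
                    // comments, PIs, document events, ... ignored for brevity
                    break;
            }
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }
}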
>>>>>
>>>> Maybe some events could be avoided using the cursor API, but IMO the performance we could gain is not worth the simplicity we sacrifice...
>>>>
>>> I don't agree that we sacrifice simplicity. With the above, the developer only deals with XMLStreamWriter and XMLStreamReader objects, and never has to implement them.
>>>
>>> We just need an efficient StaxFIFO class, which people shouldn't have to care about since it is completely hidden in the Pipeline object.
>>>
>>> Thoughts?
>>>
>> Well, the approach outlined above will certainly work.
>>
>> Basically you're providing a buffer between every pair of components and filling it as needed.
>> But you need to implement both XMLStreamWriter and XMLStreamReader and optimize them for anything a transformer might possibly do.
>> In order to buffer all the data from the components you will have to create some objects as well - I guess you will end up with something like XMLEvent and maintain a list of them in the StaxFIFO.
>> That's why I think an efficient (as in faster than the event API) implementation of the StaxFIFO is difficult to make.
>>
>> On the other hand, I do think the cursor API is quite a bit harder to use.
>> As stated in the Javadoc of XMLStreamReader, it is the lowest-level API for reading XML data - which usually means more logic in the code using the API, and more knowledge required of the developer reading/writing that code.
>> So I second Andreas' statement that we will sacrifice simplicity for (a small amount of?) performance.
>>
>> The other thing is that - at least the way you suggested it - we would need a special implementation of the Pipeline interface.
>> That is something that compromises the intention behind having a Pipeline API.
>> Right now we can use the new StAX components and simply put them into any of the Pipeline implementations we already have.
>> Sacrificing this is completely out of the question IMO.
>>
> I did a little (and quite dirty) implementation of Sylvain's ideas to be able to see all the advantages and drawbacks of such an approach. I came to the following conclusions:
>
> First of all, we do not need to change the interfaces of the pipeline API. It is possible to do it in quite a similar way to what we did for the StAX event iteration API.
>
> Writing code with the streaming API makes things a little more complicated than working with the XMLEvent object. Instead of working with three methods and an object, you have to handle x methods (but this is my personal opinion).
>
> I'm with Steven that you'll end up with a list of XMLEvent-like objects in the StAXFIFO buffer (my prototype does :) ). And this makes things worse... The tasks which could/would be handled by such an API can be reduced to the following cases: removing XMLEvents (nodes, or whatever...), adding XMLEvents, changing XMLEvents, and simply letting them through. With the approach we are using at the moment, you are only required to buffer events when adding them. In all other cases a buffer is NOT required (thanks to the navigator idea :) ). And that's what makes a StreamReader/StreamWriter StAXFifo-buffering approach worse than the XMLEvent approach. Think about a situation with x transformers and an XML document where 90% of it simply doesn't have to be touched. In that case, for each transformer, for each node not touched, an object has to be created in the buffer, whereas one object is enough in the XMLEvent approach.
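As an illustration of the cases Andreas lists (again a hypothetical sketch, not code from the Cocoon 3 prototype): with the event-iterator approach, removing and passing through events needs no buffer and creates no new objects - the transformer simply skips or hands on the XMLEvent it pulled from upstream. The EventParent interface below is an assumed stand-in for the upstream pipeline component.

import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.events.XMLEvent;

class CommentStrippingTransformer {

    /** Assumed interface of the upstream component we pull from. */
    interface EventParent {
        boolean hasNext();
        XMLEvent nextEvent();
    }

    private final EventParent parent;

    CommentStrippingTransformer(EventParent parent) {
        this.parent = parent;
    }

    XMLEvent nextEvent() {
        XMLEvent event = parent.nextEvent();
        // Removing events: just skip them - no buffer, no new objects.
        // (END_DOCUMENT is always the last event, so this loop terminates.)
        while (event.getEventType() == XMLStreamConstants.COMMENT) {
            event = parent.nextEvent();
        }
        // Letting events through: the very same object is handed on.
        return event;
    }
}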
>
> To sum it up: we add (OK, not too much, but we do) another layer of complexity without increasing performance (contrary to how it may look at first glance). IMHO we'll end up with more created objects than in an event-iterator approach.
>
I am still convinced the iterator API is the way to go. The recommendations from Sun (as cited in the PDF attached to the initial mail) IMO clearly point in that direction:

* If you are programming for a particularly memory-constrained environment, like J2ME, you can make smaller, more efficient code with the cursor API.
* If performance is your highest priority--for example, when creating low-level libraries or infrastructure--the cursor API is more efficient.
* If you want to create XML processing pipelines, use the iterator API.
* If you want to modify the event stream, use the iterator API.
* If you want your application to be able to handle pluggable processing of the event stream, use the iterator API.
* In general, if you do not have a strong preference one way or the other, using the iterator API is recommended because it is more flexible and extensible, thereby "future-proofing" your applications.

We're not operating in a particularly memory-constrained environment. IMO performance is not the *highest* priority (it is important, of course). But we do want to create XML processing pipelines, containing pluggable components that modify the event stream, and we want to use StAX as a more flexible and intuitive *alternative* to SAX.
If you're really all about performance, you should probably stick with SAX anyway.

Well, we can still have a shootout between the cursor and the iterator API at a later time. Actually, that would be really interesting and would demonstrate one of the most important aspects (IMO) of Cocoon 3: fill your pipelines with whatever you want - and if the components are nice and sweet, others might like them as well.

> Andreas
>
>
>> Steven
>>
>>
>>> Sylvain
>>>
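A rough sketch of what the cursor-vs-iterator shootout mentioned above could look like (purely illustrative; a real comparison would need a proper benchmark harness, warm-up and realistic documents - the sample document here is a placeholder):

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

public class StaxShootoutSketch {

    static final String XML = "<root><item>one</item><item>two</item></root>";

    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();

        // Cursor API: walk the document without allocating event objects.
        long start = System.nanoTime();
        XMLStreamReader cursor = factory.createXMLStreamReader(new StringReader(XML));
        int cursorEvents = 0;
        while (cursor.hasNext()) {
            cursor.next();
            cursorEvents++;
        }
        long cursorTime = System.nanoTime() - start;

        // Iterator API: every nextEvent() call returns a new XMLEvent object.
        start = System.nanoTime();
        XMLEventReader iterator = factory.createXMLEventReader(new StringReader(XML));
        int iteratorEvents = 0;
        while (iterator.hasNext()) {
            iterator.nextEvent();
            iteratorEvents++;
        }
        long iteratorTime = System.nanoTime() - start;

        System.out.printf("cursor:   %d events in %d ns%n", cursorEvents, cursorTime);
        System.out.printf("iterator: %d events in %d ns%n", iteratorEvents, iteratorTime);
    }
}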