Mailing-List: contact dev-help@cocoon.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cocoon.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Message-ID: <495725CD.9050809@indoqa.com>
Date: Sun, 28 Dec 2008 08:07:57 +0100
From: Steven Dolg <steven.dolg@indoqa.com>
User-Agent: Thunderbird 2.0.0.18 (Windows/20081105)
MIME-Version: 1.0
To: dev@cocoon.apache.org
Subject: Re: [C3] StAX research reveiled!
References: <49520644.30108@gmail.com> <4955F707.4020705@apache.org>
 <200812271413.43183.andreas.pieber@schmutterer-partner.at>
 <4956B9B2.7050300@apache.org>
In-Reply-To: <4956B9B2.7050300@apache.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Sylvain Wallez wrote:
> Andreas Pieber wrote:
>> On Saturday 27 December 2008 10:36:07 Sylvain Wallez wrote:
>>  
>>> Michael Seydl wrote:
>>>    
>>>> Hi all!
>>>>
>>>> One more mail for the student group! Behind this lurid topic hides our
>>>> evaluation of the latest XML processing technologies regarding their
>>>> usability in Cocoon3 (especially whether they are suited to be used in
>>>> a streaming pipeline).
>>>> As is commonly known, we decided to use StAX as our weapon of choice
>>>> for XML processing, but this paper should explain the whys and hows,
>>>> and especially the path that led us to our decision to use that very
>>>> API.
>>>> Eleven pages shouldn't be too big a read, and the paper contains all
>>>> necessary links to the APIs we evaluated, along with our two cents
>>>> on each API we looked at. In conclusion, we also tried to show the
>>>> difference between the currently used SAX API and the StAX API we
>>>> propose.
>>>>
>>>> I hope this work sheds some light on our decision making and taking
>>>> and that someone dares to read it.
>>>>
>>>> That's from me, I wish you all a pleasant and very merry Christmas!
>>>>
>>>> Regards,
>>>> Michael Seydl
>>>>       
>>> Good work and an interesting read, but I don't agree with some of its
>>> statements!
>>>
>>> The big if/else or switch statements mentioned as a drawback of the
>>> cursor API (XMLStreamReader) in 1.2.4 also apply to the event API, since
>>> it provides abstract events whose type also needs to be inspected to
>>> decide what to do.
>>>     
>>
>> Of course, you're right!
>>
>>  
>>> The drawbacks of the stream API compared to the event API are, as you
>>> mention, that some methods of XMLStreamReader will throw an exception
>>> depending on the current event's type and that the event is not
>>> represented as a data structure that can be passed directly to the next
>>> element in the pipeline or stored in an event buffer.
>>>
>>> The first point (exceptions) should not happen, unless the code is buggy
>>> and tries to get information that doesn't belong to the context. I have
>>> used the cursor API many times and haven't found any usability problems
>>> with it.
>>>     
>>
>> Also here you're right, but IMHO it is not necessary to add another 
>> source for bugs if not required...
>>   
>
> Well, there are so many other sources of bugs... I wouldn't sacrifice 
> efficiency for bad usage of an API. And when dealing with XML, people 
> should know that e.g. calling getAttribute() for a text event is 
> meaningless.
>
>>> The second point (lack of data structure) can be easily solved by using
>>> an XMLEventAllocator [1] that creates an XMLEvent from the current state
>>> of an XMLStreamReader.
>>>     
>>
>> Hmm, but if we use an XMLEventAllocator, why not use the StAX event
>> API directly?
>>   
>
> Sorry, I wasn't clear: *if* an XMLEvent is needed, then it's easy to
> get it from a stream.
>
>>> The event API has the major drawback of always creating a new object for
>>> every event (since as the javadoc says "events may be cached and
>>> referenced after the parse has completed"). This can lead to a big
>>> strain on the memory system and garbage collection on a busy application.
>>>     
>>
>> That's right, but keep in mind that we want to create a pull pipe, where
>> the serializer pulls each event from the producer through each
>> transformer and writes it to an output stream. There we have no
>> other option than creating an object for each event.
>>
>> Think about it a little more in detail. To be able to pull each event 
>> you have to have the possibility to call a method looking like:
>>
>> Object next();
>>
>> on the parent of the pipelineComponent. Doing it in a StAX cursor way 
>> means to increase the complexity from one method to 10 or more which 
>> have to be available through the parent...
>>   
>
> Not necessarily, depending on how the API is designed. Let's give it a 
> try:
>
> /** A generator can pull events from somewhere and write them to an output */
> interface Generator {
>    /** Do we still have something to produce? */
>    boolean hasNext();
>
>    /** Do some processing and produce some output */
>    void pull(XMLStreamWriter output);
> }
>
> /** A transformer is a generator that has an XML input */
> interface Transformer extends Generator {
>    void setInput(XMLStreamReader input);
> }
>
> class StaxFIFO implements XMLStreamReader, XMLStreamWriter {
>    Generator generator;
>    StaxFIFO(Generator generator) {
>        this.generator = generator;
>    }
>
>    // Implement all XMLStreamWriter methods as writing to an
>    // internal stream FIFO buffer
>
>    // Implement all XMLStreamReader methods as reading from an
>    // internal stream FIFO buffer, except hasNext() below:
>
>    public boolean hasNext() {
>        while (eventBufferIsEmpty() && generator.hasNext()) {
>            // Ask the generator to produce some events
>            generator.pull(this);
>        }
>        return !eventBufferIsEmpty();
>    }
> }
>
> Building and executing a pipeline is then rather simple:
>
> class Pipeline {
>    Generator generator;
>    Transformer transformers[];
>    XMLStreamWriter serializer;
>
>    void execute() {
>        Generator last = generator;
>        for (Transformer tr : transformers) {
>            tr.setInput(new StaxFIFO(last));
>            last = tr;
>        }
>
>        // Pull from the whole chain to the serializer
>        while(last.hasNext()) {
>            last.pull(serializer);
>        }
>    }
> }
>
> Every component gets an XMLStreamWriter to write its output to
> (in a style equivalent to SAX), and transformers get an
> XMLStreamReader to read their input from.
>
> The programming model is then very simple: for every call to pull(), 
> read something in, process it and produce the corresponding output 
> (optional, since end of processing is defined by hasNext()). The 
> buffers used to connect the various components allow pull() to read 
> and process a set of related events, resulting in any number of events 
> being written to the buffer.
>
>>> So the cursor API is the most efficient IMO when it comes to consuming
>>> data, since it doesn't require creating useless event objects.
>>>
>>> Now in a pipeline context, we will want to transmit events untouched
>>> from one component to the next one, using some partial buffering as
>>> mentioned in earlier discussions. A FIFO of XMLEvent object seems to be
>>> the natural solution for this, but would require the use of events at
>>> the pipeline API level, with their associated costs mentioned above.
>>>     
>>
>> I'm not sure if I get the point here, but we do not like to 
>> "transmit" events. They are pulled. Therefore in most cases we simply 
>> do not need a buffer, since events could be directly returned.
>>   
>
> In the previous discussions, it was considered that pulling from the
> previous component would lead that component to process a number of
> related events in one pull call, to avoid state handling that would
> make it even more complex than SAX. And processing these related
> events will certainly mean returning ("transmitting" as I said)
> several events. So buffering *is* needed in most non-trivial cases.
>
>>> So what should be used for pipelines ? My impression is that we should
>>> stick to the most efficient API and build the simple tools needed to
>>> buffer events from a StreamReader, taking inspiration from the
>>> XMLBytestreamCompiler we already have.
>>>     
>>
>> Maybe some events could be avoided using the cursor API, but IMO the 
>> performance we could get is not worth the simplicity we sacrifice...
>>   
>
> I don't agree that we sacrifice simplicity. With the above, the 
> developer only deals with XMLStreamWriter and XMLStreamReader objects, 
> and never has to implement them.
>
> We just need an efficient StaxFIFO class, which people shouldn't care 
> about since it is completely hidden in the Pipeline object.
>
> Thoughts?
Well, the approach outlined above will certainly work.

Basically you're providing a buffer between every pair of components and
filling it as needed.
But you need to implement both XMLStreamWriter and XMLStreamReader and
optimize that for anything a transformer might possibly do.
In order to buffer all the data from the components you will have to
create some objects as well - I guess you will end up with something
like the XMLEvent and a list of them maintained in the StaxFIFO.
That's why I think an efficient (as in faster than the event API)
implementation of the StaxFIFO is difficult to achieve.
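
To make the allocation concern concrete, here is a minimal sketch of such
a buffer. It is not the StaxFIFO proposed above: MiniStaxFifo, Entry and
allocatedEntries() are made-up names, and it only covers three write
calls instead of implementing the full XMLStreamWriter and
XMLStreamReader interfaces. The point is only that every buffered write
call ends up as one small object, much like an XMLEvent.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLStreamConstants;

/** Hypothetical, heavily reduced stand-in for a cursor-style FIFO buffer. */
class MiniStaxFifo {

    /** One buffered write call; in effect this is an event object. */
    private static final class Entry {
        final int type;      // one of the XMLStreamConstants event types
        final String value;  // element name or text, null for end elements

        Entry(int type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    private final Deque<Entry> buffer = new ArrayDeque<>();
    private Entry current;
    private int allocated; // counts created entries, for illustration only

    // Writer side: each call is recorded as one Entry.
    void writeStartElement(String localName) {
        add(XMLStreamConstants.START_ELEMENT, localName);
    }

    void writeCharacters(String text) {
        add(XMLStreamConstants.CHARACTERS, text);
    }

    void writeEndElement() {
        add(XMLStreamConstants.END_ELEMENT, null);
    }

    private void add(int type, String value) {
        buffer.addLast(new Entry(type, value));
        allocated++;
    }

    // Reader side: cursor-style access to the oldest buffered entry.
    boolean hasNext() {
        return !buffer.isEmpty();
    }

    int next() {
        current = buffer.removeFirst();
        return current.type;
    }

    String getLocalName() {
        if (current == null || current.type != XMLStreamConstants.START_ELEMENT) {
            throw new IllegalStateException("not positioned on a start element");
        }
        return current.value;
    }

    String getText() {
        if (current == null || current.type != XMLStreamConstants.CHARACTERS) {
            throw new IllegalStateException("not positioned on character data");
        }
        return current.value;
    }

    int allocatedEntries() {
        return allocated;
    }
}
```

A smarter implementation could pool or reuse entries, but that is exactly
the kind of optimization work I would expect to be hard to get right.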

On the other hand I do think that the cursor API is quite a bit harder
to use.
As stated in the Javadoc of XMLStreamReader, it is the lowest level for
reading XML data - which usually means more logic is required in the code
using the API and more knowledge in the head of the developer
reading/writing that code.
So I second Andreas' statement that we would sacrifice simplicity for (a
small amount of?) performance.
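
For comparison, a small sketch of the same task (collecting the local
names of all start elements) written against both APIs. The class and
method names are made up for this mail; only the javax.xml.stream calls
are real. With the cursor API the reader itself is the event, so the
caller has to know which getters are valid in which state; with the
event API each event is a self-contained object that can be handed on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

/** Hypothetical helper comparing the StAX cursor and event APIs. */
class CursorVsEvent {

    /** Cursor API: no per-event object, state lives in the reader. */
    static List<String> namesWithCursor(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // getLocalName() is only valid on element events,
                // so the event type must be checked first.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    /** Event API: one XMLEvent object per event. */
    static List<String> namesWithEvents(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    names.add(event.asStartElement().getName().getLocalPart());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }
}
```

Both variants still need the type check Sylvain mentioned, so the
difference is less about if/else and more about where the state lives.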


The other thing is that - at least the way you suggested - we would need 
a special implementation of the Pipeline interface.
That is something that compromises the intention behind having a 
Pipeline API.
Right now we can use the new StAX components and simply put them into 
any of the Pipeline implementations we already have.
Sacrificing this is completely out of the question IMO.


Steven
>
> Sylvain
>