cocoon-dev mailing list archives

From Sylvain Wallez <sylv...@apache.org>
Subject Re: XInclude optimization
Date Sun, 22 Nov 2009 14:27:23 GMT
Simone Tripodi wrote:
> Hi all,
> I'm very sorry that I don't appear on the ML more often, but since April
> I've been working very hard for a client in Paris, which leaves me no
> spare time to dedicate to OS projects.
>   

Don't be sorry. We all have our own jobs/interests/duties that have 
driven us away from Cocoon. Glad to see you back!

> I'm writing because I'm sure the XInclude transformer I submitted some
> time ago could be optimized, so I'd like to ask you for a little help :)
>
> The state of the art is that, when including an entire document, it is
> processed efficiently through SAX APIs; the problem comes when
> processing a document referenced by xinclude+xpointer, which forces the
> processor to extract a sub-document of the included document.
>
> To do this, I implemented DOM parsing; then, through XPath, I
> extract the sub-document the processor has to include, and the
> selected elements are traversed and converted to SAX events. As you
> noticed, this takes time, too much IMO, but I didn't find/don't know
> any better solution :(
> Since you have experience with StAX, maybe you can suggest a fast
> way to parse a document with XPath and fire SAX events, so that I can
> provide you with a much better (and, above all, faster) solution.
>
> Any hints? Every suggestion would be much appreciated.
>   
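
For reference, the DOM-based approach described above can be sketched with 
the JDK's own APIs. This is only an illustration, not the code of the 
submitted transformer: the class and method names are hypothetical, and it 
assumes the XPointer has already been reduced to a plain XPath 1.0 
expression.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: full DOM parse, XPath selection, then replay as SAX events. */
class DomXPathToSax {

    static void include(String xml, String xpath, ContentHandler target) {
        try {
            // Step 1: build the whole tree in memory (the costly part).
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Step 2: select the sub-document(s) with a general XPath expression.
            NodeList selected = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);

            // Step 3: an identity transformer walks each node and emits SAX events.
            // Note: it also fires startDocument/endDocument for every node, which
            // a real transformer would have to filter out.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            for (int i = 0; i < selected.getLength(); i++) {
                identity.transform(new DOMSource(selected.item(i)), new SAXResult(target));
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Demo helper: records the names of the elements the target receives. */
    static List<String> elementNames(String xml, String xpath) {
        List<String> names = new ArrayList<>();
        include(xml, xpath, new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                names.add(qName);
            }
        });
        return names;
    }
}
```

The cost is in step 1: the whole included document is materialized even 
when the expression selects a small fragment.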

The problem with XPath and XML streaming (be it SAX or StAX) is that 
XPath is a language that allows exploring the document tree in all 
directions and thus inherently expects having the whole document tree 
available, which is clearly not compatible with streaming.

There are different approaches to solving this:
- use a deferred loading DOM implementation, which buffers events only 
when it needs them to traverse the tree. Axiom [1] provides this IIRC, 
along with an XPath implementation.
- restrict the XPointer expression to a subset of XPath that can easily 
be implemented on top of a stream. This means restricting selection only 
to the current element, its attributes and its ancestors. There's an 
implementation of this approach in Tika [2].
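
To illustrate the second approach, here is a minimal sketch of a streaming 
matcher. It is not Tika's code: the class name is hypothetical, and it 
assumes the only supported expressions are absolute child-axis paths such 
as /doc/section (with * as a wildcard step). Deciding whether an element 
matches needs nothing more than the stack of its ancestors, so no tree is 
ever built.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: keeps only the subtrees matched by a path like /a/b/c. */
class StreamingPathFilter extends DefaultHandler {
    private final String[] steps;                          // e.g. {"doc", "section"}
    private final Deque<String> open = new ArrayDeque<>(); // root ... current element
    private final StringBuilder out = new StringBuilder();
    private int matchDepth = -1; // stack depth of the matched element, -1 if outside

    StreamingPathFilter(String path) {
        this.steps = path.substring(1).split("/");
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        open.addLast(qName);
        if (matchDepth < 0 && pathMatches()) {
            matchDepth = open.size();
        }
        // For brevity the sketch serializes to a string and drops attributes;
        // a real transformer would forward the event to the next ContentHandler.
        if (matchDepth >= 0) {
            out.append('<').append(qName).append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (matchDepth >= 0) {
            out.append(ch, start, length); // no escaping in this sketch
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (matchDepth >= 0) {
            out.append("</").append(qName).append('>');
        }
        if (open.size() == matchDepth) {
            matchDepth = -1; // leaving the matched subtree
        }
        open.removeLast();
    }

    /** True when the open-element stack equals the path, step by step. */
    private boolean pathMatches() {
        if (open.size() != steps.length) {
            return false;
        }
        int i = 0;
        for (String name : open) {
            if (!steps[i].equals("*") && !steps[i].equals(name)) {
                return false;
            }
            i++;
        }
        return true;
    }

    static String extract(String xml, String path) {
        try {
            StreamingPathFilter filter = new StreamingPathFilter(path);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), filter);
            return filter.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Memory use is bounded by the nesting depth of the document rather than by 
its size, which is the whole point of restricting the expression language.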

The XInclude transformer can be smart enough to use the most efficient 
implementation for the given XPath expression, i.e. try to parse it with 
Tika's restricted subset, and fall back to something more costly, either 
Axiom or plain DOM.
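
A rough sketch of that dispatch, under loud assumptions: the class name is 
hypothetical, "streamable" is approximated by a regular expression that 
accepts only absolute child-axis paths (this is not how Tika parses its 
subset), and the fallback shown is plain DOM with the JDK's XPath engine.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch: pick the cheapest evaluation strategy for an expression. */
class XPointerStrategy {
    enum Kind { STREAMING, DOM }

    // Assumption: only paths such as /doc/section or /doc/* can be streamed.
    private static final Pattern STREAMABLE =
            Pattern.compile("(/(\\*|[A-Za-z_][\\w.-]*))+");

    static boolean isStreamable(String xpath) {
        return STREAMABLE.matcher(xpath).matches();
    }

    /** Try the restricted subset first, fall back to the costly path otherwise. */
    static Kind choose(String xpath) {
        return isStreamable(xpath) ? Kind.STREAMING : Kind.DOM;
    }

    /** Fallback: build the full tree, then evaluate any XPath 1.0 expression. */
    static int countWithDom(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

An expression like //section[following-sibling::note] looks sideways in 
the tree, so it is rejected by the check and goes through the DOM path.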

Sylvain

[1] http://ws.apache.org/commons/axiom/
[2] https://svn.apache.org/repos/asf/lucene/tika/trunk/tika-core/src/main/java/org/apache/tika/sax/xpath/

-- 
Sylvain Wallez - http://bluxte.net

