cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Re: [ANN] VTD-XML Version 1.5 Released
Date Mon, 20 Feb 2006 04:57:01 GMT
Jimmy Zhang wrote:
> Eight years after the invention of XML, DOM and SAX,
> despite their respective issues, are still the mainstays
> of application developers. 
>  
> So is it the end of the road for XML parsing innovation?
>  
> The VTD-XML project team think not. We are proud to
> announce the availability of both C and Java version
> 1.5 of VTD-XML, the next generation open-source XML
> parser that goes beyond DOM and SAX in terms of
> performance, memory usage and ease of use.
>  
> The technical highlights of VTD-XML are:
>  
> * Performance: the world's fastest XML parser,
>   between 5x~10x faster than DOM
> * Memory Usage: 3x to 5x less than DOM, 1.3x~1.5x
>   XML document size
> * Random access with built-in XPath support
> * A simple and intuitive API
>  
> Other advanced features include:
> * Buffer reuse
> * Large document support (2GByte)
> * Incremental update
> * Hardware acceleration
> * Native XML indexing.
>  
> For demos, latest benchmarks, related articles and software
> downloads, please visit http://vtd-xml.sf.net. Also let us
> know your thoughts and suggestions and help us improve
> VTD-XML.

Hmmmm, I have to admit that I've toyed with this idea myself lately, 
especially since I'm diving deep into processing large quantities of XML 
files these days (and when I say 'large', I mean it: so large that 32 
bits of address space are not enough).

The idea of non-extracting parsing is nice, but there are a few issues:

  1) the memory requirements, while still much lower than DOM's, are 
still *way* higher than those of an event-driven model like SAX. Cocoon, 
for example, would die if we were to move to a parser like this one, 
especially under load spikes.

  2) benchmarking against a dummy SAX content handler is completely 
meaningless. In order for the API to be of any use, you have to create 
strings; you can't simply pass pointers to char arrays around. I bet 
that if the SAX parser could go on without creating strings, it would be 
just as fast (Xerces, in fact, uses a similar mechanism to deliver the 
characters() SAX event: the entire document is kept in memory, and a 
start offset and a length are passed instead of a new array).

  3) 90% of the slowness comes from 10% of the details in the XML spec, 
which means that in order to stay fast, you need to sacrifice 
compliance... which is not an option these days given how cheap silicon 
is.
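
To make point 2 concrete, here is a minimal sketch of the two kinds of 
SAX handler being contrasted. The class and handler names are made up 
for illustration; only the standard SAX API (DefaultHandler and its 
characters(char[], int, int) callback) is real. The first handler reads 
the parser's buffer in place, which is roughly what a "dummy" benchmark 
handler gets away with; the second copies every chunk into a String, 
which is closer to what real application code usually ends up doing.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Cocoon, Xerces, or VTD-XML.
class SaxStringCost {

    /** Inspects the parser's buffer in place: no String is created. */
    static class InPlaceHandler extends DefaultHandler {
        long nonWhitespace = 0;

        @Override
        public void characters(char[] ch, int start, int length) {
            // ch belongs to the parser; only [start, start + length) is valid.
            for (int i = start; i < start + length; i++) {
                if (!Character.isWhitespace(ch[i])) {
                    nonWhitespace++;
                }
            }
        }
    }

    /** Copies every text chunk into a new String before using it. */
    static class CopyingHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(new String(ch, start, length));
        }
    }

    /** Parses an in-memory document with the JDK's default SAX parser. */
    static void parse(String xml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><a>hello</a><b>wide world</b></doc>";
        InPlaceHandler inPlace = new InPlaceHandler();
        CopyingHandler copying = new CopyingHandler();
        parse(xml, inPlace);
        parse(xml, copying);
        System.out.println(inPlace.nonWhitespace); // 14
        System.out.println(copying.text);          // hellowide world
    }
}
```

A fair benchmark would compare a non-extracting parser against something 
like the second handler, or else compare both sides without building 
strings at all.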

But don't get me wrong, I think there is something interesting in what 
you are doing: I think it would be cool if you could serialize the 'tree 
index' alongside the document on disk and provide some sort of b-tree 
indexing for it. It would help me in my multi-GB-of-XML day2day struggle.

You claim XPath random access, but what is the algorithmic complexity 
of that? O(1), O(log(n)), O(n), O(n*log(n))? If one were to store the 
parsed tree index on disk, how many pages would one need to page in 
before reaching the required XPath target?
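
As a back-of-the-envelope reference for the paging question, here is a 
small sketch of how many pages a single key lookup would touch if the 
index were stored as a B-tree. Everything in it is an assumption made for 
illustration (the class name, one page per tree node, a fanout of 256 
from 4 KB pages holding 16-byte entries); it says nothing about how 
VTD-XML actually lays out its index, and an XPath expression that cannot 
be answered by a key lookup may still need a full O(n) scan.

```java
// Hypothetical estimate; not based on VTD-XML's actual index format.
class BTreePageEstimate {

    /**
     * Pages read by one root-to-leaf lookup, assuming one page per node
     * and full nodes: the smallest depth d such that fanout^d >= entries.
     */
    static int pagesPerLookup(long entries, int fanout) {
        if (entries < 1 || fanout < 2) {
            throw new IllegalArgumentException("need entries >= 1 and fanout >= 2");
        }
        int pages = 1;
        long reachable = fanout; // entries addressable with `pages` levels
        while (reachable < entries) {
            if (reachable > Long.MAX_VALUE / fanout) {
                // One more level exceeds any long, so it covers `entries`.
                return pages + 1;
            }
            reachable *= fanout;
            pages++;
        }
        return pages;
    }

    public static void main(String[] args) {
        int fanout = 4096 / 16; // assumed: 4 KB pages, 16-byte entries
        System.out.println(pagesPerLookup(1_000_000_000L, fanout)); // 4
    }
}
```

Under these assumptions the answer grows as O(log(n)): a billion index 
entries would cost about four page reads per lookup, before any caching 
of the upper levels.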

-- 
Stefano.

