cocoon-dev mailing list archives

From "Jimmy Zhang" <>
Subject Re: [ANN] VTD-XML Version 1.5 Released
Date Mon, 20 Feb 2006 18:10:09 GMT
Hi, thanks for the email.
My answers to your questions:
1. It is a tradeoff: VTD-XML consumes more memory, but it is easier
to use and more powerful. Any XML processing API capable of random
access *needs* to load at least the entire hierarchical structure
into memory. My take is that, compared with SAX, StAX, DOM, and JDOM,
VTD-XML is the least likely to choke and the best at handling
peak loads...
2. I agree with you that benchmarking against a dummy SAX parser is unfair to
VTD-XML; in a real-life scenario, where strings actually have to be created,
VTD-XML would look even better.
3. Looking at the vertical-industry XML vocabularies, SOAP, REST, XML Schema,
and the Infoset data model, DTD does seem a bit deprecated, and VTD-XML
doesn't support external entities. Other than that, VTD-XML is equally
capable.
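The non-extractive idea behind the memory/usability tradeoff above can be sketched in a few lines. This is hypothetical illustration code, not the real VTD-XML API: tokens are kept as (offset, length) pairs into the original document buffer, and a String is built only when a token is actually accessed.

```java
// Sketch of non-extractive parsing (hypothetical; not the com.ximpleware API).
// The document stays in one char buffer; tokens are just integer records
// pointing into it, so no per-token String objects exist until requested.
public class NonExtractiveSketch {
    static final char[] doc = "<root><item>hello</item></root>".toCharArray();

    // A token is an (offset, length) pair into doc; a real implementation
    // packs these, plus token type and depth, into fixed-size records.
    static String token(int offset, int length) {
        // The String is materialized lazily, only on access.
        return new String(doc, offset, length);
    }

    public static void main(String[] args) {
        // (12, 5) addresses the text "hello" inside the buffer.
        System.out.println(token(12, 5));
    }
}
```

The tradeoff Jimmy describes falls out directly: the whole buffer plus the token records must stay in memory (unlike streaming SAX), but every node remains randomly addressable.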


----- Original Message ----- 
From: "Stefano Mazzocchi" <>
To: <>
Sent: Sunday, February 19, 2006 8:57 PM
Subject: Re: [ANN] VTD-XML Version 1.5 Released

> Hmmmm, I have to admit that I've toyed with this idea myself lately, 
> especially since I'm diving deep into processing large quantities of XML 
> files these days (when I say 'large', I mean it: so large that 32 bits of 
> address space are not enough).
> The idea of non-extractive parsing is nice but there are a few issues:
>  1) the memory requirements are still much less than DOM's, but *way* more 
> than an event-driven model like SAX's. Cocoon, for example, would die if we 
> were to move to a parser like this one, especially under load spikes.
>  2) benchmarking against a dummy SAX content handler is completely 
> meaningless. in order for the API to be of any use, you have to create 
> strings, you can't simply pass pointers to char arrays around. I bet 
> that if the SAX parser could go on without creating strings, it would be 
> just as fast (xerces, in fact, does use a similar mechanism to return 
> you the character() SAX event, where the entire document is kept in 
> memory and the start/finish pointers are passed instead of a new array.
>  3) 90% of the slowness comes from 10% of the details in the XML spec, 
> which means that in order to stay fast, you need to sacrifice compliance... 
> which is not an option these days given how cheap silicon is.
> But don't get me wrong, I think there is something interesting in what 
> you are doing: I think it would be cool if you could serialize the 'tree 
> index' alongside the document on disk and provide some sort of b-tree 
> indexing for it. It would help me in my multi-GB-of-XML day2day struggle.
> You claim xpath random access, but what is the algorithmic complexity 
> of that? O(1), O(log(n)), O(n), O(n*log(n))? If one were to store the 
> parsed tree index on disk, how many pages would one need to page in 
> before reaching the required xpath?
> -- 
> Stefano.
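
The start/length mechanism Stefano mentions in point 2 is visible in the stock JDK SAX API. A minimal sketch, using only standard `javax.xml.parsers` / `org.xml.sax` types (the class and method names `CharactersDemo` / `textOf` are my own):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class CharactersDemo {
    // Returns the concatenated text content of an XML string. The only
    // String creation happens in StringBuilder.toString(), by our choice.
    public static String textOf(String xml) {
        StringBuilder sb = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX hands us a shared char buffer plus offsets, not a
                // String; materializing one is up to the handler.
                sb.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(textOf("<a>hello</a>"));
    }
}
```

A benchmark against a handler whose `characters()` body is empty never pays the String cost, which is why Stefano calls such comparisons meaningless for real use.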
