commons-dev mailing list archives

From Simon Kitching <skitch...@apache.org>
Subject Re: [digester2] performance of ns-aware parsing
Date Sun, 06 Feb 2005 01:11:13 GMT
On Thu, 2005-02-03 at 07:52 -0800, Reid Pinchback wrote: 
> --- Simon Kitching <skitching@apache.org> wrote:
> 
> > On Wed, 2005-02-02 at 20:45 -0800, Reid Pinchback wrote:
> > Of course if someone can demonstrate that non-namespace-aware parsers
> > *are* still useful then I'll change my mind.
> 
> Just to clarify, since I was being sloppy before (I gotta
> stop typing in shorthand) there is an important distinction:
> 
> a) having NS-aware parser, always using NS-aware API methods
> b) having NS-aware parser, selectively using NS-aware API methods
> c) having non-NS-aware parser (and obviously never using NS-aware API methods)
> d) having NS-aware parser where the developer fixes a grammar that
>    ignores any NS distinctions
> 


> Even for Sax the performance difference between (a) and (b) is roughly 
> a factor of 2 across all parsers when processing small (typical message-sized) 
> docs that don't use NS. 

I would *really* love to see some actual measurements on this if you can
find some. You seem to be quoting from some study you have done or read
- it would be great to have this. [See comments on Piccolo below]


>  Mucking with (d) is supposed to result in significant
> wins when you tune the grammar handling to your app, but I haven't tried it 
> myself and I've never seen timing differences quoted.  
> 

I don't quite understand what (d) means, but is it actually relevant?
Again, we are talking about *namespaces* not validation.

The W3C namespaces spec clearly makes a distinction between namespaces
and whether or not the namespace URI "means" anything:

<quote source="http://www.w3c.org/TR/xml-names11/">
Note also that the Namespaces specification says nothing about what
might (or might not) happen if one were to attempt to dereference a
URI/IRI used to identify a namespace.
</quote>

What I'm trying to achieve is to avoid having actions or patterns deal
with element-names containing prefixes, eg stating that an element's
name is "foo:item". This is just broken; the item's name is really the
tuple (some-namespace, item).

Grammars/schemas can optionally be bound to namespaces, but namespaces
themselves are a lower layer that can be used without any of these
things. I'm talking here about requiring the parser to convert
<foo:item> into (namespace, item); I do not intend to imply that any
kind of schema should be loaded for the specified namespace.

The SAXParserFactory.setNamespaceAware(true) method does exactly this: it
enables mapping of prefixes -> namespaces, but does not enable processing
of either DTDs or schemas.
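
As a minimal sketch of the (namespace, item) idea, assuming the JAXP
SAXParserFactory API: the class name NamespaceTupleDemo and the URI
urn:example:ns are made up for illustration and are not part of digester.
The handler records each element as "{namespaceURI}localName", so the
prefix chosen by the document author plays no part in the result.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of digester.
final class NamespaceTupleDemo {

    /** Parses xml and returns every element as "{namespaceURI}localName". */
    static List<String> elementNames(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);  // map prefixes to namespace URIs
        factory.setValidating(false);     // no DTD validation requested
        SAXParser parser = factory.newSAXParser();

        final List<String> names = new ArrayList<String>();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // uri is "" for elements that are in no namespace
                names.add("{" + uri + "}" + localName);
            }
        });
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Same namespace URI bound to two different prefixes.
        String withFoo = "<foo:item xmlns:foo='urn:example:ns'><plain/></foo:item>";
        String withBar = "<bar:item xmlns:bar='urn:example:ns'><plain/></bar:item>";
        System.out.println(elementNames(withFoo));
        System.out.println(elementNames(withBar));
    }
}
```

Both documents yield the same list of names, which is the property that
patterns matching on the literal string "foo:item" cannot offer.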


> I'm not trying to advocate any approach except to notice that, since your 
> README mentioned requiring a namespace-aware parser, it sounded like 
> there was a potential for options (b), (c), and (d) to become unintentionally
> closed to developers in Digester2 when they weren't in Digester1. 

Well, I did intend to close options (b) and (c) as I didn't believe
there was any reason at all to support them. Some real measurements
showing the kind of performance you quote would definitely change my
mind.

>  I agree
> that old parsers providing (c) aren't particularly interesting, but
> if you spend any time tracing through the guts of the parsing, particularly
> when you see how DTDs are loaded for entity resolution, you begin to see 
> (d) as having potential.  Throwing (b) away may result in less code in
> Digester2, but it may be worth doing some timing tests to see if that 
> code reduction is consequence-free.

What does loading DTDs have to do with namespaces?


> > I still find it hard to believe that leaving out namespace support makes
> > a performance difference. The parser needs to keep a map of
> >    prefix->(stack of namespace)
> > and that's about it. 
> 
> Actually the XML spec distinguishes between the default namespace
> and all other namespaces, so parsers can reasonably make the same
> distinction and try to avoid a bunch of per-entity operations and 
> temporary object creations in the case where there is no namespace.

Sorry, what per-entity operations, and what temporary object creations?
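
For reference, the prefix->(stack of namespace) map from the quoted text
above could look roughly like the following. This is only an illustrative
sketch; the class PrefixScopes and its method names are invented here, and
no real parser is claimed to be implemented this way.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of namespace scoping; not taken from any parser.
final class PrefixScopes {
    // One stack of URIs per prefix; the top of the stack is the binding in scope.
    private final Map<String, Deque<String>> bindings =
            new HashMap<String, Deque<String>>();

    /** Called for each xmlns:prefix="uri" seen on a start tag. */
    void declare(String prefix, String uri) {
        Deque<String> stack = bindings.get(prefix);
        if (stack == null) {
            stack = new ArrayDeque<String>();
            bindings.put(prefix, stack);
        }
        stack.push(uri);
    }

    /** Called at the matching end tag to restore the outer binding. */
    void undeclare(String prefix) {
        Deque<String> stack = bindings.get(prefix);
        if (stack != null && !stack.isEmpty()) {
            stack.pop();
        }
    }

    /** Returns the URI currently bound to prefix, or null if unbound. */
    String resolve(String prefix) {
        Deque<String> stack = bindings.get(prefix);
        return (stack == null || stack.isEmpty()) ? null : stack.peek();
    }

    public static void main(String[] args) {
        PrefixScopes scopes = new PrefixScopes();
        scopes.declare("foo", "urn:example:outer");
        scopes.declare("foo", "urn:example:inner");   // nested redeclaration
        System.out.println(scopes.resolve("foo"));
        scopes.undeclare("foo");                       // leave the inner element
        System.out.println(scopes.resolve("foo"));
    }
}
```

For a document that declares no namespaces at all, declare() is never
called and the map stays empty.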

> Look at the piccolo stats published on Sourceforge.  Compare Soap, 
> Soap+NS, and random XML-no NS timings and it suggests that NS 
> ain't free.
> 
> Useful links:
> 
>   Jade (now part of Javolution) http://javolution.org/api/index.html,
>   look at the javolution.xml package (trades String for CharSequence
>   to increase performance, but keeps NS)

Hmm.. I've added a reference to javolution to the wiki. 

However I couldn't find any info on the performance of namespaceAware vs
nonNamespaceAware...

> 
>   Piccolo you probably already have the link for, but for anybody
>   else interested: http://piccolo.sourceforge.net

Piccolo does have a page where they state that their performance test for
"SOAP - namespaces off" is about 12% faster than "SOAP - namespaces on".
But there is no further info on what these phrases mean.

The piccolo site provides a download for "SAXBench" benchmarking tool,
but (a) I never managed to get this working, and (b) it doesn't seem to
include the SOAP tests referenced anyway.

http://piccolo.sourceforge.net/bench.html

> 
>   Zapthink comments on XML parsing challenges,
>   http://searchwebservices.techtarget.com/originalContent/0,289142,sid26_gci858888,00.html

No occurrence of the word "namespace" anywhere in the article.

> 
>   Developerworks articles on XML performance,
>   http://www-106.ibm.com/developerworks/xml/library/x-perfap1.html
> 

This article had this paragraph:
<quote>
You should also avoid using namespaces in your applications unless
they're absolutely necessary. Processing a document with the namespace
feature enabled can slow the processing of the whole document. A parser
not only processes namespace declarations, verifying their correctness,
but it also ensures that an XML document is namespace well-formed.
</quote>
but I believe this refers only to code that builds DOMs then serializes
them; during serialization the DOM tree is checked to make sure all
elements have valid namespace declarations. This is not relevant to
digester.

>   Sun articles on XML performance,
>   http://java.sun.com/developer/technicalArticles/xml/JavaTechandXML_part3/

This article didn't seem to have any performance info about namespaces. 





So in summary: 

My instincts still tell me that:
* for documents that don't use namespaces, enabling namespace-aware
parsing will have no impact at all. 
* for documents that do use namespaces, sane coders will want proper
namespace-aware support anyway
* performance maniacs of the sort who would deliberately process
documents with namespaces using a non-namespace-aware parser in order to
get faster performance are out of luck and will have to wear a
performance hit of about 1%. Or they can patch digester themselves.

The piccolo stats suggest they tested *something* to do with namespaces
and got a 12% hit, but as no further details are provided it's hard to
tell whether this is relevant or not.

For the moment, therefore, I don't intend to add non-ns-aware-parser
support for digester2. Anyone else is very welcome to provide a proper
performance test that proves me wrong, at which time I will offer my
congratulations and personally commit their patch to add this feature.
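
A starting point for such a test might look like the sketch below. It
assumes the JAXP SAXParserFactory API and whatever parser the JVM provides
by default; the class NsParseTiming, the sample document and the iteration
counts are all invented for illustration. It is a naive wall-clock loop,
not a rigorous benchmark, so any numbers it prints would need to be
confirmed with longer runs and real message-sized documents.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical timing harness; not part of digester or SAXBench.
final class NsParseTiming {

    /** Trivial handler so the parser has to deliver events to something. */
    static final class Counter extends DefaultHandler {
        int elements;
        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            elements++;
        }
    }

    static SAXParser newParser(boolean nsAware) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(nsAware);
        return factory.newSAXParser();
    }

    /** Parses xml once and returns the number of start-element events. */
    static int countElements(String xml, boolean nsAware) throws Exception {
        Counter counter = new Counter();
        newParser(nsAware).parse(new InputSource(new StringReader(xml)), counter);
        return counter.elements;
    }

    /** Returns elapsed nanoseconds for parsing xml the given number of times. */
    static long timeNanos(String xml, boolean nsAware, int iterations) throws Exception {
        SAXParser parser = newParser(nsAware);  // created outside the timed loop
        Counter counter = new Counter();
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            parser.parse(new InputSource(new StringReader(xml)), counter);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        // Small document that uses no namespaces at all.
        String xml = "<order><item id='1'>a</item><item id='2'>b</item></order>";
        int warmup = 20000;
        int runs = 200000;
        timeNanos(xml, true, warmup);   // let the JIT settle before measuring
        timeNanos(xml, false, warmup);
        System.out.println("ns-aware:     " + timeNanos(xml, true, runs) + " ns");
        System.out.println("non-ns-aware: " + timeNanos(xml, false, runs) + " ns");
    }
}
```

Swapping in a document that does use prefixes, or selecting a different
parser implementation, would cover the other cases discussed in this
thread.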



Regards,

Simon



