From: "KUMAR,PANKAJ (HP-Cupertino,ex1)"
To: 'Gopal Sharma', 'general@xml.apache.org', 'xerces-j-user@xml.apache.org'
Subject: RE: [Xerces2] Measuring performance and optimization
Date: Sun, 5 May 2002 19:58:45 -0700

Hi,

A few months ago I wrote a program to measure Java XML parsing performance. Maybe it could be of some use here. You can find details at http://www.pankaj-k.net/xpb4j/

I am not aware of Xerces internals, so whatever I say here may not make much sense, but one area where I feel optimization at the parser level can improve performance in server-based applications is reuse of the same String objects across parse runs. Let me elaborate -- a server program that accepts XML documents with every request comes across instances of documents from a small set of schemas. These documents use the same element names, attributes and namespace URIs again and again. If the same immutable String objects can be reused for these, there could be significant savings in allocations and deallocations. The problem is slightly complicated in that the identification of repeating Strings must happen at a much lower level, before a String object is created or looked up. What could do the job is perhaps some smart lookup during lexical analysis.
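To make this concrete, below is a rough sketch of the kind of lookup I have in mind. This is purely illustrative -- I don't know whether Xerces already does something like this internally, and all class and method names here are made up. The table is keyed directly on the scanner's character buffer, so a String is allocated only the first time a name is seen; every later occurrence, in any parse run, gets back the same instance.

import java.util.HashMap;
import java.util.Map;

// Illustrative name pool: returns the same immutable String instance for
// every occurrence of a name, across parse runs. Lookup is keyed on the
// raw char buffer, so no String is allocated unless the name is new.
// (Hypothetical sketch, not Xerces code.)
public class NamePool {
    private final Map pool = new HashMap();

    // Look up chars[offset..offset+length) without allocating a String.
    public String intern(char[] chars, int offset, int length) {
        CharKey probe = new CharKey(chars, offset, length);
        String existing = (String) pool.get(probe);
        if (existing != null) {
            return existing;                  // same instance as last time
        }
        String fresh = new String(chars, offset, length);
        // Store a key backed by its own copy; the caller's buffer is reused.
        pool.put(new CharKey(fresh.toCharArray(), 0, length), fresh);
        return fresh;
    }

    // Wraps a region of a char buffer so it can serve as a hash key.
    private static final class CharKey {
        final char[] chars;
        final int offset;
        final int length;

        CharKey(char[] chars, int offset, int length) {
            this.chars = chars;
            this.offset = offset;
            this.length = length;
        }

        public int hashCode() {
            int h = 0;
            for (int i = 0; i < length; i++) {
                h = 31 * h + chars[offset + i];
            }
            return h;
        }

        public boolean equals(Object o) {
            if (!(o instanceof CharKey)) return false;
            CharKey k = (CharKey) o;
            if (k.length != length) return false;
            for (int i = 0; i < length; i++) {
                if (k.chars[k.offset + i] != chars[offset + i]) return false;
            }
            return true;
        }
    }

    // Tiny demonstration: two occurrences of the same name in one buffer
    // yield the very same String object.
    public static void main(String[] args) {
        NamePool pool = new NamePool();
        char[] buf = "priceprice".toCharArray();
        String a = pool.intern(buf, 0, 5);
        String b = pool.intern(buf, 5, 5);
        System.out.println(a == b);   // true: same instance, one allocation
    }
}

The scanner would call intern() with its internal buffer whenever it finishes reading a name; after the first request, the element and attribute names of a schema cost a hash lookup instead of an allocation.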
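And on the measurement side, the heart of a harness like xpb4j is not much more than a timing loop. Below is a much-simplified sketch (illustrative, not the actual xpb4j code) using only standard JAXP calls; the run count is arbitrary, and the deferred-DOM feature URI is, as far as I know, the Xerces switch corresponding to the deferred-dom true/false case Rahul lists below.

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;
import java.io.File;

public class ParseTimer {
    public static void main(String[] args) throws Exception {
        File doc = new File(args[0]);   // instance document to measure
        int runs = 100;                 // arbitrary; enough to smooth out noise

        // --- SAX parsing: time taken ---
        SAXParser sax = SAXParserFactory.newInstance().newSAXParser();
        DefaultHandler handler = new DefaultHandler();
        sax.parse(doc, handler);        // warm-up run, not measured
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < runs; i++) {
            sax.parse(doc, handler);
        }
        long saxMillis = System.currentTimeMillis() - t0;
        System.out.println("SAX: " + (saxMillis / (double) runs) + " ms/parse");

        // --- DOM construction time, with deferred DOM on or off ---
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        // Xerces-specific feature: true builds lazy (deferred) DOM nodes.
        dbf.setAttribute("http://apache.org/xml/features/dom/defer-node-expansion",
                         Boolean.TRUE);
        DocumentBuilder db = dbf.newDocumentBuilder();
        db.parse(doc);                  // warm-up run, not measured
        t0 = System.currentTimeMillis();
        for (int i = 0; i < runs; i++) {
            db.parse(doc);
        }
        long domMillis = System.currentTimeMillis() - t0;
        System.out.println("DOM: " + (domMillis / (double) runs) + " ms/parse");
    }
}

Running it once with Boolean.TRUE and once with Boolean.FALSE shows the deferred-DOM difference for DOM construction; memory consumption would need a separate measurement, e.g. deltas of Runtime.totalMemory() - Runtime.freeMemory() around the parse.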
Regards,
Pankaj Kumar
Web Services Architect
HP Middleware

-----Original Message-----
From: Gopal Sharma
To: general@xml.apache.org; xerces-j-user@xml.apache.org
Sent: 5/5/02 7:18 AM
Subject: [Xerces2] Measuring performance and optimization

FYI

Hi,

I have forwarded this mail to _YOU_ (general and xerces-j-user) because you might be using *Xerces 2* in one way or another and could provide some data/details/suggestions/comments which would help us in this effort. Thanks in advance for your valuable suggestions and comments.

- Gopal

------------- Begin Forwarded Message -------------

Date: Fri, 3 May 2002 21:03:00 +0000 (Asia/Calcutta)
From: Rahul Srivastava
Subject: [xerces2] Measuring performance and optimization
To: xerces-j-dev@xml.apache.org

Hi folks,

There has been talk for a long time about improving the performance of Xerces2. Some benchmarking has been done earlier, for instance the work by Dennis Sosnoski, see: http://www.sosnoski.com/opensrc/xmlbench/index.html . These results are important to know how fast/slow Xerces is compared to other parsers. But we need to identify areas of improvement within Xerces itself. We need to measure the time taken by each individual component in the pipeline, figure out which component swallows how much time for various events, and then concentrate on improving performance in those areas.

So, here is what we plan to do:

+ SAX parsing
  - time taken
+ DOM parsing
  - DOM construction time
  - DOM traversal time
  - memory consumed
  - considering the feature deferred-dom as true/false for all of the above
+ DTD validation
  - one-time parse, time taken
  - parsing multiple times using the same instance, time taken for the second parse onwards
+ Schema validation
  - one-time parse, time taken
  - parsing multiple times using the same instance, time taken for the second parse onwards
+ optimising the pipeline
  - calculate pipeline/component initialization time
  - calculate the time each component in the pipeline takes to propagate an event
  - use configurations to set up an optimised pipeline for various cases, such as no validation, DTD validation only, etc., and calculate the time taken

Apart from this, should we consider the existing grammar caching framework when evaluating the performance of the parser?

We have classified the inputs to be used for this testing as follows:

+ instance docs used
  - tag centric (more tags and small content, say 10-50 bytes)

      Type       No. of tags
      ------------------------
      * small    5-50
      * medium   50-500
      * large    >500

  - content centric (few tags, say 5-10, and huge content)

      Type       Content between a pair of tags
      ----------------------------------------
      * small    500 kb
      * medium   500-5000 kb
      * large    >5000 kb

We can also use the depth of tag nesting as a criterion for the above cases. Actually speaking, there can be enormous combinations and different figures in the above tables that reflect the real-world instance docs used. I would like to know the view of the community here. Is this data enough to evaluate the performance of the parser? Is there any data publicly available that can be used for performance evaluation?

+ DTDs used
  - should use different types of entities
+ XML Schemas used
  - should use most of the elements and datatypes

Will it really help in any way? Any comments or suggestions appreciated.

Thanks,
Rahul.

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org

------------- End Forwarded Message -------------

---------------------------------------------------------------------
In case of troubles, e-mail: webmaster@xml.apache.org
To unsubscribe, e-mail: general-unsubscribe@xml.apache.org
For additional commands, e-mail: general-help@xml.apache.org