axis-java-dev mailing list archives

From "James M Snell" <jasn...@us.ibm.com>
Subject RE: cvs commit: xml-axis/java/src/org/apache/axis/utils XMLUtils.java
Date Mon, 02 Apr 2001 21:37:41 GMT
Comments inline:

- James Snell
     Software Engineer, Emerging Technologies, IBM
     jasnell@us.ibm.com (online)
     jsnell@lemoorenet.com (offline)

>Please respond to axis-dev@xml.apache.org 
>To:    axis-dev@xml.apache.org
>cc:    xerces-dev@xml.apache.org, jhunter@collab.net 
>Subject:       RE: cvs commit: xml-axis/java/src/org/apache/axis/utils XMLUtils.java
>
>
>
>
>Hi Sam!
>
>I agree with the spirit of everything you say here.  For the benefit of
>myself as well as the others who may not be as in tune with SOAP, I'm going
>to quickly run down some bullet points about the environment we're in.
>These are in no particular order, but cover what I consider the important
>facets of the job we have to do.  This begins to describe our requirements,
>I hope.
>
>* SOAP is XML.  It's basically structured as follows:
>
><SOAP-ENV:envelope xmlns:SOAP-ENV="insert-important-url-here">
> <SOAP-ENV:header>
>  <header-entry />
> </SOAP-ENV:header>
> <SOAP-ENV:body>
>  <body-entry />
> </SOAP-ENV:body>
></SOAP-ENV:envelope>
>
>* Inside the header and body entries may be XML-encoded language objects,
>particularly ones which are encoded as specified in the SOAP spec [1].  The
>encoding (in section 5 of the spec) calls out the use of the XML Schema
>basic types, plus a few other rules about structures and arrays.
>
>* One feature of the SOAP section 5 encoding is "multi-ref accessors", which
>work like this:
>
><SOAP-ENV:envelope xmlns:SOAP-ENV="insert-important-url-here"
>                   xmlns:foo="urn:foo"
>                   xmlns:xsi="schema-instance-uri"
>                   xmlns:xsd="schema-data-uri">
> <SOAP-ENV:header>
>  <foo:header href="#1" />
> </SOAP-ENV:header>
> <SOAP-ENV:body>
>  <foo:body href="#1" />
>  <foo:actualElement id="1" xsi:type="xsd:int">5</foo:actualElement>
> </SOAP-ENV:body>
></SOAP-ENV:envelope>
>
>  (both the foo:header and the foo:body are references to the same integer)
>

Of lesser importance to this discussion but worth mentioning:

Within SOAP (and the SOAP with Attachments (SWA) specification) the 
id/href mechanism is also used to reference content located outside 
the SOAP Envelope.  For example, if we were referencing some part in 
a SWA MIME envelope, we'd have something like:

<Envelope>
   <Body>
      <something href="cid:whatever" />
   </Body>
</Envelope>


>* To deserialize multi-ref accessors, we may need to look arbitrarily far
>ahead in the document for the element with the correct id.  This makes a
>straight-ahead "streaming" approach (process the XML in order as it comes
>in) somewhat challenging.  Also, different pieces of code may desire to
>process particular headers in an order different from that in which they are
>serialized in the XML.
>
>* There is some concern that the XML, especially the body entries, may get
>to be really large (giant base-64-encoded documents, for instance), hence we
>are somewhat cautious about assuming we need to pull the whole document into
>memory before processing.  I note that there is a school of thought here (to
>which I subscribe, btw) that says it's pilot error to try and send a huge
>chunk of data inside your XML; rather you should take such things and attach
>them per the SOAP with Attachments spec [2].
>
>* We need this stuff to be parsed into some usable form very quickly and
>efficiently.
>
>* Some developers will want direct access to the XML within a particular
>part of the envelope as DOM, or JDOM, or perhaps SAX events.
>
>* Graham Glass claims to parse XML into an internal object model (I suspect
>he parses the whole document before processing, btw) EXTREMELY quickly using
>his Electric XML parser [3].  This model is used for SOAP processing.
>
>* W3C XML Protocol [4] will be arriving on the scene at some point.  We'd
>like to abstract out as much of the SOAPness as possible so that Axis can
>easily become XMLP-compatible as soon as possible.
>
>Is there other stuff I've left out, folks?
>

This is a pretty complete list, Glen... thanks for putting it together.  I 
was actually planning on chunking something like this out later this 
afternoon... :-)   If I think of anything else to add later, I will.
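
To make Glen's multi-ref point concrete, here is a rough sketch of how a SAX-based reader might cope with forward references: it remembers every href it sees and only resolves them once the whole document has gone by.  The class and method names (MultiRefSketch, resolve) are made up for illustration and are not Axis code; it also assumes the SOAP 1.1 "href" attribute and that the referenced elements hold simple text values.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not part of Axis.
class MultiRefSketch {

    /** Collects id'd values and defers href lookups until the document ends. */
    static class Handler extends DefaultHandler {
        final Map<String, String> valuesById = new HashMap<>();
        // Each pending entry is {referring element name, target id}.
        final List<String[]> pending = new ArrayList<>();
        final Map<String, String> resolved = new HashMap<>();
        private String currentId;      // id of the leaf element being read, if any
        private StringBuilder text;    // its character content so far

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            String href = atts.getValue("href");
            String id = atts.getValue("id");
            if (href != null && href.startsWith("#")) {
                // Local reference: the target may not have been seen yet.
                pending.add(new String[] { qName, href.substring(1) });
            } else if (id != null) {
                // Assumption: id'd elements are leaves holding a simple value.
                currentId = id;
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (text != null) text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (currentId != null) {
                valuesById.put(currentId, text.toString().trim());
                currentId = null;
                text = null;
            }
        }

        @Override
        public void endDocument() {
            // Every id has been seen by now, so forward references can be fixed up.
            for (String[] p : pending) resolved.put(p[0], valuesById.get(p[1]));
        }
    }

    /** Returns a map from referring element name to the referenced value. */
    static Map<String, String> resolve(String xml) throws Exception {
        Handler h = new Handler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), h);
        return h.resolved;
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<SOAP-ENV:envelope xmlns:SOAP-ENV='insert-important-url-here'"
          + " xmlns:foo='urn:foo' xmlns:xsi='schema-instance-uri'"
          + " xmlns:xsd='schema-data-uri'>"
          + "<SOAP-ENV:header><foo:header href='#1'/></SOAP-ENV:header>"
          + "<SOAP-ENV:body><foo:body href='#1'/>"
          + "<foo:actualElement id='1' xsi:type='xsd:int'>5</foo:actualElement>"
          + "</SOAP-ENV:body></SOAP-ENV:envelope>";
        System.out.println(resolve(xml));
    }
}
```

Note that nothing can be handed to the application until endDocument fires, which is exactly why a pure streaming model is awkward here.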


>OK, so as I said, I agree with Sam's points here.  The first thing I'd like
>to do is some basic performance testing of various XML parsing models.  I do
>not see a real streaming approach being all that viable for Axis v1.0 (I'm
>open to argument on that).  If that is the case, we're talking about parsing
>the document into some object model.  As I see it, we can either: 1) use a
>pre-existing model like DOM or JDOM, or 2) use SAX or a pull parser such as
>XPP to parse into our own SOAP-specific object model.
>
>Option 2 might be faster.  Option 1 gains us a standard programming model
>(i.e. when developers ask us for JDOM/DOM we can just give it to them), plus
>perhaps a speedier development cycle.
>

My personal vote would be for #2.  I've been taking a look at a few 
different options regarding pull parsers and how we might be able to 
approach it from an Axis standpoint.  Here's what I've come up with (these 
represent possible options, I'm not recommending any one of them at this 
point):

1. We could create a simple pull parser interface on top of the Xerces
   XMLParser class in much the same way that Xalan-J v1.x does with its DTM
   implementation

   (see: http://xml.apache.org/websrc/cvsweb.cgi/xml-xalan/src/org/apache/xalan/xpath/dtm/ )

   This is a Xalan proprietary interface; however, there happens to be a
   partial DOM implementation built on top of DTM.  (see the DTMProxy class
   at the above location)

   The advantage to this approach is that we can use Xerces, have the
   advantage of a pull parser, and can still use DOM.

   The drawback to this approach is that it still uses the Xerces parser --
   the same one that the Xerces DOM implementation uses, and the same one
   that proves to be a bit of a drag on the performance side of things.

   Now, perhaps the Xerces2 guys can help us out with this:  how much of a
   speed enhancement will the Xerces2 parser have over Xerces1?

2. We could mature XPP to a point where it is more usable (a task that is
   already in progress) and add a SAX and DOM layer over the top.

   The advantage to this approach is XPP's size and speed (granted, the size
   will increase and the speed will decrease as the code matures, but not by
   a great deal).

   The disadvantage to this approach is that XPP is currently not an Apache
   product.  Some would say that XPP's lack of full XML well-formedness
   checks is a disadvantage, but I would disagree -- the XML processor can
   reasonably assume that the SOAP Envelopes it will be dealing with are
   well-formed, making well-formedness checks unnecessary.  Those parts of
   the XML specification that are missing from XPP that prove to be necessary
   can be added.

   It would be relatively easy to layer a large enough usable subset of DOM
   on top of XPP to suit our purposes fine.  And as a DOM implementation, XPP
   can be made to work with JAXP and other parsers without problem.

3. We could rewrite the Xerces Pull Parser interface at a low level to make
   it much faster and more directly usable.

   The advantage to this approach is that Xerces gets better and faster.

   The disadvantage to this approach is time.  This would take some time to
   do but I think that it would be well worth it.  (I also think that this
   is our best long term solution)
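
For anyone who hasn't used a pull parser, here is a rough sketch of what "pull parse into our own SOAP-specific object model" could look like.  It is only an illustration under stated assumptions: it uses the standard javax.xml.stream (StAX) API as a stand-in for XPP or a Xerces pull interface, and that API is newer than this message.  The Envelope class and the parse method are hypothetical names, not Axis classes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only; StAX stands in for XPP here.
class EnvelopeModelSketch {

    /** Minimal made-up model: just the local names of header and body entries. */
    static class Envelope {
        final List<String> headerEntries = new ArrayList<>();
        final List<String> bodyEntries = new ArrayList<>();
    }

    static Envelope parse(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        Envelope env = new Envelope();
        int depth = 0;               // 1 = envelope, 2 = header/body, 3 = entries
        List<String> target = null;  // list that depth-3 elements are added to
        while (r.hasNext()) {
            // The caller asks for the next event; nothing is pushed at us.
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
                String name = r.getLocalName();
                if (depth == 2) {
                    if (name.equalsIgnoreCase("header")) {
                        target = env.headerEntries;
                    } else if (name.equalsIgnoreCase("body")) {
                        target = env.bodyEntries;
                    } else {
                        target = null;
                    }
                } else if (depth == 3 && target != null) {
                    target.add(name);
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
        r.close();
        return env;
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<SOAP-ENV:envelope xmlns:SOAP-ENV='insert-important-url-here'>"
          + "<SOAP-ENV:header><header-entry/></SOAP-ENV:header>"
          + "<SOAP-ENV:body><body-entry/></SOAP-ENV:body>"
          + "</SOAP-ENV:envelope>";
        Envelope env = parse(xml);
        System.out.println("headers: " + env.headerEntries);
        System.out.println("body:    " + env.bodyEntries);
    }
}
```

The point of the pull style is that the loop above could stop as soon as it has what it needs (say, after the headers) and leave the rest of the stream unread.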

>I'd like to do the simplest possible thing that gives us the desired
>results.
>

Of the three options above, 1 and 2 are the simplest to do.

>Jason, do you have any numbers/stats as to whether parsing into JDOM using
>SAX is faster than a typical DOM parse in, say, Xerces?
>

Axis Devs:

We should arrange a conference call or IRC chat soon for all of us to make 
sure we're all on the same page as far as parser requirements are concerned. 
We also need to spend some time putting together that list of use case 
scenarios. 

As Sam suggested, we need some numbers.  Specifically, we need to get some 
benchmarks for the major SOAP implementations currently available that 
cover the following areas:

  * Parse Time - initially reading the SOAP envelope and making the content
    available to the application
  * Throughput - number of messages processed per second
  * Runtime Memory Footprint
  * Distribution Footprint (size of package)
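
As a starting point for the first two items, a measurement harness might look something like the sketch below.  It is a rough illustration, not a proper benchmark: it times a JAXP DOM parse only, does no JVM warm-up, and the class and method names (ParseBenchmarkSketch, messagesPerSecond) are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical illustration only; a real comparison would need warm-up runs,
// realistic messages, and one harness per parser or SOAP implementation.
class ParseBenchmarkSketch {

    /** Parses the same message repeatedly and returns messages per second. */
    static double messagesPerSecond(byte[] message, int iterations) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            builder.parse(new ByteArrayInputStream(message));
        }
        // Guard against a zero interval on very coarse clocks.
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        return iterations * 1e9 / elapsedNanos;
    }

    public static void main(String[] args) throws Exception {
        byte[] message = (
            "<SOAP-ENV:envelope xmlns:SOAP-ENV='insert-important-url-here'>"
          + "<SOAP-ENV:header><header-entry/></SOAP-ENV:header>"
          + "<SOAP-ENV:body><body-entry/></SOAP-ENV:body>"
          + "</SOAP-ENV:envelope>").getBytes(StandardCharsets.UTF_8);
        int iterations = 10000;
        double rate = messagesPerSecond(message, iterations);
        System.out.printf("throughput: %.0f messages/sec%n", rate);
        System.out.printf("mean parse time: %.3f ms%n", 1000.0 / rate);
    }
}
```

The same loop could be pointed at a SAX or pull-style reader to compare parsing models on identical input.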

(Which brings up another goal that I personally have for Axis -- I want it 
to be small.  Axis should be able to run on resource-limited devices.)


Xerces Devs:

Axis does not necessarily require a full-featured XML parser to do its 
thing (the applications that are deployed within the Axis framework might 
-- but the Axis engine itself does not).  What we do need is an extremely 
fast, extremely efficient way of quickly extracting information from an 
XML structure.  This discussion covers what will be inside Axis, not 
necessarily what the end user will see while they are writing applications 
that use Axis. 

We are currently working on defining a standard Axis Message API that 
provides a simple means of abstracting the artifacts of XML Messaging 
(things like Headers, the Body, Attachments, etc).  For the most part, 
developers using Axis will use this Message API, DOM or SAX to access the 
contents of the message. 

Axis 1.0 will ship with a SOAP-specific implementation of this Message 
API.  Whatever parser we use will sit beneath this implementation.  The 
chances that the end user will ever have direct contact with the 
underlying parser are next to nothing.  That said, we don't really give a 
flip if the underlying parser is standards compliant or not, as long as we 
can layer standards support on top of it.

As I've already mentioned -- our number one core requirement for the 
underlying XML parser in Axis is speed.  Nothing else matters nearly as 
much as that.  We can deal with validation of specific components of the SOAP 
Envelope at a higher level as needed. 


>Over and out for now,
>
>--Glen
>
>[1] http://www.w3.org/TR/soap
>[2] http://www.w3.org/TR/SOAP-attachments
>[3] http://www.themindelectric.com/products/xml/xml.html
>[4] http://www.w3.org/2000/xp/
>

