To: axis-dev@xml.apache.org
Cc: xerces-dev@xml.apache.org
Subject: RE: cvs commit: xml-axis/java/src/org/apache/axis/utils XMLUtils.java
From: "James M Snell"
Date: Mon, 2 Apr 2001 14:37:41 -0700

Comments inline:

- James Snell
  Software Engineer, Emerging Technologies, IBM
  jasnell@us.ibm.com (online)
  jsnell@lemoorenet.com (offline)

>Please respond to axis-dev@xml.apache.org
>To: axis-dev@xml.apache.org
>cc: xerces-dev@xml.apache.org, jhunter@collab.net
>Subject: RE: cvs commit: xml-axis/java/src/org/apache/axis/utils XMLUtils.java
>
>Hi Sam!
>
>I agree with the spirit of everything you say here.
>For the benefit of myself as well as the others who may not be as in tune
>with SOAP, I'm going to quickly run down some bullet points about the
>environment we're in. These are in no particular order, but cover what I
>consider the important facets of the job we have to do. This begins to
>describe our requirements, I hope.
>
>* SOAP is XML. It's basically structured as follows:
>
>  [XML example: markup lost in the archive]
>
>* Inside the header and body entries may be XML-encoded language objects,
>particularly ones which are encoded as specified in the SOAP spec [1]. The
>encoding (in section 5 of the spec) calls out the use of the XML Schema
>basic types, plus a few other rules about structures and arrays.
>
>* One feature of the SOAP section 5 encoding is "multi-ref accessors", which
>work like this:
>
>  [XML example: markup mostly lost in the archive; surviving fragments:]
>      xmlns:foo="urn:foo"
>      xmlns:xsi="schema-instance-uri"
>      xmlns:xsd="schema-data-uri">
>      5
>
>  (both the foo:header and the foo:body are references to the same integer)
>

Of lesser importance to this discussion but worth mentioning: within SOAP
(and the SOAP with Attachments (SWA) specification) the id/href mechanism
is also used to reference content located externally from the SOAP
Envelope. For example, if we were referencing some part in a SWA MIME
envelope, we'd have something like:

  [XML example: markup lost in the archive]

>* To deserialize multi-ref accessors, we may need to look arbitrarily far
>ahead in the document for the element with the correct id. This makes a
>straight-ahead "streaming" approach (process the XML in order as it comes
>in) somewhat challenging. Also, different pieces of code may desire to
>process particular headers in an order different from that in which they are
>serialized in the XML.
>
>* There is some concern that the XML, especially the body entries, may get
>to be really large (giant base-64-encoded documents, for instance), hence we
>are somewhat cautious about assuming we need to pull the whole document into
>memory before processing.
>I note that there is a school of thought here (to which I subscribe, btw)
>that says it's pilot error to try and send a huge chunk of data inside your
>XML; rather you should take such things and attach them per the SOAP with
>Attachments spec [2].
>
>* We need this stuff to be parsed into some usable form very quickly and
>efficiently.
>
>* Some developers will want direct access to the XML within a particular
>part of the envelope as DOM, or JDOM, or perhaps SAX events.
>
>* Graham Glass claims to parse XML into an internal object model (I suspect
>he parses the whole document before processing, btw) EXTREMELY quickly using
>his Electric XML parser [3]. This model is used for SOAP processing.
>
>* W3C XML Protocol [4] will be arriving on the scene at some point. We'd
>like to abstract out as much of the SOAPness as possible so that Axis can
>easily become XMLP-compatible as soon as possible.
>
>Is there other stuff I've left out, folks?
>

This is a pretty complete list, Glen... thanks for putting it together. I
was actually planning on chunking something like this out later this
afternoon... :-) If I think of anything else to add later, I will.

>OK, so as I said, I agree with Sam's points here. The first thing I'd like
>to do is some basic performance testing of various XML parsing models. I do
>not see a real streaming approach being all that viable for Axis v1.0 (I'm
>open to argument on that). If that is the case, we're talking about parsing
>the document into some object model. As I see it, we can either: 1) use a
>pre-existing model like DOM or JDOM, or 2) use SAX or a pull parser such as
>XPP to parse into our own SOAP-specific object model.
>
>Option 2 might be faster. Option 1 gains us a standard programming model
>(i.e. when developers ask us for JDOM/DOM we can just give it to them), plus
>perhaps a speedier development cycle.
>

My personal vote would be for #2.
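
To make option 2 a little more concrete, here is a rough sketch of SAX
events feeding a small SOAP-specific object model, with the multi-ref
accessors described above resolved only once the whole document has been
seen. SoapNode and MultiRefSketch are hypothetical names made up for this
illustration (they are not Axis classes), the sample envelope is an assumed
stand-in for the kind of message under discussion, and only local "#id"
references are handled; external references such as SWA attachments are
left unresolved.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical node type for illustration; not an actual Axis class.
class SoapNode {
    final String qName;
    final String href;                          // "#id" reference, or null
    final StringBuilder text = new StringBuilder();
    final List<SoapNode> children = new ArrayList<>();
    SoapNode target;                            // set once the href is resolved

    SoapNode(String qName, String href) { this.qName = qName; this.href = href; }

    // The referenced element's text if this node was resolved, else its own text.
    String value() { return (target != null ? target.text : text).toString().trim(); }

    // Depth-first lookup by qualified name; returns null if nothing matches.
    SoapNode find(String name) {
        if (qName.equals(name)) return this;
        for (SoapNode child : children) {
            SoapNode hit = child.find(name);
            if (hit != null) return hit;
        }
        return null;
    }
}

class MultiRefSketch {
    static SoapNode parse(String xml) {
        final Map<String, SoapNode> byId = new HashMap<>();
        final List<SoapNode> pending = new ArrayList<>();
        final Deque<SoapNode> stack = new ArrayDeque<>();
        final SoapNode[] root = new SoapNode[1];

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                SoapNode node = new SoapNode(qName, atts.getValue("href"));
                if (stack.isEmpty()) root[0] = node; else stack.peek().children.add(node);
                String id = atts.getValue("id");
                if (id != null) byId.put(id, node);       // possible multi-ref target
                if (node.href != null) pending.add(node); // resolve after parsing
                stack.push(node);
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                stack.peek().text.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                stack.pop();
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse envelope", e);
        }
        // The target may appear after the reference, so resolve at the end.
        for (SoapNode ref : pending) {
            if (ref.href.startsWith("#")) ref.target = byId.get(ref.href.substring(1));
        }
        return root[0];
    }

    public static void main(String[] args) {
        String xml =
            "<env:Envelope xmlns:env='http://schemas.xmlsoap.org/soap/envelope/'"
          + " xmlns:foo='urn:foo'>"
          + "<env:Header><foo:header href='#id1'/></env:Header>"
          + "<env:Body><foo:body href='#id1'/><multiRef id='id1'>5</multiRef></env:Body>"
          + "</env:Envelope>";
        SoapNode env = parse(xml);
        System.out.println(env.find("foo:header").value()); // prints 5
        System.out.println(env.find("foo:body").value());   // prints 5
    }
}
```

Note that this keeps the whole envelope in memory, which is exactly the
trade-off raised in the bullet points above; it only shows that the
look-ahead problem goes away once references are resolved after the parse.
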
I've been taking a look at a few different options regarding pull parsers
and how we might be able to approach it from an Axis standpoint. Here's
what I've come up with (these represent possible options; I'm not
recommending any one of them at this point):

1. We could create a simple pull parser interface on top of the Xerces
   XMLParser class in much the same way that Xalan-J v1.x does with its
   DTM implementation (see:
   http://xml.apache.org/websrc/cvsweb.cgi/xml-xalan/src/org/apache/xalan/xpath/dtm/ ).
   This is a Xalan proprietary interface; however, there happens to be a
   partial DOM implementation built on top of DTM (see the DTMProxy class
   at the above location).

   The advantage to this approach is that we can use Xerces, have the
   advantage of a pull parser, and can still use DOM.

   The drawback to this approach is that it still uses the Xerces parser
   -- the same one that the Xerces DOM implementation uses, and the same
   one that proves to be a bit of a drag on the performance side of
   things. Now, perhaps the Xerces2 guys can help us out with this: how
   much of a speed enhancement will the Xerces2 parser have over Xerces1?

2. We could mature XPP to a point where it is more usable (a task that is
   already in progress) and layer a SAX and DOM layer over the top.

   The advantage to this approach is XPP's size and speed (granted, the
   size will increase and the speed will decrease as the code matures,
   but not by a great deal).

   The disadvantage to this approach is that XPP is currently not an
   Apache product. Some would say that XPP's lack of full XML
   well-formedness checks is a disadvantage, but I would disagree -- a
   reasonable assumption can be made on behalf of the XML processor that
   the SOAP Envelopes it will be dealing with are well-formed, making
   well-formedness checks unnecessary. Those parts of the XML
   specification that are missing from XPP and prove to be necessary can
   be added.
   It would be relatively easy to layer a large enough usable subset of
   DOM on top of XPP to suit our purposes fine. And as a DOM
   implementation, XPP can be made to work with JAXP and other parsers
   without problem.

3. We could rewrite the Xerces Pull Parser interface at a low level to
   make it much faster and more directly usable.

   The advantage to this approach is that Xerces gets better and faster.

   The disadvantage to this approach is time. This would take some time
   to do, but I think that it would be well worth it. (I also think that
   this is our best long-term solution.)

>I'd like to do the simplest possible thing that gives us the desired
>results.
>

Of the three options above, 1 and 2 are the simplest to do.

>Jason, do you have any numbers/stats as to whether parsing into JDOM using
>SAX is faster than a typical DOM parse in, say, Xerces?
>

Axis Devs:

We should arrange a conference call or IRC chat soon for all of us to make
sure we're all on the same page as far as parser requirements are
concerned. We also need to spend some time putting together that list of
use case scenarios.

As Sam suggested, we need some numbers. Specifically, we need to get some
benchmarks for the major SOAP implementations currently available that
cover the following areas:

  * Parse Time - initially reading the SOAP envelope and making the
    content available to the application
  * Throughput - number of messages processed per second
  * Runtime Memory Footprint
  * Distribution Footprint (size of package)

(Which brings up another goal that I personally have for Axis -- I want it
to be small. Axis should be able to run on resource-limited devices.)

Xerces Devs:

Axis does not necessarily require a full-featured XML parser to do its
thing (the applications that are deployed within the Axis framework might
-- but the Axis engine itself does not). What we do need is an extremely
fast, extremely efficient way of quickly extracting information from an
XML structure.
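
For the "Parse Time" item in the benchmark list above, a first rough
measurement could look something like the sketch below, which times a DOM
parse against a bare SAX parse of the same synthetic envelope using the
JAXP classes. ParseTimeSketch, the element names and the message size are
all made up for illustration, and a loop like this is only a crude
indication: real numbers would need a proper harness, realistic messages,
and the actual parsers under discussion (Xerces, XPP, and so on) on the
classpath.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical micro-benchmark; names and message shape are assumptions.
class ParseTimeSketch {
    // Builds a synthetic envelope with the given number of body entries.
    static byte[] envelope(int entries) {
        StringBuilder sb = new StringBuilder("<Envelope><Header/><Body>");
        for (int i = 0; i < entries; i++) {
            sb.append("<item>").append(i).append("</item>");
        }
        return sb.append("</Body></Envelope>").toString().getBytes(StandardCharsets.UTF_8);
    }

    // Average nanoseconds per parse when building a full DOM tree.
    static long domNanosPerParse(byte[] xml, int iterations) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                builder.parse(new ByteArrayInputStream(xml));
            }
            return (System.nanoTime() - start) / iterations;
        } catch (Exception e) {
            throw new IllegalStateException("DOM parse failed", e);
        }
    }

    // Average nanoseconds per parse when only firing SAX events at a no-op handler.
    static long saxNanosPerParse(byte[] xml, int iterations) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            DefaultHandler noOp = new DefaultHandler();
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                parser.parse(new ByteArrayInputStream(xml), noOp);
            }
            return (System.nanoTime() - start) / iterations;
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = envelope(1000);
        // Warm up both code paths before taking the measurements that are reported.
        domNanosPerParse(xml, 200);
        saxNanosPerParse(xml, 200);
        System.out.println("DOM ns/parse: " + domNanosPerParse(xml, 1000));
        System.out.println("SAX ns/parse: " + saxNanosPerParse(xml, 1000));
    }
}
```

Throughput and memory footprint would need separate measurements; this
covers only the time to read the envelope, not the cost of making its
content available to the application.
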
This discussion covers what will be inside Axis, not necessarily what the
end user will see while they are writing applications that use Axis. We
are currently working on defining a standard Axis Message API that
provides a simple means of abstracting the artifacts of XML Messaging
(things like Headers, the Body, Attachments, etc.). For the most part,
developers using Axis will use this Message API, DOM or SAX to access the
contents of the message. Axis 1.0 will ship with a SOAP-specific
implementation of this Message API. Whatever parser we use will sit
beneath this implementation. The chances that the end user will ever have
direct contact with the underlying parser are next to nothing. That said,
we don't really give a flip if the underlying parser is standards
compliant or not, as long as we can layer standards support on top of it.

As I've already mentioned -- our number one core requirement for the
underlying XML parser in Axis is speed. Nothing else matters nearly as
much as that. We can deal with validation of specific components of the
SOAP Envelope at a higher level as needed.

>Over and out for now,
>
>--Glen
>
>[1] http://www.w3.org/TR/soap
>[2] http://www.w3.org/TR/SOAP-attachments
>[3] http://www.themindelectric.com/products/xml/xml.html
>[4] http://www.w3.org/2000/xp/
>