Mailing-List: contact general-help@xml.apache.org; run by ezmlm
Delivered-To: mailing list general@xml.apache.org
Subject: Re: parser-next-gen goals, plan, and requirements
To: general@xml.apache.org
From: "Scott Boag/CAM/Lotus"
Date: Wed, 12 Jul 2000 13:59:46 -0400
X-Mailer: Lotus Notes Release 5.0.1 July 16, 1999
MIME-Version: 1.0
Content-type: text/plain; charset=us-ascii

James Duncan Davidson wrote:

> +1 -- note that this could be an "optional" DOM shipped as an external .jar
> file. In fact, I'd like to see as a requirement the ability to build into a
> set of jars that reflects the modules so that it's clear how to assemble a
> stripped down parser for whatever use.

I would like to see this "layered" above the read-write DOM, so that the
read-write DOM would work exactly as well. This would go a long way toward
decreasing testing pain.

> Ok, these are all requirements on the DOM module. Which one, Read-Only,
> Read-Write, or both?

As I said, preferably on the read-only, from which the read-write could be
derived. We'll always have to handle a *pure* DOM interface, without these
features, but it would be nice if performance doesn't massively degrade if
you choose to use a read-write DOM.

> > 3) parse-next function, with added control over buffer size.
>
> Explain more.
> Would this be the ability to feed in an input source that says
> "grab 16K at a time from the underlying stream and feed it into the parser"?

Exactly. Xerces1 does this now, but I don't think you have much control over
the size of the parse block.

> Once again, explain more. I think that basic programming tenets say that
> if I hand a buffer to a consumer, whether via the SAX provider or any other
> provider, I'm not going to muck with it until it's released.

SAX calls a character event like so:

    /**
     * Receive notification of character data inside an element.
     *
     * @param ch The characters.
     * @param start The start position in the character array.
     * @param length The number of characters to use from the
     *               character array.
     * @exception org.xml.sax.SAXException Any SAX exception, possibly
     *            wrapping another exception.
     * @see org.xml.sax.ContentHandler#characters
     */
    public void characters (char ch[], int start, int length)
        throws SAXException;

Typically, you will see the same array reference passed over and over, with
the start argument increasing; then the parser will fill up its buffer and
start from the beginning again. What this means is that the processor has to
copy the characters over to a new array... so you end up with the characters:

1) copied from the original byte stream to a Unicode array with entities
   expanded, etc.,
2) copied from the SAX event buffer to a stable character array/string for
   the source tree (DOM) to hold,
3) processed character-by-character for re-encoding to the result stream.

It would be nice to not do (2) when feasible. Note that, often (80% of the
time?), you end up with the exact array of bytes in the output that you had
in the input... which is one reason why a tight coupling with the parser
could be used to get really fast performance... if the right context is
sensed, the bytes could be copied directly from the input to the output.
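The copy in step (2) falls out of the callback contract above: the parser owns ch[] and may reuse it for later events, so a consumer that wants to keep the text must copy it out during the event. A minimal sketch (the class name and sample document are illustrative, not from this thread):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class CharCopyDemo {

    // Parses xml and returns its concatenated character data.
    public static String collectText(String xml) throws Exception {
        final StringBuilder sb = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may overwrite ch[] after this call returns,
                // so the data must be copied out now -- this is copy (2).
                sb.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectText("<a>hello <b>world</b></a>"));
        // prints: hello world
    }
}
```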
Note that issues like this are still a far bigger win, in my (potentially
flawed) estimation, than compiling a stylesheet to bytecodes. I've talked to
James Clark about this, who has much longer-ranging experience in this type
of processing than I, and he agrees.

> I've intentionally left out serialization as a discussion point to date as
> it's not parser, it's part of the larger toolset.

Hmm... perhaps we should start the architecture (but not the requirements...
in some sense it's immaterial to the requirements where the feature comes
from) from this bigger picture.

> Compiled stylesheets are something that would be different than a parser in
> my mind -- wouldn't this be something that sits at the Xalan level?

I didn't mean to imply otherwise... but I don't think the microparser would
be at the Xalan level. I could be wrong. From what I could tell by dumping
the Translets jar, they seemed to have a separate parser, not a tightly
bound one. I haven't been able to get to the bottom of where parsing
services fit into the Translets picture, in Sun's vision of it.

> (Hopefully with those folks at Sun helping out :).

No need to duck, I'm hopeful too (as long as it is a collaborative effort).

> Right. What I'm thinking is a build target that produces:
>
> parser-core.jar
> validator-dtd.jar
> validator-schema.jar
> producer-sax.jar
> producer-domrw.jar
> producer-domro.jar
> producer-jdom.jar

Fine.

> I would actually make our target goal 1.2/1.3, with 1.1
> compatibility (possibly, I'd actually really like to push forward so that
> collections and whatever else can be used).

I'm fine with that. I've been toying with using weak references in Xalan2,
which would break 1.1 compatibility. I really need them. Are there any
opinions from the Cocoon folks on this (from the standpoint of Xalan)?

> Ok. Should I propose an apache-auc module to the joint jakarta/xml efforts
> to collect these sorts of things?

Fine by me.
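The weak-reference idea Scott mentions (and requirement 2c earlier in the thread: nodes released when unreferenced, rebuilt on request) can be sketched as a cache whose entries the VM may reclaim, with a rebuild function run on a miss. java.lang.ref arrived in JDK 1.2, which is exactly why this breaks 1.1 compatibility. The names here (WeakNodeCache, the rebuild function, the String stand-in for node data) are hypothetical, not from any Xalan code:

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

public class WeakNodeCache {
    // Node data held weakly: the GC may clear entries under memory pressure.
    private final Map<Integer, WeakReference<String>> cache = new HashMap<>();
    private final IntFunction<String> rebuild; // re-creates a node on a miss

    public WeakNodeCache(IntFunction<String> rebuild) {
        this.rebuild = rebuild;
    }

    public String get(int index) {
        WeakReference<String> ref = cache.get(index);
        String node = (ref == null) ? null : ref.get();
        if (node == null) {
            // Collected, or never built: rebuild and re-cache it.
            node = rebuild.apply(index);
            cache.put(index, new WeakReference<>(node));
        }
        return node;
    }

    public static void main(String[] args) {
        WeakNodeCache cache = new WeakNodeCache(i -> "node#" + i);
        System.out.println(cache.get(3)); // prints: node#3
    }
}
```

The caller never sees whether the node survived in memory or was rebuilt, which is what makes the footprint scalable rather than fixed.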
I need:

1) Probably Heap and HeapObject (though I have not started using this yet).
   Implementation is no big deal, so I'm just as happy to have a proprietary
   one in Xalan.
2) ObjectPool object. Implementation is no big deal, so I'm just as happy to
   have a proprietary one in Xalan.
3) PrefixResolver -- the class that implements this interface can resolve
   prefixes to namespaces -- needed for XPath interface support.
4) QName -- object to represent a qualified name; needs to be shared with
   the Serializer. Currently part of TRaX, should be in SAX.
5) SystemIDResolver -- take a SystemID string and try to turn it into a good
   absolute URL (see org.apache.xerces.utils.URI, which is a great bit of
   work, in my opinion).
6) UnImplNode -- implements Node, Element, NodeList, Document; throws an
   exception on any method that is not overridden. Very useful for
   implementing DOM subsets.
7) DOMBuilder -- takes SAX events and adds the result to a document or
   document fragment.
8) Maybe the string pool -- depending on whether we keep the DTM or not
   (undecided issue).

-scott


James Duncan Davidson
To:
Subject: Re: parser-next-gen goals, plan, and requirements
07/11/2000 06:53 PM
Please respond to general

on 7/11/00 2:42 PM, Scott Boag/CAM/Lotus at Scott_Boag@lotus.com wrote:

> First, I would rather see a list of requirements first, rather than goals.
> The goals below are simply mom and apple pie, in my opinion. The devil's
> in the details.

He he he...

> 1) SAX2, of course.

+1

> 2) Read-only, memory conservative, high performance DOM subset. In some
> ways, this is optional, since the alternative is that the XSLT processor
> implement its own DOM, as it does today. But it would be neat and simpler
> if only one DOM implementation needed to exist.

+1 -- note that this could be an "optional" DOM shipped as an external .jar
file.
In fact, I'd like to see as a requirement the ability to build into a set of
jars that reflects the modules so that it's clear how to assemble a stripped
down parser for whatever use.

> 2a) Document-order indexes or API as a DOM extension. I know of few or
> no conformant XSLT processors that can do without this.
> 2b) [optional] isWhite() method as a DOM extension (pure telling of
> whether or not the text contains non-whitespace), for performance reasons.
> 2c) Some sort of weak reference, where nodes could be released if not
> referenced, and then rebuilt if requested. For performance and memory
> footprint.

Ok, these are all requirements on the DOM module. Which one, Read-Only,
Read-Write, or both?

> 3) parse-next function, with added control over buffer size.

Explain more. Would this be the ability to feed in an input source that says
"grab 16K at a time from the underlying stream and feed it into the parser"?
This puts a requirement on the parser to be able to parse in increments, and
a requirement on all the providers to higher level services to provide data
to their consumers without having the full picture.

> 4) Some sort of way to tell if a SAX char buffer is going to be
> overwritten, so data doesn't have to be copied until this occurs.

Once again, explain more. I think that basic programming tenets say that if
I hand a buffer to a consumer, whether via the SAX provider or any other
provider, I'm not going to muck with it until it's released.

> 5) Serialization support, as is currently in Assaf's classes.

I've intentionally left out serialization as a discussion point to date as
it's not parser, it's part of the larger toolset. In my world view, it seems
that the serialization (or better called output or externalization in my
mind, since serialization carries a specific meaning in the Java sense) sits
on the other side of the producers from the parser in the diagram that I
threw out.
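The "parse-next" behavior discussed above -- the caller requesting the next increment instead of receiving a full push of events, with no extra thread -- is essentially pull parsing, which Java later standardized as StAX. A sketch using that later API (which postdates this thread; the class name and sample document are illustrative):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PullParseDemo {

    // Pulls parse events one at a time and collects the character data.
    public static String pullText(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(xml));
        StringBuilder sb = new StringBuilder();
        while (r.hasNext()) {
            // The caller drives the parse: each next() is one increment,
            // so an incremental parse/transform needs no second thread.
            if (r.next() == XMLStreamConstants.CHARACTERS) {
                sb.append(r.getText());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(pullText("<a>one<b>two</b></a>"));
        // prints: onetwo
    }
}
```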
> 6) Schema data-type support, which will be needed for XSLT2, and Xalan 2.0
> extensions.

Right -- pluggable validators should include Schema, DTD, and possibly Relax
if somebody wants to take a crack at it.

> 7) We should talk about whether XPath should be part of the core XML
> services, rather than part of the XSLT processor.

Yes we should. My initial thoughts are no, but...

> 8) Small core footprint for standalone, compiled stylesheet capability, for
> use on small devices. This would need to include the Serializer. I'm not
> sure if this should really be a separate micro-parser?

Compiled stylesheets are something that would be different than a parser in
my mind -- wouldn't this be something that sits at the Xalan level?
(Hopefully with those folks at Sun helping out :).

> +0. I'm not sure this is compatible with the first goal. Also, I would
> rather have performance and *scaleable* memory footprint prioritized over
> jar file size. However, the Xalan project does need this...

Ok -- fair enough. I'd also prioritize modularization over this if we take
it as a good thing to be able to build out into a set of jars where a small
non-validating SAX-only parser could be intuitively and quickly thrown
together for a particular application (without a specialized build or diving
into the code). This would satisfy quite a bit of my needs as far as jar
size.

>> * Modular. It should be possible to build a parser as a set of Jar files
>> so that a smaller parser can be assembled which fits the need of a
>> particular implementation. For example, in TV sets do you really need
>> validation?
>
> +0, or +1, depending on how you read this. You may not need validation,
> but you may indeed need schema processing for data types, entity refs, etc.

Right.
What I'm thinking is a build target that produces:

    parser-core.jar
    validator-dtd.jar
    validator-schema.jar
    producer-sax.jar
    producer-domrw.jar
    producer-domro.jar
    producer-jdom.jar

Then the person that needs a non-validating SAX parser grabs parser-core and
producer-sax and goes on, leaving the other parts behind.

>> * Cleanly Optimized. This means optimized in a way that is compatible
>> with modern virtual machines such as HotSpot. Optimizations that work
>> well with JDK 1.1 style VMs can actually impact performance under
>> more modern VMs. Optimizations that interfere with readability,
>> modularity, or size will be shunned.
>
> -0 or +1, depending on how you read this. Is it, or is it not, a
> requirement to have good performance with JDK 1.1, or even backwards
> compatibility? If not, then I think, sure, let's optimize in a way that is
> cleanly compatible with "modern" VMs.

I think that we have a perfectly good parser answer for JDK 1.1 in the form
of Xerces 1.0.x -- I would actually make our target goal 1.2/1.3, with 1.1
compatibility (possibly; I'd actually really like to push forward so that
collections and whatever else can be used). What this means is that instead
of having to carry a specialized Hashtable (as the one in 1.1 sucked), you
use Hashtable and know that on 1.2/1.3 the Hashtable impl is greatly better,
and you're happy that it works fine on 1.1, even if it's not as performant.

> Big +1. I would like to see this done independent of any next-gen work,
> for availability to Xalan 2.0 and other projects, sooner, rather than
> later.

Ok. Should I propose an apache-auc module to the joint jakarta/xml efforts
to collect these sorts of things? We've talked about it on the jakarta lists
and said resoundingly "YES" but didn't know how others feel. I think that if
there's a loud "YES" here, we can make headway. And most of my interest
really lies at the AUC type level.

>> * Refactor out a base parser.
>> Once we see how those APIs should look (or at least get a start, they
>> don't have to be perfect :) we start at the bottom and look at the code
>> of the existing parsers to come up with a basic non-validating parser
>> that can rip through XML.
>
> -1. I think there is enough knowledge at this point to first put together
> a pretty complete design, with a clear understanding of how schema
> processing should work with the base parser (maybe they shouldn't -- but I
> would argue that point). Hard problem, in my opinion, and more design
> rather than less would result in a better idea of what a base parser
> should be.

Ok. I think that the API discussion that has started (the one with the
diagram and no APIs :) is a start on that. Once that gets to a certain
point, then I think that we should get some code rolling though.

>> * Set SAX on top of this base parser. Of course.
>
> +1. However, I think there is likely clear evidence that it may benefit
> certain high-performance applications to have a much tighter binding to
> the parser than SAX supports. A particular problem is the way that SAX2
> treats character data, and the fact that it's an event-only API, rather
> than having by-request characteristics (i.e. parse-next type
> functionality, so you can run an incremental parse/transform without
> having to run an extra thread).

I know that there are others with other opinions on this; I'll defer to
them.

>> * Look at pluggable validation.
>
> +1, but validation is not the same thing as basic schema and DTD
> processing. Data-types, entity refs, default attributes, etc., tend to be
> required.

Actually, I'd like to see if that's not the case. I'd really like to see
basic schema and DTD processing moved out of the core path so that a
non-validating parser can be put together. This really cuts down on the
critical path when used in a server case where validation is quite
frequently turned off. Same for TV sets or PDAs.
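The server case described above, where validation is turned off at runtime even though the validator code ships in the jar, has a standard expression in the JAXP API (a complement to leaving the validator jars out entirely, as the modular build would allow). A minimal sketch; the class name is illustrative:

```java
import javax.xml.parsers.SAXParserFactory;

public class NonValidatingSetup {

    // Configures a factory for the common server case:
    // namespace-aware parsing with DTD validation switched off.
    public static SAXParserFactory newFactory() {
        SAXParserFactory f = SAXParserFactory.newInstance();
        f.setValidating(false);   // keep validation off the critical path
        f.setNamespaceAware(true);
        return f;
    }

    public static void main(String[] args) {
        System.out.println(newFactory().isValidating());
        // prints: false
    }
}
```

Turning the feature off skips the validation work per parse, but the validator classes still load with the parser; only a modular build keeps them off small devices entirely.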
There is a whole category of apps for which validation is not a need after
development.

> -1 on JDOM for the core. Just my opinion. I don't like it, I think it
> misleads developers about the XML data model, and I would rather not see
> Apache support it.

Producers that sit on top aren't part of the core parser in the currently
circulating diagram... It should be made available as part of a full build
of XRI or whatever we call this thing. Even if we use a SAX++ for the core
internal representation, the SAX producer that produces SAX-only events
should be a pluggable thing that sits on top of the core parser.

.duncan

---------------------------------------------------------------------
In case of troubles, e-mail: webmaster@xml.apache.org
To unsubscribe, e-mail: general-unsubscribe@xml.apache.org
For additional commands, e-mail: general-help@xml.apache.org