Subject: Re: parser-next-gen goals, plan, and requirements
To: general@xml.apache.org
From: "Scott Boag/CAM/Lotus"
Date: Tue, 11 Jul 2000 17:42:10 -0400

I would rather see a list of requirements first, rather than goals.
The goals below are simply mom and apple pie, in my opinion. The
devil's in the details.

Xalan XSLT Processor Requirements (or requests) on the Parser (my
opinions):

1) SAX2, of course.

2) Read-only, memory conservative, high performance DOM subset. In some
ways, this is optional, since the alternative is that the XSLT processor
implement its own DOM, as it does today. But it would be neater and
simpler if only one DOM implementation needed to exist.

2a) Document-order indexes or API as a DOM extension. I know of few or
no conformant XSLT processors that can do without this.

2b) [optional] isWhite() method as a DOM extension (a plain test of
whether or not the text contains non-whitespace), for performance
reasons.

2c) Some sort of weak reference, where nodes could be released if not
referenced, and then rebuilt if requested. This is for performance and
memory footprint.
3) parse-next function, with added control over buffer size.

4) Some sort of way to tell if a SAX char buffer is going to be
overwritten, so data doesn't have to be copied until this occurs.

5) Serialization support, as is currently in Assaf's classes.

6) Schema data-type support, which will be needed for XSLT2, and Xalan
2.0 extensions.

7) We should talk about whether XPath should be part of the core XML
services, rather than part of the XSLT processor.

8) Small core footprint for standalone, compiled stylesheet capability,
for use on small devices. This would need to include the Serializer.
I'm not sure whether this should really be a separate micro-parser.

> GOALS:
>
> * Simple to read, maintainable code. Above all, this is the primary goal
>   for any openly developed project as without the ability to read the
>   code, it's impossible for people to contribute and get involved.

+1.

> * Smallest possible size. This means small distribution size (JAR file)
>   and small memory footprint.

+0. I'm not sure this is compatible with the first goal. Also, I would
rather have performance and *scalable* memory footprint prioritized over
jar file size. However, the Xalan project does need this...

Also, I would like to see packaging options to address the jar file
size. I suspect Xerces today could be packaged to a much smaller
footprint, if only the base features were used. As I said above, perhaps
a separate code-base for a micro parser would be a better option, with
support for an XML subset.

> * Modular. It should be possible to build a parser as a set of Jar files
>   so that a smaller parser can be assembled which fits the need of a
>   particular implementation. For example, in TV sets do you really need
>   validation?

+0, or +1, depending on how you read this. You may not need validation,
but you may indeed need schema processing for data types, entity refs,
etc.

> * Cleanly Optimized.
>   This means optimized in a way that is compatible
>   with modern virtual machines such as HotSpot. Optimizations that work
>   well with JDK 1.1 style VMs can actually impact performance under
>   more modern VMs. Optimizations that interfere with readability,
>   modularity, or size will be shunned.

-0 or +1, depending on how you read this. Is it, or is it not, a
requirement to have good performance with JDK 1.1, or even backwards
compatibility? If not, then I think, sure, let's optimize in a way that
is cleanly compatible with "modern" VMs.

> * First, factor out utility classes from both the Xerces and Crimson
>   source bases. There is a lot of good work on things like the Xerces
>   decoders which are faster than the JDK's. This is actually the start
>   of an Apache wide common utility set (something that I'd like
>   to see in the future as AUC -- Apache Utility Classes). We've talked
>   about this before in other Apache projects, and there's a lot of
>   good code that we can start it off with here.

Big +1. I would like to see this done independently of any next-gen
work, so that it is available to Xalan 2.0 and other projects sooner
rather than later.

> * Determine what the modular API looks like. What are the various
>   pieces that can be factored out. How can we get to a point where it's
>   easy to package a parser that doesn't include DOM or a particular
>   validator? There's some work started on a branch, but it hasn't
>   been touched in a month or so. This might serve as a start place.

+1.

> * Refactor out a base parser. Once we see how those APIs should look (or
>   at least get a start, they don't have to be perfect :) we start at
>   the bottom and look at the code of the existing parsers to come up
>   with a basic non-validating parser that can rip through XML.

-1.
I think there is enough knowledge at this point to first put together a
pretty complete design, with a clear understanding of how schema
processing should work with the base parser (maybe they shouldn't work
together -- but I would argue that point). This is a hard problem, in my
opinion, and more design rather than less would give a better idea of
what a base parser should be.

> * Set SAX on top of this base parser. Of course.

+1. However, I think the evidence suggests that certain high-performance
applications may benefit from a much tighter binding to the parser than
SAX supports. Particular problems are the way that SAX2 treats character
data, and the fact that it's an event-only API, rather than having
by-request characteristics (i.e. parse-next type functionality, so you
can run an incremental parse/transform without having to run an extra
thread).

> * Look at pluggable validation.

+1, but validation is not the same thing as basic schema and DTD
processing. Data-types, entity refs, default attributes, etc., tend to
be required.

> * Factor in tree based producers. We'd like to see DOM and JDOM up
>   front.

-1 on JDOM for the core. Just my opinion. I don't like it, I think it
misleads developers about the XML data model, and I would rather not see
Apache support it.

> * Stability. By this point, we should have something that is starting
>   to work well. Stability will be a driving goal then.

Sure.

-scott
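
To make the character-data point (request 4, and the SAX2 comment above)
a bit more concrete, here is a rough sketch of what a SAX2 consumer has
to do today. It uses only the standard SAX2/JAXP calls; the class name
SaxTextDemo and the helpers isWhite() and collectText() are made up for
illustration (isWhite() is just my guess at what request 2b could look
like on a char range, not an existing API). The point is the forced
copy inside characters(): the handler may only read the buffer during
the callback, so anything it wants to keep has to be copied right away.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTextDemo {
    // Hypothetical helper in the spirit of request 2b: returns true if
    // the given range contains only XML whitespace characters.
    static boolean isWhite(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
                return false;
            }
        }
        return true;
    }

    // Collects every non-whitespace text run reported by the parser.
    // A parser may split one text node across several callbacks.
    static List<String> collectText(String xml) throws Exception {
        final List<String> runs = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                if (!isWhite(ch, start, length)) {
                    // The copy that request 4 would like to defer: SAX
                    // gives no way to ask whether 'ch' will be reused.
                    runs.add(new String(ch, start, length));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return runs;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectText("<a> <b>hello</b>\n<c>world</c></a>"));
    }
}
```

With a "will this buffer be overwritten?" signal from the parser, the
new String(...) above could be put off until the buffer is actually
about to be reused, which is where I would expect the savings to come
from.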