Mailing-List: contact general-help@xml.apache.org; run by ezmlm
Subject: Re: parser-next-gen goals, plan, and requirements
To: general@xml.apache.org
Message-ID: <OF6E0C2891.7094E4D6-ON85256919.004C848C@lotus.com>
From: "Scott Boag/CAM/Lotus" <Scott_Boag@lotus.com>
Date: Tue, 11 Jul 2000 17:42:10 -0400
MIME-Version: 1.0
Content-type: text/plain; charset=us-ascii


I would rather see a list of requirements first, rather than goals.
The goals below are simply mom and apple pie, in my opinion.  The devil's
in the details.

Xalan XSLT Processor Requirements (or requests) on the Parser (my
opinions):

1) SAX2, of course.
2) Read-only, memory conservative, high performance DOM subset.  In some
ways, this is optional, since the alternative is for the XSLT processor to
implement its own DOM, as it does today.  But it would be neater and simpler
if only one DOM implementation needed to exist.
  2a) Document-order indexes or API as a DOM extension.  I know of few or
no conformant XSLT processors that can do without this.
  2b) [optional] isWhite() method as a DOM extension (simply tells
whether or not the text contains non-whitespace), for performance reasons.
  2c) Some sort of weak reference, where nodes could be released if not
referenced, and then rebuilt if requested.  For performance and memory
footprint.
3) parse-next function, with added control over buffer size.
4) Some sort of way to tell if a SAX char buffer is going to be
overwritten, so data doesn't have to be copied until this occurs.
5) Serialization support, as is currently in Assaf's classes.
6) Schema data-type support, which will be needed for XSLT2, and Xalan 2.0
extensions.
7) We should talk about whether XPath should be part of the core XML
services, rather than part of the XSLT processor.
8) Small core footprint for standalone, compiled stylesheet capability, for
use on small devices.  This would need to include the Serializer.  I'm not
sure if this should really be a separate micro-parser?
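
To make 2a and 2b concrete, here is a minimal Java sketch under stated
assumptions.  IndexedNode, docOrder, assignDocOrder and compareDocOrder are
names made up for illustration; they are not part of the W3C DOM, Xerces, or
Xalan.  The sketch numbers nodes in one pre-order walk so that a
document-order comparison becomes an integer compare, and it answers
isWhite() by scanning for the four XML whitespace characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical read-only node type, invented for this sketch only.
final class IndexedNode {
    final String name;   // element name, or "#text" for a text node
    final String text;   // character data for a text node, null otherwise
    final List<IndexedNode> children = new ArrayList<>();
    int docOrder = -1;   // document-order index; -1 until assigned

    IndexedNode(String name, String text) {
        this.name = name;
        this.text = text;
    }

    // Appends a child and returns this node so calls can be chained.
    IndexedNode add(IndexedNode child) {
        children.add(child);
        return this;
    }

    // 2b: true only for a text node with no non-whitespace characters.
    // Returning false for non-text nodes is an assumption of this sketch.
    boolean isWhite() {
        if (text == null) {
            return false;
        }
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != ' ' && c != '\t' && c != '\r' && c != '\n') {
                return false;
            }
        }
        return true;
    }

    // 2a: pre-order walk that numbers each node; returns the next free index.
    static int assignDocOrder(IndexedNode node, int next) {
        node.docOrder = next++;
        for (IndexedNode child : node.children) {
            next = assignDocOrder(child, next);
        }
        return next;
    }

    // Negative if a precedes b in document order, positive if it follows.
    static int compareDocOrder(IndexedNode a, IndexedNode b) {
        return Integer.compare(a.docOrder, b.docOrder);
    }
}
```

Because the tree is read-only, the indexes only need to be assigned once,
at build time.  A mutable DOM would have to renumber or use a sparser
numbering scheme, which is part of why this is asked for on the read-only
subset.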


> GOALS:
>
>     * Simple to read, maintainable code. Above all, this is the primary
>       goal for any openly developed project as without the ability to
>       read the code, it's impossible for people to contribute and get
>       involved.

+1.

>     * Smallest possible size. This means small distribution size (JAR
>       file) and small memory footprint.

+0.  I'm not sure this is compatible with the first goal.  Also, I would
rather have performance and *scalable* memory footprint prioritized over
jar file size.  However, the Xalan project does need this...

Also, I would like to see packaging options to address the jar file size.
I suspect Xerces today could be packaged to a much smaller footprint, if
only the base features were used.

As I said above, perhaps a separate code-base for a micro parser would be a
better option, with support for an XML subset.

>     * Modular. It should be possible to build a parser as a set of Jar
>       files so that a smaller parser can be assembled which fits the
>       need of a particular implementation. For example, in TV sets do
>       you really need validation?

+0, or +1, depending on how you read this.  You may not need validation,
but you may indeed need schema processing for data types, entity refs, etc.

>    * Cleanly Optimized. This means optimized in a way that is compatible
>      with modern virtual machines such as HotSpot. Optimizations that
>      work well with JDK 1.1 style VMs can actually impact performance
>      under more modern VMs. Optimizations that interfere with
>      readability, modularity, or size will be shunned.

-0 or +1, depending on how you read this.  Is it, or is it not, a
requirement to have good performance with JDK 1.1, or even backwards
compatibility?  If not, then I think, sure, let's optimize in a way that is
cleanly compatible with "modern" VMs.

>   * First, factor out utility classes from both the Xerces and Crimson
>       source bases. There is a lot of good work on things like the Xerces
>       decoders which are faster than the JDK's. This is actually the
>       start of an Apache wide common utility set (something that I'd like
>       to see in the future as AUC -- Apache Utility Classes). We've
>       talked about this before in other Apache projects, and there's a
>       lot of good code that we can start it off with here.

Big +1.  I would like to see this done independent of any next-gen work,
for availability to Xalan 2.0 and other projects, sooner, rather than
later.

>     * Determine what the modular API looks like. What are the various
>       pieces that can be factored out. How can we get to a point where
>       it's easy to package a parser that doesn't include DOM or a
>       particular validator? There's some work started on a branch, but
>       it hasn't been touched in a month or so. This might serve as a
>       start place.

+1.

>    * Refactor out a base parser. Once we see how those APIs should look
>      (or at least get a start, they don't have to be perfect :) we start
>      at the bottom and look at the code of the existing parsers to come
>      up with a basic non-validating parser that can rip through XML.

-1.  I think there is enough knowledge at this point to first put together
a pretty complete design, with a clear understanding of how schema
processing should work with the base parser (maybe it shouldn't -- but I
would argue that point).  This is a hard problem, in my opinion, and more
design rather than less would give a better idea of what a base parser
should be.

>    * Set SAX on top of this base parser. Of course.

+1. However, I think there is good evidence that certain high-performance
applications would benefit from a much tighter binding to the parser than
SAX supports.  A particular problem is the way that SAX2 treats character
data, and the fact that it's an event-only API, rather than having
by-request characteristics (i.e. parse-next type functionality, so you can
run an incremental parse/transform without having to run an extra
thread).
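
As a rough illustration of what "by-request" means here, the sketch below
uses the StAX cursor API (javax.xml.stream), a later Java API that is not
part of SAX2 or of any parser discussed in this thread.  It is only meant to
show the shape of the control flow: the caller asks for the next event, so
it can stop, do some transform work, and resume on the same thread.
PullDemo and elementNames are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; only the javax.xml.stream calls are real API.
final class PullDemo {
    // Returns element names in document order, pulling one event at a time.
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                // The consumer drives the loop; with SAX the parser would
                // call back into a handler until the document ended.
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        names.add(reader.getLocalName());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped so callers of this sketch need no checked exception.
            throw new IllegalStateException(e);
        }
        return names;
    }
}
```

The loop could just as well return after each event and be re-entered
later, which is the property an incremental parse/transform needs.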

>    * Look at pluggable validation.

+1, but validation is not the same thing as basic schema and DTD
processing.  Data-types, entity refs, default attributes, etc., tend to be
required.
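
A small sketch of the distinction, using the standard JAXP DOM API with
validation switched off.  The document, the class name DtdDefaultsDemo and
the method parseRoot are made up for illustration.  The internal DTD subset
declares a default attribute and an entity; a non-validating parser is
still expected to apply both, which is the "basic DTD processing" meant
above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; the JAXP and DOM calls are standard API.
final class DtdDefaultsDemo {
    // Internal subset only: one defaulted attribute, one internal entity.
    static final String XML =
          "<!DOCTYPE doc [\n"
        + "  <!ATTLIST doc version CDATA \"1.0\">\n"
        + "  <!ENTITY proj \"Xalan\">\n"
        + "]>\n"
        + "<doc>&proj;</doc>";

    // Parses XML without validation and returns the root element.
    static Element parseRoot() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // no validity checking requested
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(XML)))
                          .getDocumentElement();
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked exception.
            throw new IllegalStateException(e);
        }
    }
}
```

With validation off, parseRoot().getAttribute("version") should still give
"1.0" and the text content should be "Xalan", so a stylesheet sees the
defaulted attribute and the expanded entity either way.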

>    * Factor in tree based producers. We'd like to see DOM and JDOM up
>      front.

-1 on JDOM for the core.  Just my opinion.  I don't like it, I think it
misleads developers about the XML data model, and I would rather not see
Apache support it.

>    * Stability. By this point, we should have something that is starting
>      to work well. Stability will be a driving goal then.

Sure.

-scott