xml-general mailing list archives

From James Duncan Davidson <james.david...@eng.sun.com>
Subject Re: parser-next-gen goals, plan, and requirements
Date Tue, 11 Jul 2000 22:53:12 GMT
on 7/11/00 2:42 PM, Scott Boag/CAM/Lotus at Scott_Boag@lotus.com wrote:

> First, I would rather see a list of requirements first, rather than goals.
> The goals below are simply mom and apple pie, in my opinion.  The devil's
> in the details.

He he he... 

> 1) SAX2, of course.

+1

> 2) Read-only, memory conservative, high performance DOM subset.  In some
> ways, this is optional, since the alternative is that the XSLT processor
> implement its own DOM, as it does today.  But it would be neat and simpler
> if only one DOM implementation needed to exist.

+1 -- note that this could be an "optional" DOM shipped as an external .jar
file. In fact, I'd like to see as a requirement the ability to build into a
set of jars that reflects the modules, so that it's clear how to assemble a
stripped-down parser for whatever use.

> 2a) Document-order indexes or API as a DOM extension.  I know of few or
> no conformant XSLT processors that can do without this.
> 2b) [optional] isWhite() method as a DOM extensions (pure telling of
> whether or not the text contains non-whitespace), for performance reasons.
> 2c) Some sort of weak reference, where nodes could be released if not
> referenced, and then rebuilt if requested.  For performance and memory
> footprint.

Ok, these are all requirements on the DOM module. Which one, Read-Only,
Read-Write, or both?
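For 2c, here's a minimal sketch of the weak-reference idea, assuming nodes
can be rebuilt on demand from their document-order index. All names here are
hypothetical illustration, not a proposed API:

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: nodes are held via SoftReference so the VM may
// reclaim them under memory pressure; get() rebuilds a cleared node
// from its document-order index via a caller-supplied rebuilder.
public class NodeCache {
    private final Map<Integer, SoftReference<String>> nodes = new HashMap<>();
    private final Function<Integer, String> rebuilder;

    public NodeCache(Function<Integer, String> rebuilder) {
        this.rebuilder = rebuilder;
    }

    public String get(int index) {
        SoftReference<String> ref = nodes.get(index);
        String node = (ref == null) ? null : ref.get();
        if (node == null) {
            node = rebuilder.apply(index);   // re-parse / rebuild on demand
            nodes.put(index, new SoftReference<>(node));
        }
        return node;
    }
}
```

The interesting design question is what the rebuilder costs: if it has to
re-parse from the start of the document, the memory win may not be worth it.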

> 3) parse-next function, with added control over buffer size.

Explain more. Would this be the ability to feed in an input source that says
"grab 16K at a time from the underlying stream and feed it into the parser"?
This puts a requirement on the parser to be able to parse in increments,
and a requirement on all the providers of higher-level services to hand
data to their consumers without having the full picture.
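To make the contrast concrete: SAX2 only pushes events at you, while the
parse-next model lets the caller pull one event at a time and stop. A pull
API of roughly the shape Scott describes was later standardized as
javax.xml.stream; shown here purely to illustrate the idea, not as the
proposed design:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Pull-style parsing: the caller asks for the next event when it wants
// one, so an incremental parse/transform needs no extra thread.
public class PullParseDemo {
    public static String firstElementName(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT) {
                return r.getLocalName();  // pull one event and stop here
            }
        }
        return null;
    }
}
```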

> 4) Some sort of way to tell if a SAX char buffer is going to be
> overwritten, so data doesn't have to be copied until this occurs.

Once again, explain more. I think that basic programming tenets say that
if I hand a buffer to a consumer, whether via the SAX provider or any other
provider, I'm not going to muck with it until it's released.

> 5) Serialization support, as is currently in Assaf's classes.

I've intentionally left out serialization as a discussion point to date, as
it's not parser; it's part of the larger toolset. In my world view,
serialization (or better called output or externalization, since
serialization carries a specific meaning in the Java sense) sits on the
other side of the producers from the parser in the diagram that I threw
out.

> 6) Schema data-type support, which will be needed for XSLT2, and Xalan 2.0
> extensions.

Right -- pluggable validators should include Schema, DTD, and possibly
Relax if somebody wants to take a crack at it.

> 7) We should talk about whether XPath should be part of the core XML
> services, rather than part of the XSLT processor.

Yes we should. My initial thoughts are no, but...

> 8) Small core footprint for standalone, compiled stylesheet capability, for
> use on small devices.  This would need to include the Serializer.  I'm not
> sure if this should really be a separate micro-parser?

Compiled stylesheets are something that would be different from a parser in
my mind -- wouldn't this be something that sits at the Xalan level?
(Hopefully with those folks at Sun helping out <ducking>:).

> +0.  I'm not sure this is compatible with the first goal.  Also, I would
> rather have performance and *scaleable* memory footprint prioritized over
> jar file size.  However, the Xalan project does need this...

Ok -- fair enough. I'd also prioritize modularization over this, if we take
it as a good thing to be able to build out into a set of jars where a small
non-validating, SAX-only parser could be intuitively and quickly thrown
together for a particular application (without a specialized build or diving
into the code). This would satisfy quite a bit of my needs as far as jar
size.

>> * Modular. It should be possible to build a parser as a set of Jar
>> files so that a smaller parser can be assembled which fits the need
>> of a particular implementation. For example, in TV sets do you really
>> need validation?
> 
> +0, or +1, depending on how you read this.  You may not need validation,
> but you may indeed need schema processing for data types, entity refs, etc.

Right. What I'm thinking is a build target that produces:

    parser-core.jar
    validator-dtd.jar
    validator-schema.jar
    producer-sax.jar
    producer-domrw.jar
    producer-domro.jar
    producer-jdom.jar

Then the person who needs a non-validating SAX parser grabs parser-core and
producer-sax and goes on, leaving the other parts behind.

>> * Cleanly Optimized. This means optimized in a way that is compatible
>> with modern virtual machines such as HotSpot. Optimizations that work
>> well with JDK 1.1 style VMs can actually impact performance under
>> more modern VMs. Optimizations that interfere with readability,
>> modularity, or size will be shunned.
> 
> -0 or +1, depending on how you read this.  Is it, or is it not, a
> requirement to have good performance with JDK 1.1, or even backwards
> compatibility?  If not, then I think, sure, let's optimize in a way that is
> cleanly compatible with "modern" VMs.

I think that we have a perfectly good parser answer for JDK 1.1 in the form
of Xerces 1.0.x -- I would actually make our target 1.2/1.3, possibly with
1.1 compatibility (though I'd really like to push forward so that
collections and whatever else can be used).

What this means is that instead of having to carry a specialized Hashtable
(as the one in 1.1 sucked), you use Hashtable and know that on 1.2/1.3 the
Hashtable impl is much better, and you're happy that it works fine on 1.1,
even if it's not as performant.
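And once 1.2 collections are on the table, single-threaded parser internals
can use the unsynchronized HashMap instead of paying Hashtable's per-call
lock. A hypothetical illustration (not proposed parser code): a symbol
table interning element names to small integer ids, the kind of structure a
parser carries around:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example: HashMap (JDK 1.2 collections) drops Hashtable's
// synchronization, so single-threaded parser code pays no lock cost.
public class SymbolTable {
    private final Map<String, Integer> names = new HashMap<String, Integer>();

    // Intern a name to a dense integer id; repeated names get the same id.
    public int intern(String name) {
        Integer id = names.get(name);
        if (id == null) {
            id = Integer.valueOf(names.size());
            names.put(name, id);
        }
        return id.intValue();
    }
}
```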

> Big +1.  I would like to see this done independent of any next-gen work,
> for availability to Xalan 2.0 and other projects, sooner, rather than
> later.

Ok. Should I propose an apache-auc module to the joint jakarta/xml efforts
to collect these sorts of things? We've talked about it on the jakarta lists
and the answer was a resounding "YES", but I don't know how others here
feel. I think that if there's a loud "YES" here, we can make headway.

And most of my interest really lies at the AUC type level.

>> * Refactor out a base parser. Once we see how those APIs should look
>> (or at least get a start, they don't have to be perfect :) we start
>> at the bottom and look at the code of the existing parsers to come up
>> with a basic non-validating parser that can rip through XML.
> 
> -1.  I think there is enough knowledge at this point to first put together
> a pretty complete design, with a clear understanding of how schema
> processing should work with the base parser (maybe they shouldn't -- but I
> would argue that point).  Hard problem, in my opinion, and more design
> rather than less would result in a better idea of what a base parser should
> be.

Ok. I think that the API discussion that has started (the one with the
diagram and no APIs :) is a start on that. Once that gets to a certain
point, I think we should get some code rolling, though.

>> * Set SAX on top of this base parser. Of course.
> 
> +1. However, I think there is likely clear evidence that it may benefit
> certain high-performance applications to have a much tighter binding to the
> parser than SAX supports.  A particular problem is the way that SAX2 treats
> character data, and the fact that it's an event-only API, rather than
> having by-request characteristics (i.e. parse-next type functionality, so
> you can run an incremental parse/transform without having to run an extra
> thread).

I know that there are others with other opinions on this, I'll defer to
them.

>> * Look at pluggable validation.
> 
> +1, but validation is not the same thing as basic schema and DTD
> processing.  Data-types, entity refs, default attributes, etc., tend to be
> required.

Actually, I'd like to see if that's not the case. I'd really like to see
basic schema and DTD processing moved out of the core path so that a
non-validating parser can be put together.

This really cuts down on the critical path in the server case, where
validation is quite frequently turned off. Same for TV sets or PDAs. There's
a whole category of apps for which validation is not a need after
development.
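Today "turning validation off" is just a feature flag on a monolithic
parser; the validating code still ships and still sits on the classpath. The
proposal is that it wouldn't even be built in. The current state, using the
real SAX2/JAXP API for contrast:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Standard JAXP: validation is disabled via a flag, but the validator
// classes are still present in the jar -- the modular proposal would
// leave them out of the build entirely.
public class NonValidatingParse {
    public static boolean parse(String xml) throws Exception {
        SAXParserFactory f = SAXParserFactory.newInstance();
        f.setValidating(false);  // flag off, code still shipped
        XMLReader r = f.newSAXParser().getXMLReader();
        r.parse(new InputSource(new StringReader(xml)));
        return true;  // well-formedness checked, no validation performed
    }
}
```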

> -1 on JDOM for the core.  Just my opinion.  I don't like it, I think it
> misleads developers about the XML data model, and I would rather not see
> Apache support it.

Producers that sit on top aren't part of the core parser in the currently
circulating diagram... JDOM should be made available as part of a full build
of XRI or whatever we call this thing. Even if we use a SAX++ for the core
internal representation, the producer that emits SAX events should be a
pluggable thing that sits on top of the core parser.

.duncan

