Mailing-List: contact general-help@xml.apache.org; run by ezmlm
Delivered-To: mailing list general@xml.apache.org
Subject: Re: parser-next-gen goals, plan, and requirements
To: general@xml.apache.org
From: "Scott Boag/CAM/Lotus"
Date: Wed, 12 Jul 2000 13:59:46 -0400
X-Mailer: Lotus Notes Release 5.0.1 July 16, 1999
MIME-Version: 1.0
Content-type: text/plain; charset=us-ascii

James Duncan Davidson wrote:

> +1 -- note that this could be an "optional" DOM shipped as an external .jar
> file. In fact, I'd like to see as a requirement the ability to build into a
> set of jars that reflects the modules so that it's clear how to assemble a
> stripped down parser for whatever use.

I would like to see this "layered" above the read-write DOM, so that the
read-write DOM would work exactly as well. This would go a long way toward
decreasing testing pain.

> Ok, these are all requirements on the DOM module. Which one, Read-Only,
> Read-Write, or both?

As I said, preferably on the read-only, from which the read-write could be
derived. We'll always have to handle a *pure* DOM interface, without these
features, but it would be nice if performance doesn't massively degrade if
you choose to use a read-write DOM.

> > 3) parse-next function, with added control over buffer size.
>
> Explain more.
> Would this be the ability to feed in an input source that says
> "grab 16K at a time from the underlying stream and feed it into the parser"?

Exactly. Xerces1 does this now, but I don't think you have much control over
the size of the parse block.

> Once again, explain more. I think that basic programming tenets say that
> if I hand a buffer to a consumer, whether via the SAX provider or any other
> provider, I'm not going to muck with it until it's released.

SAX calls a character event like so:

    /**
     * Receive notification of character data inside an element.
     *
     * @param ch The characters.
     * @param start The start position in the character array.
     * @param length The number of characters to use from the
     *               character array.
     * @exception org.xml.sax.SAXException Any SAX exception, possibly
     *            wrapping another exception.
     * @see org.xml.sax.ContentHandler#characters
     */
    public void characters (char ch[], int start, int length)
        throws SAXException;

Typically, you will see the same array reference passed over and over, with
the start argument increasing; then the parser will fill up its buffer and
start from the beginning again. What this means is that the processor has to
copy the characters over to a new array... so you end up with the characters:

1) copied from the original byte stream to a Unicode array with entities
   expanded, etc.,
2) copied from the SAX event buffer to a stable character array/string for
   the source tree (DOM) to hold,
3) processed character-by-character for re-encoding to the result stream.

It would be nice to not do (2) when feasible. Note that, often (80% of the
time?), you end up with the exact array of bytes in the output that you had
in the input... which is one reason why a tight coupling with the parser
could be used to get really fast performance... if the right context is
sensed, the bytes could be copied directly from the input to the output.
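The copy in step (2) falls out of the callback contract above: the parser owns ch[] and may reuse it for later events, so a consumer that wants to keep the text must copy it out during the event. A minimal sketch (the class name and sample document are illustrative, not from this thread):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class CharCopyDemo {

    // Parses xml and returns its concatenated character data.
    public static String collectText(String xml) throws Exception {
        final StringBuilder sb = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may overwrite ch[] after this call returns,
                // so the data must be copied out now -- this is copy (2).
                sb.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectText("<a>hello <b>world</b></a>"));
        // prints: hello world
    }
}
```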
Note that issues like this are still a far bigger win, in my (potentially
flawed) estimation, than compiling a stylesheet to bytecodes. I've talked to
James Clark about this, who has much longer-ranging experience in this type
of processing than I, and he agrees.

> I've intentionally left out serialization as a discussion point to date as
> it's not parser, it's part of the larger toolset.

Hmm... perhaps we should start the architecture (but not the requirements...
in some sense it's immaterial to the requirements where the feature comes
from) from this bigger picture.

> Compiled stylesheets are something that would be different than a parser in
> my mind -- wouldn't this be something that sits at the Xalan level?

I didn't mean to imply otherwise... but I don't think the microparser would
be at the Xalan level. I could be wrong. From what I could tell by dumping
the Translets jar, they seemed to have a separate parser, not a tightly
bound one. I haven't been able to get to the bottom of where parsing
services fit into the Translets picture, in Sun's vision of it.

> (Hopefully with those folks at Sun helping out :).

No need to duck, I'm hopeful too (as long as it is a collaborative effort).

> Right. What I'm thinking is a build target that produces:
>
> parser-core.jar
> validator-dtd.jar
> validator-schema.jar
> producer-sax.jar
> producer-domrw.jar
> producer-domro.jar
> producer-jdom.jar

Fine.

> I would actually make our target goal 1.2/1.3, with 1.1
> compatibility (possibly, I'd actually really like to push forward so that
> collections and whatever else can be used).

I'm fine with that. I've been toying with using weak references in Xalan2,
which would break 1.1 compatibility. I really need them. Are there any
opinions from the Cocoon folks on this (from the standpoint of Xalan)?

> Ok. Should I propose an apache-auc module to the joint jakarta/xml efforts
> to collect these sorts of things?

Fine by me.
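The weak-reference idea Scott mentions (and requirement 2c earlier in the thread: nodes released when unreferenced, rebuilt on request) can be sketched as a cache whose entries the VM may reclaim, with a rebuild function run on a miss. java.lang.ref arrived in JDK 1.2, which is exactly why this breaks 1.1 compatibility. The names here (WeakNodeCache, the rebuild function, the String stand-in for node data) are hypothetical, not from any Xalan code:

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

public class WeakNodeCache {
    // Node data held weakly: the GC may clear entries under memory pressure.
    private final Map<Integer, WeakReference<String>> cache = new HashMap<>();
    private final IntFunction<String> rebuild; // re-creates a node on a miss

    public WeakNodeCache(IntFunction<String> rebuild) {
        this.rebuild = rebuild;
    }

    public String get(int index) {
        WeakReference<String> ref = cache.get(index);
        String node = (ref == null) ? null : ref.get();
        if (node == null) {
            // Collected, or never built: rebuild and re-cache it.
            node = rebuild.apply(index);
            cache.put(index, new WeakReference<>(node));
        }
        return node;
    }

    public static void main(String[] args) {
        WeakNodeCache cache = new WeakNodeCache(i -> "node#" + i);
        System.out.println(cache.get(3)); // prints: node#3
    }
}
```

The caller never sees whether the node survived in memory or was rebuilt, which is what makes the footprint scalable rather than fixed.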
I need:

1) Probably Heap and HeapObject (though I have not started using this yet).
   Implementation is no big deal, so I'm just as happy to have a proprietary
   one in Xalan.
2) ObjectPool object. Implementation is no big deal, so I'm just as happy to
   have a proprietary one in Xalan.
3) PrefixResolver -- the class that implements this interface can resolve
   prefixes to namespaces -- needed for XPath interface support.
4) QName -- object to represent a qualified name; needs to be shared with
   the Serializer. Currently part of TRaX, should be in SAX.
5) SystemIDResolver -- take a SystemID string and try to turn it into a good
   absolute URL (see org.apache.xerces.utils.URI, which is a great bit of
   work, in my opinion).
6) UnImplNode -- implements Node, Element, NodeList, Document; throws an
   exception on any method that is not overridden. Very useful for
   implementing DOM subsets.
7) DOMBuilder -- takes SAX events and adds the result to a document or
   document fragment.
8) Maybe the string pool -- depending on whether we keep the DTM or not
   (undecided issue).

-scott


James Duncan Davidson
To:
Subject: Re: parser-next-gen goals, plan, and requirements
07/11/2000 06:53 PM
Please respond to general

on 7/11/00 2:42 PM, Scott Boag/CAM/Lotus at Scott_Boag@lotus.com wrote:

> First, I would rather see a list of requirements first, rather than goals.
> The goals below are simply mom and apple pie, in my opinion. The devil's
> in the details.

He he he...

> 1) SAX2, of course.

+1

> 2) Read-only, memory conservative, high performance DOM subset. In some
> ways, this is optional, since the alternative is that the XSLT processor
> implement its own DOM, as it does today. But it would be neat and simpler
> if only one DOM implementation needed to exist.

+1 -- note that this could be an "optional" DOM shipped as an external .jar
file.
In fact, I'd like to see as a requirement the ability to build into a set of
jars that reflects the modules so that it's clear how to assemble a stripped
down parser for whatever use.

> 2a) Document-order indexes or API as a DOM extension. I know of few or
> no conformant XSLT processors that can do without this.
> 2b) [optional] isWhite() method as a DOM extension (pure telling of
> whether or not the text contains non-whitespace), for performance reasons.
> 2c) Some sort of weak reference, where nodes could be released if not
> referenced, and then rebuilt if requested. For performance and memory
> footprint.

Ok, these are all requirements on the DOM module. Which one, Read-Only,
Read-Write, or both?

> 3) parse-next function, with added control over buffer size.

Explain more. Would this be the ability to feed in an input source that says
"grab 16K at a time from the underlying stream and feed it into the parser"?
This puts a requirement on the parser to be able to parse in increments, and
a requirement on all the providers to higher level services to provide data
to their consumers without having the full picture.

> 4) Some sort of way to tell if a SAX char buffer is going to be
> overwritten, so data doesn't have to be copied until this occurs.

Once again, explain more. I think that basic programming tenets say that if
I hand a buffer to a consumer, whether via the SAX provider or any other
provider, I'm not going to muck with it until it's released.

> 5) Serialization support, as is currently in Assaf's classes.

I've intentionally left out serialization as a discussion point to date as
it's not parser, it's part of the larger toolset. In my world view, it seems
that the serialization (or better called output or externalization in my
mind, since serialization carries a specific meaning in the Java sense) sits
on the other side of the producers from the parser in the diagram that I
threw out.
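The "parse-next" behavior discussed above -- the caller requesting the next increment instead of receiving a full push of events, with no extra thread -- is essentially pull parsing, which Java later standardized as StAX. A sketch using that later API (which postdates this thread; the class name and sample document are illustrative):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PullParseDemo {

    // Pulls parse events one at a time and collects the character data.
    public static String pullText(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(xml));
        StringBuilder sb = new StringBuilder();
        while (r.hasNext()) {
            // The caller drives the parse: each next() is one increment,
            // so an incremental parse/transform needs no second thread.
            if (r.next() == XMLStreamConstants.CHARACTERS) {
                sb.append(r.getText());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(pullText("<a>one<b>two</b></a>"));
        // prints: onetwo
    }
}
```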
> 6) Schema data-type support, which will be needed for XSLT2, and Xalan 2.0
> extensions.

Right -- pluggable validators should include Schema, DTD, and possibly Relax
if somebody wants to take a crack at it.

> 7) We should talk about whether XPath should be part of the core XML
> services, rather than part of the XSLT processor.

Yes we should. My initial thoughts are no, but...

> 8) Small core footprint for standalone, compiled stylesheet capability, for
> use on small devices. This would need to include the Serializer. I'm not
> sure if this should really be a separate micro-parser?

Compiled stylesheets are something that would be different than a parser in
my mind -- wouldn't this be something that sits at the Xalan level?
(Hopefully with those folks at Sun helping out :).

> +0. I'm not sure this is compatible with the first goal. Also, I would
> rather have performance and *scaleable* memory footprint prioritized over
> jar file size. However, the Xalan project does need this...

Ok -- fair enough. I'd also prioritize modularization over this if we take
it as a good thing to be able to build out into a set of jars where a small
non-validating SAX-only parser could be intuitively and quickly thrown
together for a particular application (without a specialized build or diving
into the code). This would satisfy quite a bit of my needs as far as jar
size.

>> * Modular. It should be possible to build a parser as a set of Jar files
>> so that a smaller parser can be assembled which fits the need of a
>> particular implementation. For example, in TV sets do you really need
>> validation?
>
> +0, or +1, depending on how you read this. You may not need validation,
> but you may indeed need schema processing for data types, entity refs, etc.

Right.
What I'm thinking is a build target that produces:

    parser-core.jar
    validator-dtd.jar
    validator-schema.jar
    producer-sax.jar
    producer-domrw.jar
    producer-domro.jar
    producer-jdom.jar

Then the person that needs a non-validating SAX parser grabs parser-core and
producer-sax and goes on, leaving the other parts behind.

>> * Cleanly Optimized. This means optimized in a way that is compatible
>> with modern virtual machines such as HotSpot. Optimizations that work
>> well with JDK 1.1 style VMs can actually impact performance under
>> more modern VMs. Optimizations that interfere with readability,
>> modularity, or size will be shunned.
>
> -0 or +1, depending on how you read this. Is it, or is it not, a
> requirement to have good performance with JDK 1.1, or even backwards
> compatibility? If not, then I think, sure, let's optimize in a way that is
> cleanly compatible with "modern" VMs.

I think that we have a perfectly good parser answer for JDK 1.1 in the form
of Xerces 1.0.x -- I would actually make our target goal 1.2/1.3, with 1.1
compatibility (possibly; I'd actually really like to push forward so that
collections and whatever else can be used). What this means is that instead
of having to carry a specialized Hashtable (as the one in 1.1 sucked), you
use Hashtable and know that on 1.2/1.3 the Hashtable impl is greatly better,
and you're happy that it works fine on 1.1, even if it's not as performant.

> Big +1. I would like to see this done independent of any next-gen work,
> for availability to Xalan 2.0 and other projects, sooner, rather than
> later.

Ok. Should I propose an apache-auc module to the joint jakarta/xml efforts
to collect these sorts of things? We've talked about it on the jakarta lists
and said resoundingly "YES" but didn't know how others feel. I think that if
there's a loud "YES" here, we can make headway. And most of my interest
really lies at the AUC type level.

>> * Refactor out a base parser.
>> Once we see how those APIs should look (or at least get a start, they
>> don't have to be perfect :) we start at the bottom and look at the code
>> of the existing parsers to come up with a basic non-validating parser
>> that can rip through XML.
>
> -1. I think there is enough knowledge at this point to first put together
> a pretty complete design, with a clear understanding of how schema
> processing should work with the base parser (maybe they shouldn't -- but I
> would argue that point). Hard problem, in my opinion, and more design
> rather than less would result in a better idea of what a base parser
> should be.

Ok. I think that the API discussion that has started (the one with the
diagram and no APIs :) is a start on that. Once that gets to a certain
point, then I think that we should get some code rolling though.

>> * Set SAX on top of this base parser. Of course.
>
> +1. However, I think there is likely clear evidence that it may benefit
> certain high-performance applications to have a much tighter binding to
> the parser than SAX supports. A particular problem is the way that SAX2
> treats character data, and the fact that it's an event-only API, rather
> than having by-request characteristics (i.e. parse-next type
> functionality, so you can run an incremental parse/transform without
> having to run an extra thread).

I know that there are others with other opinions on this; I'll defer to
them.

>> * Look at pluggable validation.
>
> +1, but validation is not the same thing as basic schema and DTD
> processing. Data-types, entity refs, default attributes, etc., tend to be
> required.

Actually, I'd like to see if that's not the case. I'd really like to see
basic schema and DTD processing moved out of the core path so that a
non-validating parser can be put together. This really cuts down on the
critical path when used in a server case where validation is quite
frequently turned off. Same for TV sets or PDAs.
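The server case described above, where validation is turned off at runtime even though the validator code ships in the jar, has a standard expression in the JAXP API (a complement to leaving the validator jars out entirely, as the modular build would allow). A minimal sketch; the class name is illustrative:

```java
import javax.xml.parsers.SAXParserFactory;

public class NonValidatingSetup {

    // Configures a factory for the common server case:
    // namespace-aware parsing with DTD validation switched off.
    public static SAXParserFactory newFactory() {
        SAXParserFactory f = SAXParserFactory.newInstance();
        f.setValidating(false);   // keep validation off the critical path
        f.setNamespaceAware(true);
        return f;
    }

    public static void main(String[] args) {
        System.out.println(newFactory().isValidating());
        // prints: false
    }
}
```

Turning the feature off skips the validation work per parse, but the validator classes still load with the parser; only a modular build keeps them off small devices entirely.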
There is a whole category of apps for which validation is not a need after
development.

> -1 on JDOM for the core. Just my opinion. I don't like it, I think it
> misleads developers about the XML data model, and I would rather not see
> Apache support it.

Producers that sit on top aren't part of the core parser in the currently
circulating diagram... It should be made available as part of a full build
of XRI or whatever we call this thing. Even if we use a SAX++ for the core
internal representation, the SAX producer that produces SAX-only events
should be a pluggable thing that sits on top of the core parser.

.duncan

---------------------------------------------------------------------
In case of troubles, e-mail: webmaster@xml.apache.org
To unsubscribe, e-mail: general-unsubscribe@xml.apache.org
For additional commands, e-mail: general-help@xml.apache.org