xml-general mailing list archives

From Costin Manolache <cos...@eng.sun.com>
Subject Re: parser-next-gen goals, plan, and requirements
Date Wed, 12 Jul 2000 00:31:12 GMT
> > 2) Read-only, memory conservative, high performance DOM subset.  In some
> > ways, this is optional, since the alternative is that the XSLT processor
> > implement its own DOM, as it does today.  But it would be neat and simpler
> > if only one DOM implementation needed to exist.
> +1 -- note that this could be an "optional" DOM shipped as an external .jar
> file. In fact, I'd like to see as a requirement the ability to build into a
> set of jars that reflects the modules so that it's clear how to assemble a
> stripped down parser for whatever use.

I think it's a good idea to have multiple DOM modules, but if it is possible
to implement whatever extensions xalan requires in the default DOM
(or in the internal APIs), we should do it.

It seems the xalan-2 DOM is very clean and can easily be moved into
spinnaker as a module.

> > 3) parse-next function, with added control over buffer size.
> Explain more. Would this be the ability to feed in an input source that says
> "grab 16K at a time from the underlying stream and feed it into the parser"?
> This puts a requirement on the parser to be able to parse in increments,
> and a requirement on all the providers to higher level services to provide
> data to their consumers without having the full picture.

I guess it's a good point - you should be able to parse the document in
an iterative way:

parseNext(ParseState)  will parse the next element or char chunk.

SAX is a great API, but this kind of API may be much better as
an internal API.
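
A minimal sketch of what such an iterative call could look like. Only the
name parseNext(ParseState) comes from the text above; the ParseState fields,
the IncrementalParser class, and the "tag or text chunk as a String" return
value are assumptions made for illustration, not an existing xerces or
spinnaker API.

```java
/** Hypothetical state object: the input plus the current read position. */
final class ParseState {
    final String input;
    int pos = 0;

    ParseState(String input) {
        this.input = input;
    }
}

/** Hypothetical incremental parser: each call consumes one tag or one run of text. */
final class IncrementalParser {

    /** Returns the next tag (e.g. "<a>") or character chunk, or null at end of input. */
    static String parseNext(ParseState state) {
        String in = state.input;
        if (state.pos >= in.length()) {
            return null;
        }
        int start = state.pos;
        int end;
        if (in.charAt(start) == '<') {
            // A tag runs up to and including the next '>'.
            int close = in.indexOf('>', start);
            end = (close < 0) ? in.length() : close + 1;
        } else {
            // Character data runs up to the next '<'.
            int open = in.indexOf('<', start);
            end = (open < 0) ? in.length() : open;
        }
        state.pos = end;
        return in.substring(start, end);
    }
}
```

The caller drives the loop and can stop between any two calls, which is the
main difference from a SAX-style push API.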

One very interesting extension of this would be something
like parseAtOffset( int off), which would read the next element
starting at a given file offset. Combined with a cache, this
may save us from keeping very large documents in memory.

We should explore this !

> > 4) Some sort of way to tell if a SAX char buffer is going to be
> > overwritten, so data doesn't have to be copied until this occurs.

We have a similar problem in tomcat (trying to avoid
copies); one way to resolve it would be to expose the
buffer via the internal API.

I think buffering and caching are vital to achieve performance
( by design , no pre-optimization here :-), and we should have full
control over that. Assuming a (pool of) 4k buffers is used to
read, you should be able to pin a buffer or be notified when
the buffer changes.

( it may sound complex, but it would be great to have - maybe
as a goal, not a requirement )
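
A minimal sketch of the pinning idea, assuming a pool that simply refuses to
recycle a buffer while someone holds a pin on it. PinnableBuffer, BufferPool
and all of their methods are hypothetical names for illustration; nothing
here is taken from tomcat or xerces.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical read buffer that a consumer can pin to keep its contents valid. */
final class PinnableBuffer {
    final char[] data;
    private int pins = 0;

    PinnableBuffer(int size) {
        data = new char[size];
    }

    void pin() {
        pins++;
    }

    void unpin() {
        if (pins == 0) {
            throw new IllegalStateException("buffer is not pinned");
        }
        pins--;
    }

    boolean isPinned() {
        return pins > 0;
    }
}

/** Hypothetical pool: hands out free buffers and never recycles a pinned one. */
final class BufferPool {
    private final Deque<PinnableBuffer> free = new ArrayDeque<>();
    private final int bufferSize;

    BufferPool(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    /** Reuses a free buffer if there is one, otherwise allocates a new one. */
    PinnableBuffer acquire() {
        PinnableBuffer b = free.poll();
        return (b != null) ? b : new PinnableBuffer(bufferSize);
    }

    /** Returns true if the buffer was recycled, false if a pin kept it alive. */
    boolean release(PinnableBuffer b) {
        if (b.isPinned()) {
            return false;
        }
        free.push(b);
        return true;
    }
}
```

A consumer that wants to keep characters without copying pins the buffer; the
reader then takes another buffer from the pool instead of overwriting it.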

> > Big +1.  I would like to see this done independent of any next-gen work,
> > for availability to Xalan 2.0 and other projects, sooner, rather than
> > later.
> Ok. Should I propose an apache-auc module to the joint jakarta/xml efforts
> to collect these sorts of things? We've talked about it on the jakarta lists
> and said resoundingly "YES" but didn't know how others feel. I think that if
> there's a loud "YES" here, that we can make headway.
> And most of my interest really lies at the AUC type level

That would be great - there is a lot of great code in all apache projects,
not only xerces, but also tomcat ( thread pools, logger, etc),  and so on,
and it will be really great if we can reuse code from one project to another.

StringTable ( or StringPool ) will provide a great benefit in tomcat for
example, assuming we could clean up the interfaces a bit.
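
As a rough illustration of the kind of interface meant here, a string pool
can map each distinct string to a small int handle, so that repeated names
are stored once and compared as ints. The SimpleStringPool class below and
its method names are assumptions for this sketch, not the actual xerces
StringPool API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical string pool: each distinct string gets one stable int handle. */
final class SimpleStringPool {
    private final Map<String, Integer> handles = new HashMap<>();
    private final List<String> strings = new ArrayList<>();

    /** Returns the existing handle for s, or assigns the next free one. */
    int intern(String s) {
        Integer existing = handles.get(s);
        if (existing != null) {
            return existing;
        }
        int handle = strings.size();
        strings.add(s);
        handles.put(s, handle);
        return handle;
    }

    /** Returns the string stored under the given handle. */
    String lookup(int handle) {
        return strings.get(handle);
    }

    /** Number of distinct strings in the pool. */
    int size() {
        return strings.size();
    }
}
```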

This may also be a great way to keep alive some of the ( great IMHO )
1.1 optimizations that are now part of xerces, for example as a set of
1.1 modules. We will need a good set of interfaces, but it will have many
benefits : we may end up writing modules optimized for various
configurations ( low memory, embedded, JITs), and do that without
adding any complexity to the projects that use them.

( another good example is a common Resource/Messages/whatever
module for I18N, and a common logger ).

