xml-general mailing list archives

From James Duncan Davidson <james.david...@eng.sun.com>
Subject [spinnaker] Announce
Date Sat, 08 Jul 2000 05:31:16 GMT

It's been a while since Xerces was launched onto the world. And more
recently we received Crimson to compare it to. From experience and this
comparison, we've found a few things to be evident.

    * Xerces is performant on JDK 1.1 VMs. Very much so. Admirably
      so in fact.

    * Crimson isn't so optimized, yet it runs about as fast as Xerces
      does on modern VMs such as HotSpot. The HotSpot team told us
      that heavily optimized code for 1.1 would not benefit under
      HotSpot. We have the proof now. In fact, there are cases where
      Xerces seems to slow down.

    * Xerces has a large memory consumption. And a large Jar size.
      This probably wasn't an original design goal, but there is a
      category of users we've talked to who have an issue with

    * Use of Xerces is widespread. Obviously people want a good, high
      quality parser from a free source.

    * Xerces is a great product. It stands well in the marketplace.

    * However, because Xerces was heavily pre-optimized, it's
      extremely complex to understand and delve into. I think
      that this is best reflected in that most of the bits that
      go into Xerces come from IBM Cupertino.

    * In our analysis of the Xerces code base, we can't use it for
      future inclusion in the JDK. The pre-optimization is a killer.
      The code-complexity is a killer. And the memory consumption is
      a problem.

These are not unknown problems. Ted L. and I talked about the current Xerces
source base at length at ApacheCon (as we were working out details for
getting Crimson donated). Ted put forward the opinion that it might be best
to do a massive refactoring based on the lessons learned from both parsers.
To essentially ground up a new parser that has a heritage in both existing

I've come to the conclusion that I agree with him. After quite a bit of
discussion, the rest of the XML team at Sun, the people who are responsible
for the parser that will ship in the core of future JDKs, agree as well. It
is important to stress that we want to ship an Apache based parser in the
JDK for all the reasons that you'd expect. Apache code tends to be good
code. The Apache process is one that we believe in.

So, in the best of Apache traditions, we're gonna do something about it. I'm
creating a tree in the xml-contrib area in which to do a lot of code work to
explore how such a new parser could come to be. It's called Spinnaker.

This is the Spinnaker project description based on the README that will get
checked in:

Spinnaker is an attempt to create a next generation Apache XML Parser based
on all the lessons learned from the current versions of Xerces and Crimson.


    * Simple to read, maintainable code. Above all, this is the primary goal
      for any openly developed project as without the ability to read the
      code, it's impossible for people to contribute and get involved.

    * Smallest possible size. This means small distribution size (JAR file)
      and small memory footprint.

    * Modular. It should be possible to build a parser as a set of Jar files
      so that a smaller parser can be assembled which fits the need of a
      particular implementation. For example, in TV sets do you really need

    * Cleanly Optimized. This means optimized in a way that is compatible
      with modern virtual machines such as HotSpot. Optimizations that work
      well with JDK 1.1 style VMs can actually impact performance under
      more modern VMs. Optimizations that interfere with readability,
      modularity, or size will be shunned.

    * Collaboratively Developed. This means that we want *lots* of people
      from diverse backgrounds to participate in this barn raising.


In order to bootstrap what will essentially be a full refactoring of what an
XML parser is (based on our two existing ones), the following is a list of
possible checkpoints to hit.

    * First, factor out utility classes from both the Xerces and Crimson
      source bases. There is a lot of good work on things like the Xerces
      decoders which are faster than the JDK's. This is actually the start
      of an Apache wide common utility set (something that I'd like
      to see in the future as AUC -- Apache Utility Classes). We've talked
      about this before in other Apache projects, and there's a lot of
      good code that we can start it off with here.

    * Determine what the modular API looks like. What are the various
      pieces that can be factored out? How can we get to a point where it's
      easy to package a parser that doesn't include DOM or a particular
      validator? There's some work started on a branch, but it hasn't
      been touched in a month or so. This might serve as a start place.

    * Refactor out a base parser. Once we see how those APIs should look (or
      at least get a start, they don't have to be perfect :) we start at
      the bottom and look at the code of the existing parsers to come up
      with a basic non-validating parser that can rip through XML.

    * Set SAX on top of this base parser. Of course.

    * Look at pluggable validation.

    * Factor in tree based producers. We'd like to see DOM and JDOM up

    * Stability. By this point, we should have something that is starting
      to work well. Stability will be a driving goal then.
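
As a rough illustration of the layering these checkpoints describe, here is
a minimal Java sketch. It is not code from the Spinnaker tree, and every
name in it (EventSink, BaseParser, NestingChecker, RecordingSink) is
hypothetical. It assumes a tiny XML subset with no attributes, comments,
entities, or self-closing tags. The nesting check only stands in for a
pluggable layer; it is not DTD validation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical contract between the base parser and anything layered on top.
interface EventSink {
    void startElement(String name);
    void endElement(String name);
    void text(String chars);
}

// Hypothetical base parser: non-validating, handles only <name>, </name>, and text.
final class BaseParser {
    static void parse(String xml, EventSink sink) {
        int i = 0;
        while (i < xml.length()) {
            if (xml.charAt(i) == '<') {
                int close = xml.indexOf('>', i);
                if (close < 0) {
                    throw new IllegalArgumentException("unterminated tag at " + i);
                }
                String tag = xml.substring(i + 1, close);
                if (tag.startsWith("/")) {
                    sink.endElement(tag.substring(1));
                } else {
                    sink.startElement(tag);
                }
                i = close + 1;
            } else {
                int next = xml.indexOf('<', i);
                if (next < 0) {
                    next = xml.length();
                }
                sink.text(xml.substring(i, next));
                i = next;
            }
        }
    }
}

// Optional layer: checks tag nesting, then forwards each event to the next sink.
final class NestingChecker implements EventSink {
    private final EventSink next;
    private final Deque<String> open = new ArrayDeque<>();

    NestingChecker(EventSink next) {
        this.next = next;
    }

    public void startElement(String name) {
        open.push(name);
        next.startElement(name);
    }

    public void endElement(String name) {
        if (open.isEmpty() || !open.pop().equals(name)) {
            throw new IllegalStateException("mismatched </" + name + ">");
        }
        next.endElement(name);
    }

    public void text(String chars) {
        next.text(chars);
    }
}

// Top layer for the demo: records events where a SAX or tree builder would sit.
final class RecordingSink implements EventSink {
    final List<String> events = new ArrayList<>();

    public void startElement(String name) { events.add("start:" + name); }
    public void endElement(String name) { events.add("end:" + name); }
    public void text(String chars) { events.add("text:" + chars); }
}

final class SpinnakerSketchDemo {
    public static void main(String[] args) {
        RecordingSink out = new RecordingSink();
        BaseParser.parse("<a><b>hi</b></a>", new NestingChecker(out));
        System.out.println(out.events);
    }
}
```

Because the checker is just another sink, a build that does not need it can
pass the top-level sink straight to the base parser and leave that class out
of the Jar entirely.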

It should be said up front that this won't happen overnight. It will be a
while before any fruit starts to grow.


So, to close, a few thoughts...

Q. Isn't this a slam on the Xerces guys?
A. Nope. This is a natural thing that happens when people get an itch to
scratch in the Apache organization. It should be pointed out that Apache
Webserver 2.0 started out as a thought project, and that the next version of
Tomcat may very well be Catalina which was a similar refactoring of the
current Tomcat source base.

Q. When will this be ready?
A. Damn if I know. Not anytime immediately to be sure. There's a bit of work
to be done.

Q. Where's the repository gonna be?
A. $CVSROOT/xml-contrib/spinnaker

Q. When's the code going to go in?
A. Well, the initial little itty bit that I've done so far to set up a
directory structure and identify a few utility classes is going to be
checked in just a few minutes after this email goes out. I'll be working on
more pieces throughout the weekend that will beef things up.

Q. Is this Xerces 2.0?
A. No. Not Yet. And maybe Never. It would take the acceptance of the
developer community to be so. For the time being, it's just a code base
where some of us are going to hang out and work. It should be said that
software Darwinism could strike and this code base could go absolutely nowhere.
Or, as I hope, this is going to take off and really work out.

Q. Can I help?
A. Duh....

Oh and by the way, to help keep discussion separate, please use [spinnaker]
in your subject lines. This has been a help on the Tomcat lists.

That's all for now... Let the code start flowing. ;)

