xml-general mailing list archives

From Andy Clark <an...@apache.org>
Subject Re: How to start writing a non-blocking SAX parser
Date Wed, 01 May 2002 07:06:40 GMT
Aleksander Slominski wrote:
> that is one of reason why standardization of doing XML  pull parsing
> is important.

The Xerces2 model of pull parsing is very different from that
of XPP (and other pull parsing APIs). In reality, it's more of
a hybrid of pull and push parsing: the calls to "parse" are
made in pull fashion, but the information from each pull is
pushed to the handlers.

So putting a real pull parsing API on top of the output from 
an XNI pull parser configuration is what should be done. But
what pull parsing API should that be? 

> it gathers information from callback(s) to return to the user just one event.
> if XNI parse(false)  were too many callbacks the
> xni2xmlpull parser should throw exception (it is _not_ tested ...)

I briefly skimmed the code. My first impression is that I
would have done it differently. (But I think that of any code
written by someone else... ;) 

First, I think I would prefer a different API for pull parsing.
Just from an object-oriented standpoint, I don't like having
all of the accessor methods on the XmlPullParser interface. I
would have chosen to return different event objects. Each
event object would then have public fields for its data (to
avoid method calls) and specific methods for added functionality.
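
A rough sketch of what such event objects could look like is shown
here. Every class, field, and method name in it is hypothetical; none
of them come from the XmlPullParser API or from Xerces.

```java
// Hypothetical event types for illustration only; not part of XMLPULL or Xerces.
abstract class PullEvent {
    static final int START_ELEMENT = 1;
    static final int END_ELEMENT = 2;
    static final int TEXT = 3;

    /** Public field so callers can switch on the type without a method call. */
    public final int type;

    PullEvent(int type) {
        this.type = type;
    }
}

final class StartElementEvent extends PullEvent {
    public final String name;
    public final String[] attrNames;
    public final String[] attrValues;

    StartElementEvent(String name, String[] attrNames, String[] attrValues) {
        super(START_ELEMENT);
        this.name = name;
        this.attrNames = attrNames;
        this.attrValues = attrValues;
    }

    /** Added functionality: returns the value of the named attribute, or null. */
    public String attribute(String attrName) {
        for (int i = 0; i < attrNames.length; i++) {
            if (attrNames[i].equals(attrName)) {
                return attrValues[i];
            }
        }
        return null;
    }
}

final class EndElementEvent extends PullEvent {
    public final String name;

    EndElementEvent(String name) {
        super(END_ELEMENT);
        this.name = name;
    }
}

final class TextEvent extends PullEvent {
    public final char[] ch;
    public final int offset;
    public final int length;

    TextEvent(char[] ch, int offset, int length) {
        super(TEXT);
        this.ch = ch;
        this.offset = offset;
        this.length = length;
    }

    /** Added functionality: true if the text consists only of whitespace. */
    public boolean isWhitespace() {
        for (int i = offset; i < offset + length; i++) {
            if (!Character.isWhitespace(ch[i])) {
                return false;
            }
        }
        return true;
    }
}
```

An application would switch on the public "type" field and then read
the data fields of the concrete event class directly.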

Some of the extra functionality that I'm referring to would
be methods that make processing XML documents in a pull manner
as easy as possible. The XmlPullParser API has some methods
that do these things. For example, the "nextTag" method lets
the application skip intervening text and just jump to the
element boundaries. Very nice feature. But I would also like
to have a method that allows me to skip to a start element's
end tag, returning all of the text within that element.
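
To make the idea of skipping to an element's end tag concrete, here
is a minimal sketch. It is written against the standard Java StAX
interface (javax.xml.stream.XMLStreamReader) rather than XmlPullParser
or XNI, purely so that it runs without extra libraries; the class and
method names PullHelpers and skipToEndTag are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class PullHelpers {
    private PullHelpers() {}

    /**
     * Expects the reader to be positioned on a START_ELEMENT. Advances it to
     * the matching END_ELEMENT and returns all text seen on the way,
     * including text inside nested child elements.
     */
    static String skipToEndTag(XMLStreamReader reader) throws XMLStreamException {
        if (reader.getEventType() != XMLStreamConstants.START_ELEMENT) {
            throw new IllegalStateException("expected START_ELEMENT");
        }
        StringBuilder text = new StringBuilder();
        int depth = 1; // number of currently open elements
        while (depth > 0) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    depth++;
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    depth--;
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                    text.append(reader.getTextCharacters(),
                                reader.getTextStart(),
                                reader.getTextLength());
                    break;
                default:
                    break; // comments, processing instructions, etc. are skipped
            }
        }
        return text.toString();
    }

    /** Small usage example on an in-memory document. */
    static String demo() {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader("<p>Hello <b>pull</b> world</p>"));
            reader.nextTag(); // move to <p>
            return skipToEndTag(reader);
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Tracking the element depth is what lets the helper pass over nested
child elements while still collecting their text.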

Second, assuming that we start from the XmlPullParser API,
then there are a few things that need to be handled within an
implementation driven by an XNI pull parser configuration. I
already mentioned it but I'll state it again for completeness:
event queueing. Due to the pipeline nature of the XNI parser
configuration, when working with generic configurations you
can't guarantee that only one event (or at least only one
event that you then forward as a pull parsing event) will
occur during a single call to "parse".
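
The queueing idea can be sketched as follows. To stay self-contained
the sketch does not use the real XNI interfaces: PushHandler,
StepParser, FakeStepParser, and QueueingPullAdapter are all
hypothetical stand-ins, and events are plain strings.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a handler that receives pushed callbacks. */
interface PushHandler {
    void startElement(String name);
    void characters(String text);
    void endElement(String name);
}

/** Hypothetical stand-in for a parser driven by repeated "parse" calls. */
interface StepParser {
    /** Does one unit of work, pushing zero or more callbacks; false when done. */
    boolean parseStep(PushHandler handler);
}

/** Turns pushed callbacks into one-event-at-a-time pulls by queueing them. */
final class QueueingPullAdapter implements PushHandler {
    private final StepParser parser;
    private final Deque<String> queue = new ArrayDeque<>();
    private boolean more = true;

    QueueingPullAdapter(StepParser parser) {
        this.parser = parser;
    }

    @Override public void startElement(String name) { queue.add("START " + name); }
    @Override public void characters(String text) { queue.add("TEXT " + text); }
    @Override public void endElement(String name) { queue.add("END " + name); }

    /** Returns the next event, or null once the document is exhausted. */
    String next() {
        // A single step may push nothing, or several callbacks at once.
        while (queue.isEmpty() && more) {
            more = parser.parseStep(this);
        }
        return queue.poll();
    }
}

/** Fake parser whose steps push zero, one, or two callbacks. */
final class FakeStepParser implements StepParser {
    private int step = 0;

    @Override
    public boolean parseStep(PushHandler handler) {
        switch (step++) {
            case 0:
                handler.startElement("doc");
                return true;
            case 1:
                return true; // a step that pushes nothing
            case 2:
                handler.startElement("br"); // two callbacks in one step
                handler.endElement("br");
                return true;
            default:
                handler.endElement("doc");
                return false;
        }
    }
}
```

Calling next() repeatedly on new QueueingPullAdapter(new
FakeStepParser()) hands back one event per call, regardless of how
many callbacks each underlying step produced.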

And since event queueing would then require you to buffer
event information, I have a specific feature that I would
add to the Xerces implementation to make it perform a little
better. Arguably, one of the biggest wastes of time would be
the copying of character content from the "characters" and
"ignorableWhitespace" callbacks into a buffer so that runs
of contiguous text can be returned together. So I would add
a feature to Xerces so that the entity scanner (which is
actually implemented by the entity manager) would not reuse
the character buffers. That way I would not have to copy
any characters at all because I would know that the contents
of the char buffers would not be overwritten.
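
The reason buffer reuse forces a copy can be illustrated with a small
simulation. This is not Xerces code: Chunk, FakeScanner, and
BufferReuseDemo are hypothetical names, and the "reuseBuffer" flag
merely stands in for the kind of feature described above.

```java
import java.util.ArrayList;
import java.util.List;

/** A reference into a character buffer; no characters are copied. */
final class Chunk {
    final char[] ch;
    final int offset;
    final int length;

    Chunk(char[] ch, int offset, int length) {
        this.ch = ch;
        this.offset = offset;
        this.length = length;
    }
}

/** Hypothetical scanner that either reuses one buffer or allocates a fresh one. */
final class FakeScanner {
    private final boolean reuseBuffer;
    private char[] buffer = new char[8];

    FakeScanner(boolean reuseBuffer) {
        this.reuseBuffer = reuseBuffer;
    }

    Chunk scan(String text) {
        if (text.length() > 8) {
            throw new IllegalArgumentException("demo buffer holds at most 8 chars");
        }
        if (!reuseBuffer) {
            buffer = new char[8]; // earlier chunks keep pointing at intact data
        }
        text.getChars(0, text.length(), buffer, 0);
        return new Chunk(buffer, 0, text.length());
    }
}

final class BufferReuseDemo {
    private BufferReuseDemo() {}

    /** Holds on to chunk references without copying, then joins them at the end. */
    static String collect(boolean reuseBuffer) {
        FakeScanner scanner = new FakeScanner(reuseBuffer);
        List<Chunk> held = new ArrayList<>();
        held.add(scanner.scan("Hello, "));
        held.add(scanner.scan("world"));
        StringBuilder joined = new StringBuilder();
        for (Chunk c : held) {
            joined.append(c.ch, c.offset, c.length);
        }
        return joined.toString();
    }
}
```

With reuse enabled, the second scan overwrites the characters that
the first held chunk still points to, so collect(true) yields
"world, world"; with reuse disabled, collect(false) yields
"Hello, world" without any intermediate copying.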

> you can try an old version that i wrote for XPP2
> to use XNI pull parsing, for example look on:
> http://www.extreme.indiana.edu/xgws/xsoap/xpp/download/PullParser2/src/java/x2/

This is the implementation that I was looking at.

> i have also started xni2xmlpull project at sourceforge.net to
> implement XMLPULL V1 API (http://www.xmlpull.org)
> using Xerces2 - the project is under Apache license.

This is where I was browsing the XmlPullParser API.

> i would be very interested in making xni2xmlpull into sub-project
> of Xerces2 as it would allow users of Xerces 2 to access
> and use simple XMLPULL API. my implementation (alpha)
> is very small and together with XMLPULL API classes
> its jar file should be about 20K addition to Xerces2 jar.
> however i do not know what is the current procedure
> to create sub project?

With the current state of the pull API and implementations,
it might be better as it currently is -- hosted elsewhere --
but with a link from the Xerces2 pages listing all of the
projects that are built on (or using) Xerces. This is the
direction that my NekoHTML parser seems to be leaning as

Andy Clark * andyc@apache.org