xml-general mailing list archives

From Andy Clark <an...@apache.org>
Subject Re: Xerces2 requirements
Date Tue, 19 Sep 2000 19:29:02 GMT
Scott_Boag@lotus.com wrote:
> Perhaps to expand on this... there should be some way to get to 
> the raw, unencoded character buffer for text nodes, and to have a 
> way for the parser to not encode the text if a switch is thrown.  

For the rest of the people involved in this discussion, I want
to point out that what Scott is talking about is really the
underlying bytes of the input stream. Typically, the use of
the word "character" implies that the byte(s) have already been
transcoded into a Unicode character.
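
To make the byte/character distinction concrete, here is a minimal
Java sketch. The class and method names are made up for illustration
and are not part of Xerces. It shows the same four-character text
taking a different number of bytes depending on the encoding, which is
why "the raw buffer" and "the characters" are not interchangeable.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not a Xerces class.
class BytesVersusChars {

    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // Number of Unicode characters (UTF-16 code units) after transcoding.
    static int decodedLength(byte[] bytes, Charset charset) {
        return new String(bytes, charset).length();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e

        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 bytes:      " + utf8.length);
        System.out.println("ISO-8859-1 bytes: " + latin1.length);
        System.out.println("characters:       "
                + decodedLength(utf8, StandardCharsets.UTF_8));
    }
}
```

The accented letter is two bytes in UTF-8 and one byte in ISO-8859-1,
but a single character once transcoded in either case.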

> The reason is for high performance transformation when the input 
> encoding is the same as the output encoding, and the text doesn't 
> have to be explored by either the parser or the transformer.  

I can understand the performance benefit, but it's really not
going to be possible to do this. It adds an amazing amount of
complexity to the parser and tree model implementation. I feel
that the complexity would be unacceptable for source that needs
to be maintained and extended in the future.

> Sorry, I know this sounds hard, but we need a way of super-
> charging certain types of (e-business) transformations.

Then those people will need a custom parser to support their
needs and get the performance they require. But I think that
building this into Xerces would cripple the parser and we'd be
back where we started. The current code is closer to being able
to support this kind of feature because it defers transcoding
and keeps the underlying byte buffers around until needed. And
the state of the current code is why we're working on the
requirements for the next version.
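
As a rough picture of what "defers transcoding and keeps the
underlying byte buffers around" can mean, here is a minimal Java
sketch. It is not taken from the Xerces source: DeferredText,
asString, isDecoded and encodeFor are hypothetical names, and the
sketch assumes the text span has already been located in the input
and contains nothing that needs escaping on output.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not a Xerces class.
final class DeferredText {
    private final byte[] buffer;    // raw input bytes, shared with the reader
    private final int offset;       // start of this text span in the buffer
    private final int length;       // length of the span in bytes
    private final Charset encoding; // encoding of the input document
    private String decoded;         // stays null until characters are requested

    DeferredText(byte[] buffer, int offset, int length, Charset encoding) {
        this.buffer = buffer;
        this.offset = offset;
        this.length = length;
        this.encoding = encoding;
    }

    // Transcodes at most once, and only when a caller needs characters.
    String asString() {
        if (decoded == null) {
            decoded = new String(buffer, offset, length, encoding);
        }
        return decoded;
    }

    boolean isDecoded() {
        return decoded != null;
    }

    // Produces the bytes for an output document. When the output encoding
    // matches the input encoding the raw span is copied without transcoding.
    byte[] encodeFor(Charset outputEncoding) {
        if (encoding.equals(outputEncoding)) {
            return Arrays.copyOfRange(buffer, offset, offset + length);
        }
        return asString().getBytes(outputEncoding);
    }

    public static void main(String[] args) {
        byte[] doc = "<a>caf\u00e9</a>".getBytes(StandardCharsets.UTF_8);
        DeferredText text = new DeferredText(doc, 3, 5, StandardCharsets.UTF_8);

        byte[] same = text.encodeFor(StandardCharsets.UTF_8);
        System.out.println(same.length + " bytes copied, decoded=" + text.isDecoded());

        byte[] other = text.encodeFor(StandardCharsets.ISO_8859_1);
        System.out.println(other.length + " bytes transcoded, decoded=" + text.isDecoded());
    }
}
```

Even in this toy form, every consumer of the text has to deal with two
representations (bytes plus encoding, or characters), which gives some
feel for the complexity concern raised above.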

Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org
