xerces-c-dev mailing list archives

From "Andy Heninger" <an...@jtcsv.com>
Subject Re: DOM Performance
Date Thu, 01 Feb 2001 18:32:48 GMT
> From: "Dean Roddey" <droddey@charmedquark.com>
> > The Java parser/DOM does this, and I personally think its more complexity
> > than its worth,

From: "Jon Smirl" <jonsmirl@mediaone.net>
> It's not obvious to me that lazy transcoding is significantly worse even if
> you end up touching most of the document. The transcode on demand strategy
> could allow you to control the amount of memory used for transcoded buffers
> instead of forcing it all into memory at once. The smaller memory footprint
> would allow DOM manipulations of large documents without paging.

For the lazy transcoding approach taken by the Xerces-J parser, I
completely agree with Dean.  It's way too complex.  And it was abandoned
in the Xerces 2 Java parser for just this reason.

But there's an intermediate approach that might make sense in some cases.
(This is now bigger than the DOM - it's the whole parser.  And there are
certainly no plans for anybody at IBM to work on anything along these
lines - these are just random thoughts.)

Use UTF-8 for the internal character representation, rather than the
existing UTF-16.  All of the mark-up syntax for XML is in the 7-bit ASCII
set, and so would unambiguously appear as single-byte characters in UTF-8.
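
To make the single-byte point concrete, here is a minimal sketch.  The
names isAsciiByte and findTagOpens are made up for illustration and are
not part of Xerces.  It relies only on the UTF-8 property that every byte
of a multi-byte sequence has its high bit set, so a plain byte comparison
against '<' can never land in the middle of a character.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// In UTF-8, lead bytes are 0xC2..0xF4 and continuation bytes are
// 0x80..0xBF, so any byte below 0x80 is a complete ASCII character.
inline bool isAsciiByte(unsigned char b) { return b < 0x80; }

// Hypothetical helper: byte offsets of every '<' in a UTF-8 buffer,
// found with a plain byte scan and no decoding.
std::vector<std::size_t> findTagOpens(const std::string& utf8) {
    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if (utf8[i] == '<') offsets.push_back(i);
    }
    return offsets;
}
```

For the input "<p>caf\xC3\xA9</p>" (the last letter is a two-byte
sequence) the scan reports offsets 0 and 8, the two real tag opens.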

Have a non-validating, slime-ball parse mode that is not fully checked
for well-formedness: it does not check for correct name characters, but
only considers what can be checked looking at single bytes in UTF-8 -
spaces, quotes, <, >, &, !, CR, LF, etc.  Don't transcode docs that are
in an ASCII-based encoding such as ISO-8859-anything.  And deliver
document content back in the encoding in which it was received.
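
A rough sketch of what such a byte-level scan could look like follows.
Token and looseScan are hypothetical names, not anything in Xerces, and
the sketch assumes input with no comments, CDATA sections, processing
instructions or DOCTYPE, all of which would need their own terminators.
It looks only at <, > and the two quote characters, and hands every
token back as the raw bytes it received.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Token {
    bool isMarkup;      // true for "<...>", false for character data
    std::string bytes;  // raw bytes, still in the input encoding
};

// Hypothetical loose scan: splits the input into tags and text by
// looking at single bytes only. Names and content are never checked.
std::vector<Token> looseScan(const std::string& in) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const std::size_t start = i;
        if (in[i] == '<') {
            char quote = 0;  // nonzero while inside an attribute value
            ++i;
            while (i < in.size()) {
                const char c = in[i++];
                if (quote) {
                    if (c == quote) quote = 0;
                } else if (c == '"' || c == '\'') {
                    quote = c;
                } else if (c == '>') {
                    break;   // end of tag; '>' inside quotes is skipped
                }
            }
            out.push_back({true, in.substr(start, i - start)});
        } else {
            while (i < in.size() && in[i] != '<') ++i;
            out.push_back({false, in.substr(start, i - start)});
        }
    }
    return out;
}
```

Fed an ISO-8859-1 buffer, the text tokens come back with their 0x80 and
above bytes untouched, which is the "deliver it back as received" part.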

Content in wholly incompatible encodings such as EBCDIC or UTF-16 would
need to be transcoded up front to UTF-8.
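
For the UTF-16 case, the up-front step could be as small as the sketch
below.  utf16ToUtf8 is a hypothetical standalone helper, not a Xerces
transcoder; it assumes the input is already in native byte order and
replaces unpaired surrogates with U+FFFD.  EBCDIC would need a table
lookup per code page and is not shown.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: transcode UTF-16 code units to UTF-8 bytes.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```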

Some of the multi-byte Windows encodings for Asian languages - Shift-JIS,
for example - would present an interesting problem.  They would mostly
work, except for '[' and ']', which can appear as the second byte of a
multi-byte character.  So if CDATA sections and DTDs were disallowed,
documents in these encodings could be processed without transcoding as
well.
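
The Shift-JIS claim can be checked mechanically.  The sketch below uses
the usual Shift-JIS / code page 932 byte ranges (lead bytes 0x81-0x9F
and 0xE0-0xFC, second bytes 0x40-0x7E and 0x80-0xFC).  The function
names and the particular list of markup bytes passed in are my own
choices for illustration.

```cpp
#include <string>

// Shift-JIS byte ranges, including the code page 932 extensions.
inline bool isSjisLead(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}
inline bool isSjisTrail(unsigned char b) {
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

// Hypothetical helper: returns the bytes of `markup` that can also occur
// as the second byte of a double-byte character, and so cannot be
// trusted in a scan that does not track lead bytes.
std::string unsafeInSjis(const std::string& markup) {
    std::string unsafe;
    for (unsigned char b : markup) {
        if (isSjisTrail(b)) unsafe += static_cast<char>(b);
    }
    return unsafe;
}
```

Calling unsafeInSjis(" \t\r\n\"'<>&!/=?;#[]") leaves only "[]", since
every other byte in that list is below 0x40.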


Andy Heninger
IBM XML Technology Group, Cupertino, CA
heninger@us.ibm.com


