xerces-c-dev mailing list archives

From "Andy Heninger" <an...@jtcsv.com>
Subject Re: DOM Performance
Date Thu, 01 Feb 2001 18:32:48 GMT
> From: "Dean Roddey" <droddey@charmedquark.com>
> > The Java parser/DOM does this, and I personally think it's more
> > than it's worth,

From: "Jon Smirl" <jonsmirl@mediaone.net>
> It's not obvious to me that lazy transcoding is significantly worse even
> if you end up touching most of the document. The transcode on demand
> could allow you to control the amount of memory used for transcoded
> text instead of forcing it all into memory at once. The smaller memory
> footprint would allow DOM manipulations of large documents without paging.

For the lazy transcoding approach taken by the Xerces-J parser, I
completely agree with Dean.  It's way too complex.  And it was abandoned
in the Xerces 2 Java parser for just this reason.

But there's an intermediate approach that might make sense in some cases.
(This is now bigger than the DOM - it's the whole parser.  And there are
certainly no plans for anybody at IBM to work on anything along these
lines - these are just random thoughts.)

Use UTF-8 for the internal character representation, rather than the
existing UTF-16.  All of the mark-up syntax for XML is in the 7 bit ASCII
set, and so would unambiguously appear as single byte characters in UTF-8.
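
To illustrate why the markup characters are unambiguous in UTF-8, here is
a minimal sketch (not Xerces code; the names isAsciiByte and
findMarkupByte are made up for this example).  Every byte of a multi-byte
UTF-8 sequence has its high bit set, so a byte below 0x80 is always a
complete ASCII character and a plain byte comparison cannot produce a
false match.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True for bytes that are complete ASCII characters in UTF-8.
// Lead and continuation bytes of multi-byte sequences are all >= 0x80.
bool isAsciiByte(unsigned char b) {
    return b < 0x80;
}

// Hypothetical helper: returns the offset of every byte equal to `markup`.
// Because `markup` is ASCII, each hit is a real markup character and never
// a fragment of a longer UTF-8 sequence.
std::vector<std::size_t> findMarkupByte(const std::string& utf8, char markup) {
    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if (utf8[i] == markup) {
            offsets.push_back(i);
        }
    }
    return offsets;
}
```

For the UTF-8 bytes of <a>é</a> (where é is 0xC3 0xA9), searching for '<'
reports offsets 0 and 5; the two bytes of é are skipped without any
decoding.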

Have a non-validating, not fully checked for well-formedness, slime-ball
parse mode that does not check for correct name characters, but only
considers what can be checked looking at single bytes in UTF-8 - spaces,
quotes, <, >, &, !, CR, LF, etc.  Don't transcode docs that are in an
ASCII based encoding such as ISO-8859-anything.  And deliver document
content back in the encoding in which it was received.
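
A rough sketch of what such a byte-level scan might look like, purely as
an illustration of the idea.  The Token type and looseScan function are
hypothetical names, not anything in Xerces, and the sketch assumes an
ASCII-compatible encoding and ignores comments, CDATA sections, DTDs and
entity references.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type for illustration only.
struct Token {
    enum Kind { Markup, Text } kind;
    std::string bytes;  // raw bytes, still in the document's own encoding
};

// Splits a buffer into markup and text runs by looking at single bytes.
// No name-character checks, no well-formedness checks, no transcoding.
std::vector<Token> looseScan(const std::string& in) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '<') {
            // Find the closing '>' that is not inside a quoted attribute value.
            char quote = 0;
            std::size_t j = i + 1;
            for (; j < in.size(); ++j) {
                char c = in[j];
                if (quote) {
                    if (c == quote) quote = 0;
                } else if (c == '"' || c == '\'') {
                    quote = c;
                } else if (c == '>') {
                    break;
                }
            }
            // An unterminated tag simply runs to the end of the buffer.
            std::size_t end = (j < in.size()) ? j + 1 : in.size();
            out.push_back({Token::Markup, in.substr(i, end - i)});
            i = end;
        } else {
            // Character data runs up to the next '<' and is passed through as-is.
            std::size_t j = in.find('<', i);
            if (j == std::string::npos) j = in.size();
            out.push_back({Token::Text, in.substr(i, j - i)});
            i = j;
        }
    }
    return out;
}
```

Given the ISO-8859-1 bytes for <p title="a>b">café</p>, this yields a
markup token, a text token whose bytes are untouched (é stays as the
single byte 0xE9), and a closing markup token.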

Content in wholly incompatible encodings such as EBCDIC or UTF-16 would
need to be transcoded up front to UTF-8.

Some of the multi-byte Windows encodings for Asian languages - Shift-JIS,
for example - would present an interesting problem.  They would mostly
work, except for '[' and ']', which can appear as the second byte of a
multi-byte character.  So if CDATA sections and DTDs were disallowed,
documents in these encodings could be processed without transcoding as

Andy Heninger
IBM XML Technology Group, Cupertino, CA
