xerces-c-dev mailing list archives

From "Jon Smirl" <jonsm...@mediaone.net>
Subject Re: DOM Performance
Date Thu, 01 Feb 2001 19:04:19 GMT
How about this variation on the theme...

Merge the input transcoding logic and tokenizing of XML into a single piece
of code. Then build a front end for each of the encoding groups. Xerces
already has all of this logic in it; it's just combined in a different way.

Are there only four major groupings or are there more?
1) ISO-8859, UTF-8 family
2) UTF-16
4) Shift-JIS

The events from the parser would then be something like:
   start of document, here's a pointer to the transcoding function
   start tag, here's a pointer to the name in binary and a length
   attribute, here's a pointer to the name in binary and a length
   text node, here's a pointer to the text in binary and a length
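
As a rough sketch of what such an event interface might look like in C++
(every name here is hypothetical and none of it is existing Xerces API; the
attribute value span and the ISO-8859-1 transcoder are assumptions added
only to make the example complete):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical: converts `len` bytes in the document's source encoding to UTF-16.
using TranscodeFn = std::u16string (*)(const unsigned char* bytes, std::size_t len);

// A span of untranscoded bytes inside the input buffer (not owned).
struct RawSpan {
    const unsigned char* data;
    std::size_t length;
};

// Callback interface mirroring the events listed above.
class RawEventHandler {
public:
    virtual ~RawEventHandler() = default;
    virtual void startDocument(TranscodeFn transcode) = 0;
    virtual void startTag(RawSpan name) = 0;
    // Assumption: the attribute value is passed as a second raw span.
    virtual void attribute(RawSpan name, RawSpan value) = 0;
    virtual void text(RawSpan content) = 0;
};

// Example transcoder: ISO-8859-1 bytes map one-to-one onto the first 256 code points.
inline std::u16string latin1ToUtf16(const unsigned char* bytes, std::size_t len) {
    assert(bytes != nullptr || len == 0);
    std::u16string out;
    out.reserve(len);
    for (std::size_t i = 0; i < len; ++i) {
        out.push_back(static_cast<char16_t>(bytes[i]));
    }
    return out;
}

// Example handler: transcodes element names only and ignores everything else.
class NameCollector : public RawEventHandler {
public:
    void startDocument(TranscodeFn transcode) override { transcode_ = transcode; }
    void startTag(RawSpan name) override {
        names.push_back(transcode_(name.data, name.length));
    }
    void attribute(RawSpan, RawSpan) override {}
    void text(RawSpan) override {}

    std::vector<std::u16string> names;

private:
    TranscodeFn transcode_ = nullptr;
};
```

A handler that never looks at text content, like the one above, never pays
to transcode it.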

The DOM would store the binary pointers initially to allow mem-mapped files.
If a DOM node is accessed, the binary buffer would be lazily transcoded into
Unicode using the initial transcoding function. In other words, I would treat
the internal character representation as opaque, with the only legal
operations being copy or transcode.
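
A minimal sketch of such a lazily transcoded node, assuming a transcoding
function pointer handed over at start of document (the class and function
names are made up for illustration, and the cache is not thread-safe):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical: converts `len` bytes in the source encoding to UTF-16.
using TranscodeFn = std::u16string (*)(const unsigned char* bytes, std::size_t len);

// Example transcoder for ISO-8859-1, where each byte is its own code point.
inline std::u16string widenLatin1(const unsigned char* bytes, std::size_t len) {
    std::u16string out;
    out.reserve(len);
    for (std::size_t i = 0; i < len; ++i) {
        out.push_back(static_cast<char16_t>(bytes[i]));
    }
    return out;
}

class LazyTextNode {
public:
    LazyTextNode(const unsigned char* raw, std::size_t len, TranscodeFn transcode)
        : raw_(raw), len_(len), transcode_(transcode) {
        assert(transcode != nullptr);
    }

    // Transcode: done on first access only, then served from the cache.
    const std::u16string& value() const {
        if (!transcoded_) {
            cached_ = transcode_(raw_, len_);
            transcoded_ = true;
        }
        return cached_;
    }

    // Copy: hands back the original bytes without interpreting them.
    std::string rawCopy() const {
        return std::string(reinterpret_cast<const char*>(raw_), len_);
    }

    bool isTranscoded() const { return transcoded_; }

private:
    const unsigned char* raw_;  // points into the (possibly mem-mapped) input; not owned
    std::size_t len_;
    TranscodeFn transcode_;
    mutable bool transcoded_ = false;
    mutable std::u16string cached_;
};
```

The input buffer has to outlive every node that points into it, which is the
price of not copying up front.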

To make this more efficient for tags and attributes, building a 'discovered'
schema cache of transcoded tag and attribute names would be useful.
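
One way such a cache could look, keyed on the raw bytes of the name so each
distinct tag or attribute name is transcoded once (hypothetical names again;
a real version would probably hash the span in place rather than build a
std::string key for every lookup):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical: converts `len` bytes in the source encoding to UTF-16.
using TranscodeFn = std::u16string (*)(const unsigned char* bytes, std::size_t len);

// Example transcoder for ISO-8859-1, where each byte is its own code point.
inline std::u16string expandLatin1(const unsigned char* bytes, std::size_t len) {
    std::u16string out;
    out.reserve(len);
    for (std::size_t i = 0; i < len; ++i) {
        out.push_back(static_cast<char16_t>(bytes[i]));
    }
    return out;
}

class NameCache {
public:
    explicit NameCache(TranscodeFn transcode) : transcode_(transcode) {
        assert(transcode != nullptr);
    }

    // Returns the transcoded name, transcoding only the first time it is seen.
    // References into an unordered_map stay valid when later names are added.
    const std::u16string& lookup(const unsigned char* raw, std::size_t len) {
        std::string key(reinterpret_cast<const char*>(raw), len);
        auto it = names_.find(key);
        if (it == names_.end()) {
            ++misses_;
            it = names_.emplace(std::move(key), transcode_(raw, len)).first;
        }
        return it->second;
    }

    std::size_t misses() const { return misses_; }
    std::size_t size() const { return names_.size(); }

private:
    TranscodeFn transcode_;
    std::unordered_map<std::string, std::u16string> names_;
    std::size_t misses_ = 0;
};
```

Documents tend to repeat a small set of names many times, so most lookups
should hit the cache.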

From: "Andy Heninger" <andyh@jtcsv.com>
> Use UTF-8 for the internal character representation, rather than the
> existing UTF-16.  All of the mark-up syntax for XML is in the 7 bit ASCII
> set, and so would unambiguously appear as single byte characters in UTF-8.
> Have a non-validating, not fully checked for well-formedness, slime-ball
> parse mode that does not check for correct name characters, but only
> considers what can be checked looking at single bytes in UTF-8 - spaces,
> quotes, <, >, &, !, CR, LF, etc.  Don't transcode docs that are in an
> ASCII based encoding such as ISO-8859-anything.  And deliver document
> content back in the encoding in which it was received.
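
To make the quoted idea concrete, here is a small sketch of a byte-only scan
that picks out start-tag names without decoding anything. It relies on the
fact that in UTF-8 (and in ISO-8859-x) bytes below 0x80 never occur inside a
multi-byte character. The function name is invented, and the scan knowingly
ignores comments, CDATA sections and well-formedness:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// True for the single-byte characters that can end an element name.
inline bool isNameEnd(char c) {
    return c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '>' || c == '/';
}

// Returns (offset, length) of each start-tag name, found by byte comparison only.
// Names are not validated; non-ASCII name bytes are passed through untouched.
inline std::vector<std::pair<std::size_t, std::size_t>>
scanStartTags(const std::string& bytes) {
    std::vector<std::pair<std::size_t, std::size_t>> names;
    std::size_t i = 0;
    while (i < bytes.size()) {
        if (bytes[i] != '<') {
            ++i;
            continue;
        }
        const std::size_t start = i + 1;
        // Skip end tags, declarations/comments and processing instructions.
        if (start >= bytes.size() || bytes[start] == '/' ||
            bytes[start] == '!' || bytes[start] == '?') {
            i = start;
            continue;
        }
        std::size_t end = start;
        while (end < bytes.size() && !isNameEnd(bytes[end])) {
            ++end;
        }
        names.emplace_back(start, end - start);
        i = end;
    }
    return names;
}
```

The (offset, length) pairs are exactly the kind of binary pointer and length
that the events described earlier would carry.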

Jon Smirl
