xerces-c-dev mailing list archives

From "Alberto Massari (JIRA)" <xerces-c-...@xml.apache.org>
Subject [jira] Updated: (XERCESC-770) IANA charset names list inefficient; useful?
Date Tue, 02 Nov 2004 14:02:09 GMT
     [ http://nagoya.apache.org/jira/browse/XERCESC-770?page=history ]

Alberto Massari updated XERCESC-770:

    Priority: Major

> IANA charset names list inefficient; useful?
> --------------------------------------------
>          Key: XERCESC-770
>          URL: http://nagoya.apache.org/jira/browse/XERCESC-770
>      Project: Xerces-C++
>         Type: Bug
>   Components: Utilities
>     Versions: 2.1.0
>  Environment: Operating System: All
> Platform: All
>     Reporter: Markus Scherer
>     Assignee: Xerces-C Developers Mailing List

> The IANA charset names list is stored inefficiently. It alone takes up 200 kB 
> in the Xerces library.
> internal/IANAEncodings.hpp contains const XMLCh gEncodingArray[791][128]. This 
> uses sizeof(XMLCh)*791*128 or about 200000 bytes. Most of the names are shorter 
> than 15 or so characters, and only ASCII characters are ever used in IANA 
> charset names. The names should therefore be stored as ASCII bytes, and only as 
> many per name as necessary.
> A simpler way to make this array smaller: the IANA charset registration 
> imposes an upper limit of 40 characters for charset names. Only two 
> registered names violate this (I think), and they could be safely omitted. 
> With space for the NUL, 41 characters per name suffice; 128 is way overkill.
> I also wonder whether this list is useful at all. Xerces only verifies that a 
> name exists in the list. It does not verify that it has a converter for it 
> (other than failing to open it, which does not use this list). It cannot verify 
> that the charset declared by the XML document matches the converter that 
> Xerces is going to open for this name (e.g., mismatches between Shift-JIS etc. 
> among Windows/Unix/mainframe, see W3C Japanese profile for XML).
> I suggest adding a compile-time option (#ifdef) to remove the IANA charset 
> name list (#ifdef out the use of EncodingValidator in util/TransService.cpp).
> Note that ICU4C 2.2+ has data structures and APIs for dealing with charset 
> names associated with various standards (like IANA) and platforms. ICU4C does 
> not have a complete list of IANA names, but this is a matter of adding them to 
> its convrtrs.txt, not a real implementation issue.
> Best regards,
> markus
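
The size figures in the report can be checked with a short sketch. It assumes a 16-bit XMLCh, which is what the roughly 200000-byte figure implies. The names XMLCh16, kPackedNames and containsName are hypothetical and not part of Xerces, and only a few sample charset names are listed rather than the full IANA registry.

```cpp
#include <cstddef>
#include <cstring>

// Assumption: XMLCh is a 16-bit code unit, as the report's figure implies.
typedef unsigned short XMLCh16;

const std::size_t kNameCount = 791;

// Layout described in the report: 791 names, 128 XMLCh each.
const std::size_t kCurrentBytes = sizeof(XMLCh16) * kNameCount * 128;

// Same layout capped at 40 characters plus the NUL, still stored as XMLCh.
const std::size_t kCappedBytes = sizeof(XMLCh16) * kNameCount * (40 + 1);

// Hypothetical compact form: NUL-terminated ASCII names stored back to back.
// The implicit NUL at the end of the literal yields an empty string that
// marks the end of the table.
static const char kPackedNames[] =
    "US-ASCII\0"
    "UTF-8\0"
    "ISO-8859-1\0"
    "Shift_JIS\0";

// Case-sensitive linear scan over the packed table. Charset names are
// normally compared case-insensitively, so a real lookup would fold case.
bool containsName(const char* table, const char* name) {
    for (const char* p = table; *p != '\0'; p += std::strlen(p) + 1) {
        if (std::strcmp(p, name) == 0) {
            return true;
        }
    }
    return false;
}
```

With these assumptions kCurrentBytes is 202496 and kCappedBytes is 64862. The packed ASCII form uses one byte per character and only as many bytes per name as the name needs, at the cost of a linear scan unless an index is added.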

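The compile-time option suggested in the report might look roughly like the following. This is a sketch only: the macro name XERCES_NO_IANA_VALIDATION and both functions are invented for illustration, and knownIanaName merely stands in for the EncodingValidator lookup. It is not the actual code in util/TransService.cpp.

```cpp
#include <cstddef>
#include <cstring>

// XERCES_NO_IANA_VALIDATION is a hypothetical macro name for this sketch.
#ifndef XERCES_NO_IANA_VALIDATION
// Stand-in for the EncodingValidator lookup, with a few sample names only.
static bool knownIanaName(const char* name) {
    static const char* const kSample[] = { "UTF-8", "US-ASCII", "ISO-8859-1" };
    for (std::size_t i = 0; i < sizeof(kSample) / sizeof(kSample[0]); ++i) {
        if (std::strcmp(kSample[i], name) == 0) {
            return true;
        }
    }
    return false;
}
#endif

// Returns true if the name may be passed on to the converter lookup.
bool acceptEncodingName(const char* name) {
#ifdef XERCES_NO_IANA_VALIDATION
    // List compiled out: accept any non-empty name and leave it to the
    // transcoding service to fail if no converter can be opened.
    return name != 0 && *name != '\0';
#else
    return name != 0 && knownIanaName(name);
#endif
}
```

Building with the macro defined drops the name table from the binary entirely, so unknown names are then rejected only when opening the converter fails.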
