xerces-c-dev mailing list archives

From Rich Taylor <rtay...@webmd.net>
Subject RE: Multilanguage filename
Date Wed, 29 Nov 2000 21:11:35 GMT

I'm not set up to debug this for you, but it sounds (from my porting
experience) like a standard DBCS problem.

Japanese Win98 supports standard SBCS (single-byte character set)
data using the Windows ANSI char set and DBCS (double-byte char set)
using the SHIFT-JIS character set.  Your Japanese characters which
are causing problems are DBCS characters.  (Actually, the Japanese
environment slightly changes the SBCS Windows ANSI charset by
replacing the backslash with the Yen sign - though, it still works
in the o/s as the directory separation character.)

The usual problem dealing with DBCS strings is the fact that the
character count (#chars) is not the same as the byte count (#bytes).
(In the SBCS world, #chars = #bytes.)  So, the developer has to
be very careful about whether counts are characters or bytes.
The strlen() function only counts bytes.  The _mbslen() function
counts characters, but it is a non-standard function in Microsoft's
C Run-time Library and is not necessarily available anywhere else.
If you find this kind of problem, you'll have to consult with Unix
developers to fix it in a portable fashion!

So, either by tracing or through code inspection you're looking for
somewhere that the code calculates one length (bytes or chars) and
then passes it to a routine which is expecting the other length.
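
A minimal sketch of the bytes-versus-characters mismatch, assuming the
usual SHIFT_JIS lead-byte ranges (0x81-0x9F and 0xE0-0xFC).  The helper
names isSjisLeadByte() and sjisCharCount() are made up for illustration;
they are not part of Xerces, ICU, or the Microsoft C Run-time Library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// True if b starts a double-byte SHIFT_JIS character.
// Assumption: lead bytes are 0x81-0x9F and 0xE0-0xFC.
bool isSjisLeadByte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Count characters (not bytes) in a NUL-terminated SHIFT_JIS string.
// A portable stand-in for what _mbslen() does on Windows.
std::size_t sjisCharCount(const char* s) {
    assert(s != nullptr);
    std::size_t count = 0;
    while (*s != '\0') {
        const unsigned char b = static_cast<unsigned char>(*s);
        // A lead byte plus the byte after it form one character.
        // A dangling lead byte at the end is counted as one character.
        s += (isSjisLeadByte(b) && s[1] != '\0') ? 2 : 1;
        ++count;
    }
    return count;
}
```

For a file name made of two double-byte characters plus ".xml", for
example the bytes "\x93\xFA\x96\x7B" ".xml", strlen() reports 8 while
sjisCharCount() reports 6.  Any routine that is handed one of those
numbers while expecting the other is a candidate for the bug.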

A second, more subtle problem with handling SHIFT_JIS character
set filenames under Windows is that this character set allows the
backslash (0x5C) and vertical bar (0x7C) SBCS characters as the
second byte of a few double-byte characters.  If your code is
parsing backwards to find a backslash character in the filepath
(ex. to separate the path from the file name) then you can have
trouble if the last character of the directory is a double-byte
character ending in the 0x5C byte and your algorithm is parsing
by bytes instead of characters.  (The way to set up this bug
is to put files in a directory whose name ends with the double-byte
"So" Katakana character, 0x83 0x5C.)  So (pun?), be sure to parse
through characters instead of bytes to be safe with DBCS.
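
The difference between the two parsing strategies can be sketched as
follows.  This is only an illustration under the same lead-byte
assumption (0x81-0x9F and 0xE0-0xFC); the function names are
hypothetical and this is not how Xerces itself splits paths.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Assumption: SHIFT_JIS lead bytes are 0x81-0x9F and 0xE0-0xFC.
bool isSjisLeadByte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Byte-wise search: finds the last 0x5C byte, even when that byte
// is really the second half of a double-byte character.
const char* lastBackslashByBytes(const char* path) {
    assert(path != nullptr);
    return std::strrchr(path, '\\');
}

// Character-wise search: walks forward one character at a time and
// only accepts 0x5C when it stands alone as a single-byte character.
const char* lastBackslashByChars(const char* path) {
    assert(path != nullptr);
    const char* last = nullptr;
    for (const char* p = path; *p != '\0'; ) {
        if (isSjisLeadByte(static_cast<unsigned char>(*p)) && p[1] != '\0') {
            p += 2;  // skip lead byte and trail byte together
        } else {
            if (*p == '\\') last = p;
            ++p;
        }
    }
    return last;
}
```

With the path bytes "C:\\" "\x83\x5C" (drive root plus a directory
named with the "So" character), the byte-wise version points at the
trail byte of "So" at offset 4, while the character-wise version
points at the real separator at offset 2.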

The final "surprise" I know of is rare but possible (and it took
a while to find).  When converting between Unicode and SHIFT_JIS
there are several characters that are duplicated in the SHIFT_JIS
character set.  So, two different SHIFT_JIS codepoints map
to just one codepoint in Unicode.  The Unicode back to SHIFT_JIS
conversion only converts back to one of the two original
codepoints.  So a "roundtrip" conversion of SJIS->Unicode->SJIS
can actually change some characters.  Microsoft's KnowledgeBase
article Q170559 details the SHIFT_JIS codepoints affected by this.
Good luck!

- Rich Taylor, WebMD

> -----Original Message-----
> From: kbagepalli@informatica.com [mailto:kbagepalli@informatica.com]
> Sent: Wednesday, November 29, 2000 2:54 PM
> To: xerces-c-dev@xml.apache.org
> Subject: RE: Multilanguage filename
> Also this problem happens only on Japanese Win 98. I tried with
> xerces 1.2a & icu 1.4. Also with xerces 1.3 and icu 1.6
> Kiran
> -----Original Message-----
> From: Bagepalli, Kiran
> Sent: Wednesday, November 29, 2000 11:09 AM
> To: xerces-c-dev@xml.apache.org
> Subject: Multilanguage filename
> We have a problem with parsing Japanese XML documents. The problem is with
> the file name not the data itself!
> If the file name is in English (with the data in Japanese - we are using
> icu) it parses fine. However if the file name
> is in Japanese it says that the primary document could not be opened.
> Any clues on what is going wrong.
> Kiran
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
> For additional commands, e-mail: xerces-c-dev-help@xml.apache.org
