apr-dev mailing list archives

From Jonathan Leffler <jonathan.leff...@gmail.com>
Subject Re: Q to unix filesystem developers
Date Thu, 14 Apr 2011 23:00:18 GMT
On Thu, Apr 14, 2011 at 13:04, William A. Rowe Jr. <wrowe@rowe-clan.net> wrote:

> With some multibyte character sets, it may be possible that '/' is one
> byte of a multibyte sequence.  From a Unix perspective, I presume that
> it is always treated as a path separator and never treated as a multibyte
> combination filename character.
> But I just wanted to ask in case anyone is aware of where this might be
> treated as a valid filename character?

Wikipedia on Shift-JIS (http://en.wikipedia.org/wiki/Shift_JIS) says:

*Shift JIS* (also *SJIS*, MIME name *Shift_JIS*) is a character encoding
for the Japanese language originally developed by a Japanese company called
ASCII Corporation in conjunction with Microsoft and standardized as *JIS X
0208 Appendix 1*. It is based on character sets defined within JIS standards
X 0201:1997 (for the single-byte characters) and JIS X 0208:1997 (for the
double byte characters). The lead bytes for the double byte characters are
"shifted" around the 64 halfwidth katakana characters in the single-byte
range 0xA1 to 0xDF. The single-byte characters 0x00 to 0x7F match the ASCII
encoding, except for a yen sign at 0x5C and an overline at 0x7E in place of
the ASCII character set's backslash and tilde respectively. The single-byte
characters from 0xA1 to 0xDF map to the half-width katakana characters found
in JIS X 0201.

Shift JIS requires an 8-bit clean medium for transmission. It is fully
backwards compatible with the legacy JIS X 0201 single-byte encoding,
meaning it supports half-width katakana and that any valid JIS X 0201 string
is also a valid Shift JIS string. For two-byte characters, however, Shift
JIS only guarantees that the first byte will be high bit set (0x80–0xFF);
the value of the second byte can be either high or low. Appearance of byte
values 0x40–0x7E as second bytes of code words makes reliable Shift JIS
detection difficult, because the same codes are used for ASCII characters.
On the other hand, the competing 8-bit format EUC-JP, which does not support
single-byte halfwidth katakana, allows for a much cleaner and direct
conversion to and from JIS X 0208 code points, as all high bit set bytes are
parts of a double-byte character and all codes from ASCII range represent
single-byte characters.

Given that a second byte in the ASCII range is always within 0x40..0x7E
(second para), and / is 0x2F, there shouldn't be a problem with Shift-JIS.
That's not to say there isn't another codeset where there is a problem, but
I don't think it is Shift-JIS, and possibly not any of the main Japanese
codesets.
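
For anyone who wants to check the range argument rather than take the
Wikipedia text on trust, here is a rough Python sketch. It assumes that
Python's built-in "shift_jis" and "cp932" codecs are a fair stand-in for the
codesets in question (they are Python's tables, not any particular Unix
vendor's), and the helper name trail_bytes is just something made up for
this illustration. It encodes every BMP character it can and records the
second byte of each two-byte result, so you can see whether 0x2F ever
turns up.

```python
def trail_bytes(encoding):
    """Collect the second byte of every two-byte sequence the codec emits."""
    seen = set()
    for cp in range(0x10000):  # Basic Multilingual Plane only
        try:
            raw = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            continue  # character is not representable in this codeset
        if len(raw) == 2:
            seen.add(raw[1])
    return seen


if __name__ == "__main__":
    for name in ("shift_jis", "cp932"):
        trails = trail_bytes(name)
        # Report the observed trail-byte range and whether '/' is in it.
        print(name, hex(min(trails)), hex(max(trails)), 0x2F in trails)
```

If 0x2F never shows up as a trail byte, then splitting a path on the raw '/'
byte cannot cut a two-byte character in half for these codecs. The same
loop could be pointed at other codec names to look for a codeset that does
have the problem.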

Jonathan Leffler <jonathan.leffler@gmail.com>  #include <disclaimer.h>
Guardian of DBD::Informix - v2008.0513 - http://dbi.perl.org
"Blessed are we who can laugh at ourselves, for we shall never cease to be
