gump-general mailing list archives

From "Adam R. B. Jack" <aj...@apache.org>
Subject Re: Unicode & Python interacting with File Systems
Date Wed, 06 Oct 2004 17:57:05 GMT
Ok, so a quick test program:

----------------------------------------------------------------------
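import sys
import os

# NOTE: despite the label, sys.getdefaultencoding() is Python's default
# string encoding; sys.getfilesystemencoding() reports the file system one.
print 'Default File System Encoding: ' + sys.getdefaultencoding()

# Byte-string path: listdir returns byte strings.
for name in os.listdir('../../workspace/cvs/xom/data'):
    if name.startswith('r'):
        print 'Non-Unicode : ' + repr(name)

# Unicode path: listdir returns unicode strings where it can decode the name.
for name in os.listdir(u'../../workspace/cvs/xom/data'):
    if name.startswith('r'):
        print 'Unicode : ' + repr(name)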
----------------------------------------------------------------------
Gives:

Default File System Encoding: ascii

Non-Unicode : 'rddltest.html'
Non-Unicode : 'resum\xc3\xa9.xml'
Unicode : u'rddltest.html'
Unicode : 'resum\xc3\xa9.xml'

i.e. listdir returns unicode strings when passed a unicode directory,
except for this one name, which comes back as a plain byte string. As you
see above, Python reports its default encoding as ascii, so presumably
listdir cannot decode this filename and falls back to the raw bytes. Can
the Linux file system not cope with non-ASCII characters, or is "ascii"
simply the wrong encoding for this system? Heck, I can't even ls this file
cleanly (on my terminal):

    ls ../../workspace/cvs/xom/data/resumé.xml
    ../../workspace/cvs/xom/data/resum??.xml
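
To illustrate what I suspect is happening, here is a minimal sketch. It is
written for a current Python 3 interpreter rather than the Python 2 used
above, and the byte string is simply copied from the listdir output. The
name decodes cleanly as UTF-8 but not as ASCII, which would explain a
fallback to the raw string if the decode is attempted with ascii:

```python
import sys

# The raw name exactly as listdir reported it above.
raw = b'resum\xc3\xa9.xml'

# Python keeps two separate settings; they need not agree.
print('default string encoding:', sys.getdefaultencoding())
print('file system encoding   :', sys.getfilesystemencoding())

# Decoding as UTF-8 succeeds: the two bytes \xc3\xa9 become U+00E9.
as_utf8 = raw.decode('utf-8')
print(ascii(as_utf8))

# Decoding as ASCII fails, because \xc3 is outside the 7-bit range.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError as err:
    ascii_ok = False
    print('ascii decode failed:', err)
```

The two encodings printed first will vary by machine and locale, so treat
them as something to check on Brutus, not as a result.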

Is the problem that XOM (on some platform) encodes this filename one way
and checks it in to CVS, and that CVS (on Brutus) then mangles the name
when it checks the file out? Do we have a general problem with CVS or SVN
here?

Anybody have suggestions on where I go next with this? I'd like to solve it
[short and/or long term], but I'd also like to understand where the issue
really is, since we might have a general problem here.
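
One possible short-term workaround, sketched for a current Python 3
interpreter: list the directory as bytes and decode each name explicitly.
The helper names decode_name and list_names are made up for illustration,
and the utf-8 default is an assumption suggested by the \xc3\xa9 bytes
above, not something I have confirmed for the checkout:

```python
import os


def decode_name(raw, encoding='utf-8'):
    """Decode a raw file name (hypothetical helper).

    Bytes that are invalid in `encoding` are kept visible as \\xNN
    escapes instead of raising UnicodeDecodeError.
    """
    return raw.decode(encoding, errors='backslashreplace')


def list_names(directory, encoding='utf-8'):
    """List `directory` as bytes and decode every entry the same way."""
    raw_names = os.listdir(os.fsencode(directory))
    return sorted(decode_name(raw, encoding) for raw in raw_names)


# The problem name from the listing above, decoded under the UTF-8 assumption.
print(ascii(decode_name(b'resum\xc3\xa9.xml')))
```

This only hides the symptom on the Python side; it says nothing about
whether CVS wrote the name that was intended.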

[This seems a good question to post to a group like python@apache.org,
except we don't have one. Ok, whine over... ]

regards,

Adam
----- Original Message ----- 
From: "Sam Ruby" <rubys@apache.org>
To: "Gump code and data" <general@gump.apache.org>
Sent: Tuesday, October 05, 2004 5:18 PM
Subject: Re: Unicode & Python interacting with File Systems


> Adam Jack wrote:
>
> > 1) Is "\xC3\xA9" a 'single' unicode character already? Is it being
> > considered as two by accident?
>
> unicode("\xC3\xA9","utf-8") == u'\xe9'
>
> http://www.unipad.org/unimap/index.php?page=detail&param_char=00E9
>
> - Sam Ruby
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@gump.apache.org
> For additional commands, e-mail: general-help@gump.apache.org
>
>



