From dev-return-37695-apmail-harmony-dev-archive=harmony.apache.org@harmony.apache.org Thu Jul 16 06:28:17 2009 Return-Path: Delivered-To: apmail-harmony-dev-archive@www.apache.org Received: (qmail 76654 invoked from network); 16 Jul 2009 06:28:17 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 16 Jul 2009 06:28:17 -0000 Received: (qmail 14701 invoked by uid 500); 16 Jul 2009 06:29:22 -0000 Delivered-To: apmail-harmony-dev-archive@harmony.apache.org Received: (qmail 14617 invoked by uid 500); 16 Jul 2009 06:29:21 -0000 Mailing-List: contact dev-help@harmony.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@harmony.apache.org Delivered-To: mailing list dev@harmony.apache.org Received: (qmail 14606 invoked by uid 99); 16 Jul 2009 06:29:21 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 16 Jul 2009 06:29:21 +0000 X-ASF-Spam-Status: No, hits=2.2 required=10.0 tests=HTML_MESSAGE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of littlee1032@gmail.com designates 209.85.210.203 as permitted sender) Received: from [209.85.210.203] (HELO mail-yx0-f203.google.com) (209.85.210.203) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 16 Jul 2009 06:29:10 +0000 Received: by yxe41 with SMTP id 41so6575708yxe.20 for ; Wed, 15 Jul 2009 23:28:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:content-type; bh=O475yUp5x7T3xJlNpyn3YSNEO1xXtSKARkaQvnFIGgc=; b=qMtrpiCY345p8gW8mIFEMjmcihN6EJ9lYZstO4t3SI7szfsv0zwy5C6h8093qq1VdN 6kfwkjQBejX8HNcEipYtj7vEo+GELjtyQRa2RPrFfICp2ATu21WeIf+OZf/kIHL4tw5e FJ8rtShofGTmhx6TXSxklGvzSoYuluC5PD8S0= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; 
h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=eZ7r6Ecit4dlDGMU3SHY6aL+y8sd2lcHBftkJzot/C8Jw6oSYdDEgBP25l6cerBA6z xLMauiVzhtSoj9oO79MBaQrNltHvbbODQWhc3JwqGXOlK1hB+k6KYMnsngiNR3DExs2B ClrN8GSnOYQ8cWp/BA3G/3OQhH8y+orBw7w7s= MIME-Version: 1.0 Received: by 10.100.165.13 with SMTP id n13mr9011209ane.18.1247725729157; Wed, 15 Jul 2009 23:28:49 -0700 (PDT) In-Reply-To: <5948b71e0907142312x446dc760of97276a3c69d2e97@mail.gmail.com> References: <5948b71e0907140250h2ba70787mec98fb2295baa5eb@mail.gmail.com> <5948b71e0907140351j4683e031q6861a33a5073eac8@mail.gmail.com> <222B26C1-4029-4832-AC33-DC4C78DBD368@gmail.com> <5948b71e0907140839k58a391fan54f4a477de1bca9c@mail.gmail.com> <4A5D381D.3090709@gmail.com> <70c713190907141928p1e20ca8dt9ffb05b7ab7bca88@mail.gmail.com> <4A5D684F.6010401@gmail.com> <3b3f27c60907142243r2df33a30m37cc82ebc66f03ac@mail.gmail.com> <4A5D6FF8.6080004@gmail.com> <5948b71e0907142312x446dc760of97276a3c69d2e97@mail.gmail.com> Date: Thu, 16 Jul 2009 14:28:49 +0800 Message-ID: <5948b71e0907152328y431556b1i153e9bce355a2d33@mail.gmail.com> Subject: Re: Shall we change our file.encoding From: Charles Lee To: dev@harmony.apache.org Content-Type: multipart/mixed; boundary=0016e640d340dd1440046eccc846 X-Virus-Checked: Checked by ClamAV on apache.org --0016e640d340dd1440046eccc846 Content-Type: multipart/alternative; boundary=0016e640d340dd1436046eccc844 --0016e640d340dd1436046eccc844 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Hi guys, I have add the locale function in the drlvm, the patch is attached. Please try this new patch on the linux. The patch should work on the linux but fail on the windows. Because windows returns code page not charset from the setlocale. 
I have spent a long time trying to get the charset name from the code page, for example: CPINFOEX cpInfoEx; BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx); if (iReturn) { printf("FULL NAME %s\n", cpInfoEx.CodePageName); } But I only get the full name, without any standard format. There is a map of code page identifiers on MSDN (details here). I could hard-code this map in the file, but the note on MSDN says: "ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page." I am afraid a hard-coded map will fail on some machines. (By the way, this again seems to suggest UTF-8 as the default :-) There is also an Encoding class in VC++ (details here), but we cannot use it here. So, does anyone know something about locales on Windows? Again, shall we use UTF-8 as our default? On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee wrote: > That seems we should add it in the drlvm. > > > On Wed, Jul 15, 2009 at 1:58 PM, Regis wrote: > >> Nathan Beyer wrote: >> >>> Is the IBM VME dealing with this correctly? Do we just need to fix DRLVM? >>> >> >> Yes, I only tested on Linux, IBM VME set the property correctly. >> >> >> >>> On Wed, Jul 15, 2009 at 12:25 AM, Regis wrote: >>> >>>> Kevin Zhou wrote: >>>> >>>>> Yea, from luniglob.c, CL attempts to read the "file.encoding" property >>>>> adown >>>>> VM but fails to get the correct encoding. >>>>> >>>>> Regis, do you know any other specific ways that CL can gain the right >>>>> property? >>>>> >>>> We can get from OS directly. Maybe just read env variables on Linux? >>>> >>>> Wed, Jul 15, 2009 at 9:59 AM, Regis wrote: >>>>> >>>>> Charles Lee wrote: >>>>>> >>>>>> Hi Nanthan, >>>>>>> >>>>>>> If the file encoding derive from the OS, it should be the some bugs >>>>>>> in >>>>>>> it >>>>>>> because on my LINUX machine the locale is en_US.UTF-8. 
Our default >>>>>>> codec >>>>>>> is >>>>>>> still ISO8859-1. Do you know where can we found such codes? >>>>>>> >>>>>>> Classlib expected vm do this and set the property, but it didn't, so >>>>>> we >>>>>> have to do this by ourselves. >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer >>>>>>> wrote: >>>>>>> >>>>>>> Are we talking about windows or linux?the default file encoding >>>>>>> should >>>>>>> >>>>>>>> derive from the OS. I believe that's defined by the specs. >>>>>>>> >>>>>>>> Sent from my iPhone >>>>>>>> >>>>>>>> >>>>>>>> On Jul 14, 2009, at 5:51 AM, Charles Lee >>>>>>>> wrote: >>>>>>>> >>>>>>>> On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv >>>>>>> > >>>>>>>> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> Charles, I believe UTF-8 is the default encoding for RI, and it >>>>>>>>>> sounds >>>>>>>>>> reasonable. >>>>>>>>>> BTW, it may encounter some compatibility problem, maybe we need >>>>>>>>>> to >>>>>>>>>> run >>>>>>>>>> more tests to verify? >>>>>>>>>> >>>>>>>>>> 2009/7/14 Charles Lee >>>>>>>>>> >>>>>>>>>> Hi guys: >>>>>>>>>> >>>>>>>>>> I am doing some test cases on the ant junit test case and meeting >>>>>>>>>>> some >>>>>>>>>>> encoding problems. I find they are maybe caused by the different >>>>>>>>>>> default >>>>>>>>>>> encoding from RI and harmony. My local is en_US.UTF-8, RI default >>>>>>>>>>> is >>>>>>>>>>> >>>>>>>>>>> UTF-8 >>>>>>>>>>> >>>>>>>>>> but harmony is 8859-1. And then I have encountered >>>>>>>>>> >>>>>>>>>>> HARMONY-3736>>>>>>>>>> >, >>>>>>>>>>> and the two diffs attached on that issue. It seems we always get >>>>>>>>>>> 8859-1. >>>>>>>>>>> Because: (correct me if wrong :-) >>>>>>>>>>> >>>>>>>>>>> 1. we remove the set code in the vm. we will always get null if >>>>>>>>>>> we >>>>>>>>>>> call >>>>>>>>>>> >>>>>>>>>>> vm >>>>>>>>>>> >>>>>>>>>> method >>>>>>>>>> >>>>>>>>>>> 2. 
we set the file.encode in the libglob.c, if we got null from >>>>>>>>>>> vm, >>>>>>>>>>> we >>>>>>>>>>> >>>>>>>>>>> set >>>>>>>>>>> >>>>>>>>>> Sorry, it should be luniglob.c >>>>>>>>>> >>>>>>>>>> 8859-1. >>>>>>>>> >>>>>>>>>> 3. we can not set file.encode on the run time. >>>>>>>>>>> >>>>>>>>>>> ant use UTF-8 to encode filename which contains the non-ascii >>>>>>>>>>> character. >>>>>>>>>>> So why we use iso8859-1 as our unchangeable default? >>>>>>>>>>> From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says >>>>>>>>>>> "In >>>>>>>>>>> computing >>>>>>>>>>> applications, encodings that provide full UCS support (such as >>>>>>>>>>> UTF-8and >>>>>>>>>>> UTF-16 ) are finding >>>>>>>>>>> increasing >>>>>>>>>>> >>>>>>>>>>> favor >>>>>>>>>>> >>>>>>>>>> over encodings based on ISO 8859-1." Should we simply change >>>>>>>>>> iso8859-1 >>>>>>>>>> >>>>>>>>>>> to >>>>>>>>>>> utf-8? >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Yours sincerely, >>>>>>>>>>> Charles Lee >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>> >>>>>>>>>> Best Regards! >>>>>>>>>> >>>>>>>>>> Jimmy, Jing Lv >>>>>>>>>> China Software Development Lab, IBM >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>> Yours sincerely, >>>>>>>>> Charles Lee >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>> Best Regards, >>>>>> Regis. >>>>>> >>>>>> >>>> -- >>>> Best Regards, >>>> Regis. >>>> >>>> >>> >> >> -- >> Best Regards, >> Regis. >> > > > > -- > Yours sincerely, > Charles Lee > > -- Yours sincerely, Charles Lee --0016e640d340dd1436046eccc844 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Hi guys,

I have added the locale function to DRLVM; the patch is attached. Please try this new patch on Linux.
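For readers who do not want to decode the attachment: locale_drlvm.diff takes the string returned by setlocale(LC_CTYPE, NULL) and keeps the codeset part of language[_territory][.codeset][@modifier]. The following is only a minimal sketch of that parsing step, not the patch itself; the function name codeset_from_locale and the caller-supplied buffer are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the attached patch).
 * Copies the codeset part of a locale name of the form
 *   language[_territory][.codeset][@modifier]
 * into out. Returns 0 on success, -1 if the name has no codeset
 * or the codeset does not fit into out. */
int codeset_from_locale(const char *locale, char *out, size_t out_size)
{
    const char *start;
    const char *end;
    size_t len;

    if (locale == NULL || out == NULL || out_size == 0)
        return -1;

    start = strchr(locale, '.');
    if (start == NULL)
        return -1;               /* e.g. "C" or "en_US": no codeset given */

    start++;                     /* skip the '.' itself */
    end = strchr(start, '@');    /* an optional modifier ends the codeset */
    len = (end != NULL) ? (size_t)(end - start) : strlen(start);

    if (len == 0 || len >= out_size)
        return -1;

    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

In real use the input would be the result of setlocale(LC_CTYPE, NULL) after a call to setlocale(LC_CTYPE, ""). Writing into a caller-supplied buffer avoids allocating a new string for every lookup.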

The patch should work on Linux but fail on Windows, because on Windows setlocale returns a code page rather than a charset name. I have spent a long time trying to get the charset name from the code page, for example:
CPINFOEX cpInfoEx;
BOOL iReturn = GetCPInfoEx(CP_ACP, 0, &cpInfoEx);
if (iReturn) {
    printf("FULL NAME %s\n", cpInfoEx.CodePageName);
}
But I only get the full name, without any standard format.

There is a map of code page identifiers on MSDN (details here). I could hard-code this map in the file, but the note on MSDN says:
     "ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page."
I am afraid a hard-coded map will fail on some machines. (By the way, this again seems to suggest UTF-8 as the default :-)
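To make the hard-coded option concrete, here is a rough sketch of such a map. It is deliberately tiny and only an illustration: the table name, the function name charset_for_code_page, and the choice of entries are assumptions, and a real table would need many more code pages. Unknown identifiers return NULL, so the caller still has to decide on a fallback.

```c
#include <stddef.h>

/* Hypothetical lookup table from Windows code page identifiers
 * to charset names. Only a handful of well-known entries are listed. */
struct cp_entry {
    unsigned int code_page;
    const char *charset;
};

static const struct cp_entry CP_TABLE[] = {
    { 1250,  "windows-1250" },
    { 1251,  "windows-1251" },
    { 1252,  "windows-1252" },
    { 20127, "US-ASCII" },
    { 28591, "ISO-8859-1" },
    { 65001, "UTF-8" },
};

/* Returns the charset name for code_page, or NULL if it is not in the table. */
const char *charset_for_code_page(unsigned int code_page)
{
    size_t i;

    for (i = 0; i < sizeof CP_TABLE / sizeof CP_TABLE[0]; i++) {
        if (CP_TABLE[i].code_page == code_page)
            return CP_TABLE[i].charset;
    }
    return NULL;
}
```

On Windows the identifier could come from GetACP(). The NULL result for unlisted code pages is exactly where the worry above applies: the table cannot cover every machine, so the default still has to be chosen somewhere.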

There is also an Encoding class in VC++ (details here), but we cannot use it here.

So, does anyone know something about locales on Windows?
Again, shall we use UTF-8 as our default?

On Wed, Jul 15, 2009 at 2:12 PM, Charles Lee <littlee1032@gmail.com> wrote:
That seems we should add it in the drlvm.


On Wed, Jul 15, 2009 at 1:58 PM, Regis <xu.regis@gmail.com> wrote:
Nathan Beyer wrote:
Is the IBM VME dealing with this correctly? Do we just need to fix DRLVM?

Yes, I only tested on Linux, IBM VME set the property correctly.



On Wed, Jul 15, 2009 at 12:25 AM, Regis <xu.regis@gmail.com> wrote:
Kevin Zhou wrote:
Yea, from luniglob.c, CL attempts to read the "file.encoding" property
adown
VM but fails to get the correct encoding.

Regis, do you know any other specific ways that CL can gain the right
property?
We can get from OS directly. Maybe just read env variables on Linux?
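A minimal sketch of the environment-variable idea, assuming the usual POSIX precedence (LC_ALL first, then LC_CTYPE, then LANG) and treating an empty value as unset. The function name locale_name_from_env is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns the value of the first locale-related
 * environment variable that is set and non-empty, checked in the order
 * LC_ALL, LC_CTYPE, LANG. Returns NULL if none of them is usable. */
const char *locale_name_from_env(void)
{
    static const char *const names[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        const char *value = getenv(names[i]);
        if (value != NULL && value[0] != '\0')
            return value;
    }
    return NULL;
}
```

The result is a locale name such as en_US.UTF-8, so the codeset still has to be taken from the part after the dot. On POSIX systems, calling setlocale(LC_CTYPE, "") followed by nl_langinfo(CODESET) from <langinfo.h> is another way to get the codeset without parsing the name by hand.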

On Wed, Jul 15, 2009 at 9:59 AM, Regis <xu.regis@gmail.com> wrote:

Charles Lee wrote:

Hi Nathan,

If the file encoding derives from the OS, there must be some bug in it,
because on my Linux machine the locale is en_US.UTF-8. Our default codec is
still ISO8859-1. Do you know where we can find such code?

Classlib expected the VM to do this and set the property, but it didn't, so we
have to do this ourselves.



On Tue, Jul 14, 2009 at 10:17 PM, Nathan Beyer <nbeyer@gmail.com> wrote:

Are we talking about windows or linux? The default file encoding should
derive from the OS. I believe that's defined by the specs.

Sent from my iPhone


On Jul 14, 2009, at 5:51 AM, Charles Lee <littlee1032@gmail.com> wrote:

On Tue, Jul 14, 2009 at 6:12 PM, Jimmy,Jing Lv <firepure@gmail.com>

wrote:

Hi,

Charles, I believe UTF-8 is the default encoding for RI, and it sounds
reasonable.
BTW, it may encounter some compatibility problem, maybe we need to run
more tests to verify?

2009/7/14 Charles Lee <littlee1032@gmail.com>

Hi guys:

I am doing some test cases on the ant junit test case and meeting some
encoding problems. I find they are maybe caused by the different default
encoding from RI and harmony. My local is en_US.UTF-8, RI default is UTF-8
but harmony is 8859-1. And then I have encountered
HARMONY-3736 <https://issues.apache.org/jira/browse/HARMONY-3736>,
and the two diffs attached on that issue. It seems we always get 8859-1.
Because: (correct me if wrong :-)

1. we remove the set code in the vm. we will always get null if we call vm method
2. we set the file.encode in the libglob.c, if we got null from vm, we set 8859-1.

Sorry, it should be luniglob.c

3. we can not set file.encode on the run time.

ant use UTF-8 to encode filename which contains the non-ascii character.
So why we use iso8859-1 as our unchangeable default?
From the wiki http://en.wikipedia.org/wiki/ISO8859-1, it says "In computing
applications, encodings that provide full UCS support (such as
UTF-8 <http://en.wikipedia.org/wiki/UTF-8> and
UTF-16 <http://en.wikipedia.org/wiki/UTF-16>) are finding increasing favor
over encodings based on ISO 8859-1." Should we simply change iso8859-1 to
utf-8?

--
Yours sincerely,
Charles Lee



--

Best Regards!

Jimmy, Jing Lv
China Software Development Lab, IBM



--
Yours sincerely,
Charles Lee


--
Best Regards,
Regis.


--
Best Regards,
Regis.




--
Best Regards,
Regis.



--
Yours sincerely,
Charles Lee




--
Yours sincerely,
Charles Lee

--0016e640d340dd1436046eccc844-- --0016e640d340dd1440046eccc846 Content-Type: text/x-patch; charset=US-ASCII; name="locale_drlvm.diff" Content-Disposition: attachment; filename="locale_drlvm.diff" Content-Transfer-Encoding: base64 X-Attachment-Id: f_fx73dmvv0 SW5kZXg6IHZtL3ZtY29yZS9zcmMvaW5pdC92bV9wcm9wZXJ0aWVzLmNwcAo9PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09Ci0t LSB2bS92bWNvcmUvc3JjL2luaXQvdm1fcHJvcGVydGllcy5jcHAJKHJldmlzaW9uIDc5NDEzMSkK KysrIHZtL3ZtY29yZS9zcmMvaW5pdC92bV9wcm9wZXJ0aWVzLmNwcAkod29ya2luZyBjb3B5KQpA QCAtMTgsNiArMTgsNyBAQAogI2luY2x1ZGUgPGFwcl9maWxlX2lvLmg+CiAjaW5jbHVkZSA8YXBy X2ZpbGVfaW5mby5oPgogI2luY2x1ZGUgPGFwcl9zdHJpbmdzLmg+CisjaW5jbHVkZSA8bG9jYWxl Lmg+CiAjaW5jbHVkZSAicG9ydF9kc28uaCIKICNpbmNsdWRlICJwb3J0X2ZpbGVwYXRoLmgiCiAj aW5jbHVkZSAicG9ydF9zeXNpbmZvLmgiCkBAIC00NCw2ICs0NSw0NSBAQAogICAgIHJldHVybiBz dHI7CiB9CiAKKy8vIEdldCBsb2NhbGUgZnJvbSB0aGUgT1MKK2lubGluZSBjaGFyKiBsb2NhbGUo KSB7CisgIGNoYXIgKiBjb2RlYyA9IE5VTEw7CisgIGNoYXIgKiByZXQgPSBOVUxMOworICBzZXRs b2NhbGUoTENfQ1RZUEUsICIiKTsKKyAgY29kZWMgPSBzZXRsb2NhbGUoTENfQ1RZUEUsIE5VTEwp OworICAvLyBnZXQgY29kZXNldCBmcm9tCisgIC8vIGxhbmd1YWdlW190ZXJyaXRvcnldWy5jb2Rl c2V0XVtAbW9kaWZpZXJdCisgIGNoYXIgKiBsb2NhbGUgPSBuZXcgY2hhcls2NF07CisgIGludCBj dXIgPSAwOworICBib29sIGZsYWcgPSBmYWxzZTsKKyAgd2hpbGUgKCpjb2RlYykgeworICAgIGlm ICghZmxhZykgeworICAgICAgaWYgKCpjb2RlYyAhPSAnLicpIHsKKyAgICAgICAgY29kZWMrKzsK KyAgICAgICAgY29udGludWU7CisgICAgICB9IGVsc2UgeworICAgICAgICBmbGFnID0gdHJ1ZTsK KyAgICAgICAgY29kZWMrKzsKKyAgICAgIH0KKyAgICB9IGVsc2UgeworICAgICAgaWYgKCpjb2Rl YyA9PSAnQCcpIHsKKyAgICAgICAgYnJlYWs7CisgICAgICB9IGVsc2UgeworICAgICAgICBsb2Nh bGVbY3VyKytdID0gKCpjb2RlYyk7CisgICAgICAgIGNvZGVjKys7CisgICAgICB9CisgICAgfQor ICB9CisgIGxvY2FsZVtjdXJdID0gJ1wwJzsKKyAgaWYgKCFsb2NhbGUpIHsKKyAgICByZXQgPSBO VUxMOworICB9IGVsc2UgeworICAgIHJldCA9IGxvY2FsZTsKKyAgfQorICByZXR1cm4gcmV0Owor fQorCisKIC8vIGxvY2FsIG1lbW9yeSBwb29sIGZvciB0ZW1wb3JhcnkgYWxsb2NhdGlvbgogc3Rh 
dGljIGFwcl9wb29sX3QgKnByb3BfcG9vbDsKIApAQCAtMjQwLDYgKzI4MCw3IEBACiAgICAgcHJv cGVydGllcy5zZXRfbmV3KCJvcy5hcmNoIiwgcG9ydF9DUFVfYXJjaGl0ZWN0dXJlKCkpOwogICAg IHByb3BlcnRpZXMuc2V0X25ldygib3MudmVyc2lvbiIsIG9zX3ZlcnNpb24pOwogICAgIHByb3Bl cnRpZXMuc2V0X25ldygiZmlsZS5zZXBhcmF0b3IiLCBQT1JUX0ZJTEVfU0VQQVJBVE9SX1NUUik7 CisgICAgcHJvcGVydGllcy5zZXRfbmV3KCJmaWxlLmVuY29kaW5nIiwgbG9jYWxlKCkpOwogICAg IHByb3BlcnRpZXMuc2V0X25ldygicGF0aC5zZXBhcmF0b3IiLCBQT1JUX1BBVEhfU0VQQVJBVE9S X1NUUik7CiAgICAgcHJvcGVydGllcy5zZXRfbmV3KCJsaW5lLnNlcGFyYXRvciIsIEFQUl9FT0xf U1RSKTsKICAgICAvLyB1c2VyLm5hbWUgaW5pdGlhbGl6YXRpb24sIHRyeSB0byBnZXQgdGhlIG5h bWUgZnJvbSB0aGUgc3lzdGVtCg== --0016e640d340dd1440046eccc846--