axis-java-dev mailing list archives

From Toshiyuki Kimura <to...@apache.org>
Subject Re: UTF8Encoder question...
Date Wed, 19 Jan 2005 09:13:25 GMT
Hi Jongjin,

Let me clarify: would the switch apply only to the Admin Service and
Client, globally to all applications, or to each application separately?

   From an i18n point of view, I hope Axis works correctly with all
languages under the default settings.

Thanks,
Toshi <toshi@apache.org>

On Wed, 19 Jan 2005, Jongjin Choi wrote:

> Hi, Toshi and all.
> 
> I'd like to propose these for backward compatibility:
>   - keep the escaping as default
>   - make a runtime option (axis property in wsdd) for switching to
>     no-escaping.
> 
> The current behavior causes no problem for an application handling the
> SOAP message. I just pointed out that the message can be somewhat
> larger with escaping.
> 
> But in this case, the admin client (AdminClient.java) seems to write
> the content of the SOAP body directly to the console. I think the switch
> can be applied to the Admin Service and Client.
>
> Any thought?
>
> /Jongjin
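
To make the proposal concrete, here is a minimal Java sketch of an encoder with such a switch. The class name SwitchableEncoder and the flag escapeNonAscii are hypothetical and are not Axis APIs, and how the flag would be read from a wsdd property is left out. The sketch covers only the non-ASCII branch (it does not escape XML markup characters such as < or &), and it iterates over code points so that a character outside the BMP yields a single reference; that choice is an assumption of the sketch.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: neither the class nor the flag exists in Axis.
class SwitchableEncoder {
    private final boolean escapeNonAscii;

    SwitchableEncoder(boolean escapeNonAscii) {
        this.escapeNonAscii = escapeNonAscii;
    }

    /** Writes text to a Writer, optionally escaping code points above 0x7F. */
    void writeEncoded(Writer writer, String text) throws IOException {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (escapeNonAscii && cp > 0x7F) {
                // Escaping branch: emit a hexadecimal character reference.
                writer.write("&#x");
                writer.write(Integer.toHexString(cp).toUpperCase());
                writer.write(";");
            } else {
                // As-it-is branch: the Writer's charset handles the bytes.
                writer.write(Character.toChars(cp));
            }
        }
    }

    /** Convenience wrapper that returns the encoded text as a String. */
    String encode(String text) {
        StringWriter out = new StringWriter();
        try {
            writeEncoded(out, text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** Number of bytes the given text occupies once serialized as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "\u51E6\u7406"; // first two characters of the Admin message
        String escaped = new SwitchableEncoder(true).encode(text);
        String asIs = new SwitchableEncoder(false).encode(text);
        System.out.println(escaped + " -> " + utf8Length(escaped) + " bytes");
        System.out.println("as-it-is -> " + utf8Length(asIs) + " bytes");
    }
}
```

For these two characters the escaped form takes 16 bytes on the wire and the as-it-is form takes 6, which is the size difference discussed above.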
>
> ----- Original Message -----
> From: "Toshiyuki Kimura" <toshi@apache.org>
> To: <axis-dev@ws.apache.org>
> Cc: "Changshin Lee" <iasandcb@gmail.com>;
> "Jongjin Choi" <gunsnroz@hotmail.com>
> Sent: Wednesday, January 19, 2005 12:41 PM
> Subject: Re: UTF8Encoder question...
>
>
>> Hi Ias, Jongjin and all,
>>
>>   Sorry for cutting in. I'd like to know the conclusion.
>>
>>   As you may know, I'm now working on i18n for Axis. The Japanese
>> Axis Community has already produced Japanese resources. While
>> testing them, I ran into a UTF-8 encoding problem.
>>
>>   With the latest CVS code, I get an escaped message from the
>> server-side Axis, as follows:
>>
>>   <Admin>&#x51E6;&#x7406;&#x3092;&#x5B9F;&#x884C;&#x3057;&#x307E;
>>   &#x3057;&#x305F;/ [en]-(Done processing)</Admin>
>>
>> instead of
>>
>>   <Admin>[Japanese Message] / [en]-(Done processing)</Admin>
>>
>>  As a side note, I got valid Japanese characters when I applied
>> Jongjin's patch to my local 'UTF8Encoder.java'.
>>
>> Any thought?
>>
>> Regards,
>> Toshi <toshi@apache.org>
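
The two forms of the Admin message differ only in how they look as raw text. As a rough illustration, the following Java sketch parses both with the JDK's built-in DOM parser and compares the resulting element text. The class name CharRefDemo and the helper textOf are made up for this example, and the Japanese string is simply the decoded form of the character references quoted above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Axis.
class CharRefDemo {
    /** Parses a small XML document and returns the root element's text. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // Escaped form, as produced by the current encoder.
        String escaped = "<Admin>&#x51E6;&#x7406;&#x3092;&#x5B9F;&#x884C;"
                + "&#x3057;&#x307E;&#x3057;&#x305F;</Admin>";
        // The same characters written directly (as Java Unicode escapes).
        String literal = "<Admin>\u51E6\u7406\u3092\u5B9F\u884C"
                + "\u3057\u307E\u3057\u305F</Admin>";
        // An XML parser resolves the references, so both texts are equal.
        System.out.println(textOf(escaped).equals(textOf(literal)));
    }
}
```

So a receiving application sees identical content either way; the difference only shows when the raw message is printed, as the admin client appears to do.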
>>
>> On Thu, 30 Dec 2004, Changshin Lee wrote:
>>
>>>> Ias and all,
>>>>
>>>> If you revive the commented-out (and since removed) UTF8Encoder code, that is:
>>>>                     /*
>>>> TODO: Try fixing this block instead of code above.
>>>>                         if (character < 0x80) {
>>>>                             writer.write(character);
>>>>                         } else if (character < 0x800) {
>>>>                             writer.write((0xC0 | character >> 6));
>>>>                             writer.write((0x80 | character & 0x3F));
>>>>                         } else if (character < 0x10000) {
>>>>                             writer.write((0xE0 | character >> 12));
>>>>                             writer.write((0x80 | character >> 6 & 0x3F));
>>>>                             writer.write((0x80 | character & 0x3F));
>>>>                         } else if (character < 0x200000) {
>>>>                             writer.write((0xF0 | character >> 18));
>>>>                             writer.write((0x80 | character >> 12 & 0x3F));
>>>>                             writer.write((0x80 | character >> 6 & 0x3F));
>>>>                             writer.write((0x80 | character & 0x3F));
>>>>                         }
>>>>                         */
>>>> and comment out the current escaping code, all-tests will fail.
>>>> As I mentioned, this code would be necessary for an OutputStream, not a Writer.
>>>> In this case a Writer is used, so the code can simply be rewritten (as in UTF16Encoder) as:
>>>>
>>>> writer.write(character);
>>>>
>>>> I think all-tests will then succeed. (I can't verify this now because all-tests
>>>> currently fails with the current CVS code.)
>>>>
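
The OutputStream-versus-Writer distinction can be sketched in Java as follows. The commented-out block computes UTF-8 bytes, which is only meaningful when the target is a byte stream: Writer.write(int) writes a character, so each computed byte would be treated as a character and encoded a second time by the Writer's charset. The class ManualUtf8 below is a hypothetical helper, not Axis code, and it assumes well-formed input without unpaired surrogates.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Axis.
class ManualUtf8 {
    /** Writes one Unicode code point to a byte stream as UTF-8. */
    static void writeCodePoint(OutputStream out, int cp) throws IOException {
        if (cp < 0x80) {                      // 1 byte: ASCII
            out.write(cp);
        } else if (cp < 0x800) {              // 2 bytes
            out.write(0xC0 | (cp >> 6));
            out.write(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {            // 3 bytes: rest of the BMP
            out.write(0xE0 | (cp >> 12));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        } else {                              // 4 bytes: supplementary planes
            out.write(0xF0 | (cp >> 18));
            out.write(0x80 | ((cp >> 12) & 0x3F));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        }
    }

    /** Encodes a whole string by hand, one code point at a time. */
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                writeCodePoint(out, cp);
                i += Character.charCount(cp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String sample = "A\u00E9\u51E6";
        byte[] byHand = encode(sample);
        byte[] byJdk = sample.getBytes(StandardCharsets.UTF_8);
        // The manual encoder and the JDK's UTF-8 charset should agree.
        System.out.println(Arrays.equals(byHand, byJdk));
    }
}
```

With a Writer that is already bound to UTF-8 (for example an OutputStreamWriter), none of this is needed, which is why a plain writer.write(character) is sufficient there.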
>>>
>>> Could you run all-tests except those that fail chronically (by adding
>>> them to the excluded list)? If the result is clean, I'm for the change (and
>>> it's easy to revert as well, so commit it :-).
>>>
>>>> As for the readability of SOAP messages, I think it is not the responsibility of Axis.
>>>
>>> Human readability is one of the essences of XML (and SOAP). Assuming that
>>> a SOAP processor processes a SOAP input message readable to a user,
>>> then the output of the processing as a form of SOAP must be readable
>>> to the user. Therefore when people use Axis as a SOAP processor, they
>>> will blame Axis for a result containing unreadably broken characters
>>> to them. It's not utterly up to Axis, but Axis can cause it, and Axis
>>> should guarantee that there's no distortion in terms of readability
>>> from Alpha to Omega of SOAP processing.
>>>
>>> Ias
>>>
>>>>
>>>> This is the diff:
>>>> cvs diff -u UTF8Encoder.java
>>>> Index: UTF8Encoder.java
>>>> ===================================================================
>>>> RCS file: /home/cvspublic/ws-axis/java/src/org/apache/axis/components/encoding/UTF8Encoder.java,v
>>>> retrieving revision 1.4
>>>> diff -u -r1.4 UTF8Encoder.java
>>>> --- UTF8Encoder.java 4 Nov 2004 18:23:12 -0000 1.4
>>>> +++ UTF8Encoder.java 30 Dec 2004 01:20:03 -0000
>>>> @@ -82,10 +82,6 @@
>>>>                                  "invalidXmlCharacter00",
>>>>                                  Integer.toHexString(character),
>>>>                                  xmlString));
>>>> -                    } else if (character > 0x7F) {
>>>> -                        writer.write("&#x");
>>>> -                        writer.write(Integer.toHexString(character).toUpperCase());
>>>> -                        writer.write(";");
>>>>                      } else {
>>>>                          writer.write(character);
>>>>                      }
>>>>
>>>>
>>>> /Jongjin
>>>>
>>>> ----- Original Message -----
>>>> From: "Changshin Lee" <iasandcb@gmail.com>
>>>> To: <axis-dev@ws.apache.org>
>>>> Sent: Thursday, December 30, 2004 1:20 AM
>>>> Subject: Re: UTF8Encoder question...
>>>>
>>>>> Ias,
>>>>>
>>>>> Even if we consider systems that can't display the SOAP message well for
>>>>> lack of a Unicode font, I think the default encoding should be as-it-is,
>>>>> not escaping.
>>>>>
>>>>> The SOAP message is not for display, and from a web services toolkit's
>>>>> point of view it is better to generate the more compact SOAP message.
>>>>>
>>>>
>>>> SOAP messages are not for presentation but should be readable :-)
>>>>
>>>>> For display, the application can convert the SOAP message to an
>>>>> appropriate encoding. (As you know, here in Korea we use EUC-KR, and the
>>>>> conversion is possible with a few lines of Java code.)
>>>>> Also, as far as I know, Axis used the as-it-is approach in Axis 1.0 or 1.1.
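
The "few lines of Java code" for such a conversion might look like the sketch below. The class TranscodeDemo and the method utf8ToEucKr are illustrative names only, and the sketch assumes the running JDK ships the EUC-KR charset (standard JDK builds normally do). Characters that EUC-KR cannot represent would be replaced during encoding, so the conversion is lossy in general.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Axis.
class TranscodeDemo {
    /** Re-encodes UTF-8 bytes as EUC-KR by going through a Java String. */
    static byte[] utf8ToEucKr(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(Charset.forName("EUC-KR"));
    }

    public static void main(String[] args) {
        // Two Hangul syllables, written as Unicode escapes.
        byte[] utf8 = "\uD55C\uAE00".getBytes(StandardCharsets.UTF_8);
        byte[] eucKr = utf8ToEucKr(utf8);
        System.out.println(utf8.length + " UTF-8 bytes, "
                + eucKr.length + " EUC-KR bytes");
    }
}
```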
>>>>>
>>>>
>>>> That's a good point. However, we need to pay attention to those who may
>>>> want UTF8Encoder to run the conversion as it does now. If we revert Axis 1.2's
>>>> UTF8Encoder, we should inform users of the regression clearly so as
>>>> not to puzzle them.
>>>>
>>>>> I remember that the reason for using escaping in UTF8Encoder, a few
>>>>> months ago, was to handle French accents and German umlauts. This is
>>>>> reflected in the test.encoding.TestString test case.
>>>>>
>>>>
>>>> The current mechanism came up in April. At that time the source still contained
>>>>
>>>>                     /*
>>>> TODO: Try fixing this block instead of code above.
>>>>                         if (character < 0x80) {
>>>>                             writer.write(character);
>>>>                         } else if (character < 0x800) {
>>>>                             writer.write((0xC0 | character >> 6));
>>>>                             writer.write((0x80 | character & 0x3F));
>>>>                         } else if (character < 0x10000) {
>>>>                             writer.write((0xE0 | character >> 12));
>>>>                             writer.write((0x80 | character >> 6 & 0x3F));
>>>>                             writer.write((0x80 | character & 0x3F));
>>>>                         } else if (character < 0x200000) {
>>>>                             writer.write((0xF0 | character >> 18));
>>>>                             writer.write((0x80 | character >> 12 & 0x3F));
>>>>                             writer.write((0x80 | character >> 6 & 0x3F));
>>>>                             writer.write((0x80 | character & 0x3F));
>>>>                         }
>>>>                         */
>>>>
>>>> but the commented-out part was gone by the 1_2RC2 tag.
>>>>
>>>>> Any thought?
>>>>>
>>>>
>>>> So, what you're saying is that the current UTF8Encoder's behavior
>>>> comes from the test case. In other words, if you change the encoder to
>>>> output "as-it-is", then the test fails. Could we make them consistent,
>>>> I mean, UTF8Encoder outputs without conversion and at the same time
>>>> the case passes?
>>>>
>>>> Ias
>>>>
>>>> P.S. I'd like to hear opinions on changing UTF8Encoder's default
>>>> behavior (and possibly create another encoder or an option for
>>>> conversion). Once we pass all tests with the changed encoder, it is
>>>> worth adopting the change, I believe.
>>>>
>>>>> /Jongjin
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Ias" <iasandcb@hotmail.com>
>>>>> To: <axis-dev@ws.apache.org>
>>>>> Sent: Wednesday, December 29, 2004 1:53 AM
>>>>> Subject: RE: UTF8Encoder question...
>>>>>
>>>>>>
>>>>>> From: Jongjin Choi [mailto:gunsnroz@hotmail.com]
>>>>>> Sent: Tuesday, December 28, 2004 11:56 AM
>>>>>> To: axis-dev@ws.apache.org
>>>>>> Subject: UTF8Encoder question...
>>>>>>
>>>>>>
>>>>>> Dims and all,
>>>>>>
>>>>>> UTF8Encoder writes an escaped string when a character is above 0x7F.
>>>>>> The escaping does not seem to be necessary because
>>>>>> a Writer (not an OutputStream) is used.
>>>>>>
>>>>>> I think this could be just : (line 86)
>>>>>>
>>>>>> writer.write(character);
>>>>>>
>>>>>> instead of : (line 86 ~ 88)
>>>>>> writer.write("&#x");
>>>>>> writer.write(Integer.toHexString(character).toUpperCase());
>>>>>> writer.write(";");
>>>>>>
>>>>>> The escaping just increases the message size.
>>>>>>
>>>>> ias> Yes, it does. However, I think representing a character whose
>>>>> ias> code point is over 0x7F in the form of an &#x XML entity is one of
>>>>> ias> the aims of the encoder, because some systems can't display that
>>>>> ias> character properly when they have no Unicode-wide fonts built in.
>>>>> ias> If it's 100% certain that every node in a messaging system has no
>>>>> ias> problem with "as-it-is" character representation in an XML
>>>>> ias> instance, it must be much more efficient to use a compact encoder,
>>>>> ias> as you pointed out, instead of UTF8Encoder. Interestingly,
>>>>> ias> AbstractXMLEncoder (which is not instantiable) works in such a
>>>>> ias> way. Consequently, it would be a good idea to create a new encoder
>>>>> ias> that optimizes message size and is easy to configure. (Yes, we can
>>>>> ias> recommend it to users dealing with non-Latin character systems :-)
>>>>>>
>>>>>> Happy new year,
>>>>>>
>>>>>> Ias
>>>>>>
>>>>>> P.S. I'm going to switch iasandcb@hotmail.com to iasandcb@gmail.com
>>>>>> (soon, very soon).
>>>>>>
>>>>>>
>>>>>> If the OutputStream is used, the escaping or UTF-8 conversion (which
>>>>>> existed in old UTF8Encoder.java) will be needed.
>>>>>>
>>>>>> Thought?
>>>>>>
>>>>>> /Jongjin
>>>>>>
>>>>>>
>>>>
>>>
>>
