hc-dev mailing list archives

From Adrian Sutton <adr...@intencha.com>
Subject Re: Encoding of special characters in request URI
Date Thu, 10 Jul 2003 22:59:56 GMT
 From RFC2396:

----
    For original character sequences that contain non-ASCII characters,
    however, the situation is more difficult. Internet protocols that
    transmit octet sequences intended to represent character sequences
    are expected to provide some way of identifying the charset used, if
    there might be more than one [RFC2277].  However, there is currently
    no provision within the generic URI syntax to accomplish this
    identification. An individual URI scheme may require a single
    charset, define a default charset, or provide a way to indicate the
    charset used.

    It is expected that a systematic treatment of character encoding
    within URI will be developed as a future modification of this
    specification.
-----

So there's no right answer here.  The IETF seems to be moving towards
using UTF-8 as the international charset, so we may as well use it.
However, I have been unable to find a browser that correctly handles
anything outside the ISO-8859-1 charset; double-byte characters are a
really great way to screw things up.
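
To illustrate why a mismatch hurts, here is a minimal sketch using only
java.net.URLEncoder and java.net.URLDecoder. The class and method names
are made up for illustration and are not part of HttpClient. It encodes
a value with one charset and decodes it with another, as happens when
the two ends of a request assume different charsets.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class, not HttpClient code.
class CharsetMismatchDemo {

    // Form-urlencodes 'value' with senderCharset, then decodes the
    // result with receiverCharset, simulating two ends of a request.
    static String roundTrip(String value, String senderCharset,
                            String receiverCharset) {
        try {
            String wire = URLEncoder.encode(value, senderCharset);
            return URLDecoder.decode(wire, receiverCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u00e4"; // "ä"

        // Both sides agree: the value survives.
        System.out.println(original.equals(
                roundTrip(original, "UTF-8", "UTF-8")));      // true

        // Sender uses UTF-8, receiver assumes ISO-8859-1: the two
        // UTF-8 bytes C3 A4 come back as two characters.
        System.out.println(original.equals(
                roundTrip(original, "UTF-8", "ISO-8859-1"))); // false
    }
}
```

Nothing on the wire says which charset produced the %XX escapes, so the
receiver can only guess.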

So in essence: don't put non-ASCII characters in URLs, because there is
no official way to support them.  We should, however, give it a shot by
using UTF-8, since it is "compatible" with ASCII anyway.
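
As a rough sketch of what "compatible" means in practice, the snippet
below form-urlencodes the same strings under three charsets using the
JDK's java.net.URLEncoder. The class and helper names are hypothetical;
this is not how HttpClient's own formUrlEncode is implemented.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class, not HttpClient code.
class QueryEncodingDemo {

    // Form-urlencodes 'value', converting characters to bytes with
    // the named charset before they are written as %XX escapes.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String umlauts = "\u00e4\u00f6\u00fc"; // "äöü"

        // Plain ASCII is untouched whichever charset is chosen.
        System.out.println(encode("abc", "UTF-8"));        // abc
        System.out.println(encode("abc", "ISO-8859-1"));   // abc

        // Non-ASCII input gives different escapes per charset.
        System.out.println(encode(umlauts, "UTF-8"));      // %C3%A4%C3%B6%C3%BC
        System.out.println(encode(umlauts, "ISO-8859-1")); // %E4%F6%FC

        // US-ASCII cannot represent the umlauts; each one is replaced
        // by '?' and then escaped, so the data is lost.
        System.out.println(encode(umlauts, "US-ASCII"));   // %3F%3F%3F
    }
}
```

The output is pure ASCII in every case; only the bytes behind the
escapes differ, which is why the choice of default matters.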

Regards,

Adrian Sutton.

On Friday, July 11, 2003, at 03:11  AM, Oleg Kalnichevski wrote:

> This is one of many 'shady' areas of the HTTP spec. Basically, there is
> no standard way for the client to tell the server which character
> encoding was used to encode the query parameters. I believe some browsers
> use the 'Accept-Charset' or 'Accept-Language' headers to negotiate the
> locale settings to be used by the server, but I am not sure if these
> headers can be used to determine which character encoding should be used
> to decode URL-encoded data.
>
> I think we definitely should not be using US-ASCII by default. The
> whole point of URL encoding is to escape non-ASCII characters. I
> suggest UTF-8 be used by default.
>
> Oleg
>
>
>
> On Thu, 2003-07-10 at 17:48, Michael Becke wrote:
>> Hello Martin,
>>
>> This is a good question, one that I am not positive I know the answer
>> to.  The HTTP request line (containing the query params) must be
>> US-ASCII.  That I am sure of.  The catch is that form urlencoding
>> strings makes them ASCII, regardless of the original charset.  So
>> HttpMethod.setQueryString(NameValuePair[]) is assuming that the
>> inputs (query params) are ASCII when really only the output (encoded
>> params) should be ASCII.
>>
>> The question is how does one determine, on the client and the server,
>> what the charset of the query params is?  The request charset can be
>> specified with the Content-Type header, but this is meant to apply to
>> the request entity, not the headers.  I have a feeling that we should
>> probably be using the content charset anyway.  My reasoning here is
>> that an HTML form can be sent via a GET (query params) or a POST (post
>> content).  In both cases the content must be form urlencoded, and my
>> feeling is that it should be done the same way for both.
>>
>> What does everyone else think?
>>
>> Mike
>>
>> Martin Schnyder wrote:
>>> When I use the GetMethod class to send text with special characters
>>> (German umlauts "äöü") in the request parameters, the special
>>> characters are not encoded correctly. This happens when I use the method
>>> HttpMethodBase.setQueryString(NameValuePair[] params)
>>> to set the query parameters.
>>>
>>> I saw that Release 2.0 Beta 2 fixed that with bug fix 20481. Special
>>> characters are now encoded differently but still incorrectly, as far
>>> as I can see.
>>>
>>> Method HttpMethodBase.setQueryString(NameValuePair[]) calls
>>> formUrlEncode(params, HttpConstants.HTTP_ELEMENT_CHARSET) to encode
>>> the parameters. The value of HTTP_ELEMENT_CHARSET is US-ASCII. When I
>>> change the charset to HttpConstants.DEFAULT_CONTENT_CHARSET (which is
>>> ISO-8859-1), the German umlauts are encoded correctly. I checked that
>>> with the code in CVS HEAD. Is this a bug, or should only US-ASCII
>>> characters really be supported in a request URI?
>>>
>>> Regards,
>>> Martin Schnyder
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: 
>>> commons-httpclient-dev-unsubscribe@jakarta.apache.org
>>> For additional commands, e-mail: 
>>> commons-httpclient-dev-help@jakarta.apache.org
>>>
>>
>>
>
>

