tomcat-users mailing list archives

From Konstantin Kolinko <knst.koli...@gmail.com>
Subject Re: cookie issue with Tomcat 7 - does not accept the character "é"
Date Mon, 03 Feb 2014 20:25:47 GMT
2014-02-03 André Warnier <aw@ice-sa.com>:
> André Warnier wrote:
>>
>> Chris,
>>
>> a note :
>>
>> Christopher Schultz wrote:
>> ...
>>
>>
>>>
>>> Without quoting, unquoted Cookie names and values may be any US-ASCII
>>> character from 0x20 - 0x7e except for any of ("(" | ")" | "<" | ">" |
>>> "@" | "," | ";" | ":" | "\" | <"> | "/" | "[" | "]" | "?" | "=" | "{"
>>> | "}" | SP | HT). None of the characters above are within that range,
>>> so the cookie value must be quoted. (It looks to me like Cookie names
>>> must always be in US-ASCII... I didn't think that was the case but I'm
>>> not motivated to track-down every word of the spec looking for
>>> justification).
>>>
>>> What is the character encoding of the request? What client are you
>>> using? Who created the cookie in the first place?
>>>
>>
>> I did the tracking down of the (tortuous) specs, and came to this :
>>
>> 1) the ISO-8859-1 character set includes "é" (Catalan and other languages)
>> (*)
>>
>> 2) the US-ASCII character set is a subset of ISO-8859-1, and does not
>> include "é".
>>
>> 3) The default character set for HTTP 1.1 is ISO-8859-1, as stated
>> explicitly and implicitly in various places in RFC 2616 [1].
>>
>> However, RFC 2616 does not define the "Cookie" nor "Set-Cookie" headers,
>> and it also does not specifically indicate which character set should be
>> used for HTTP Request/Response header values. It refers for that to MIME
>> (RFC 822), which talks only about US-ASCII.
>>
>> 4) The "Cookie" and "Set-Cookie" headers seem to be most recently
>> defined in RFC 6265 [2].
>> In section 4.1.1 [3], the syntax of these headers is defined, as :
>>
>>  cookie-pair       = cookie-name "=" cookie-value
>>  cookie-name       = token
>>  cookie-value      = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
>>  cookie-octet      = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
>>                        ; US-ASCII characters excluding CTLs,
>>                        ; whitespace DQUOTE, comma, semicolon,
>>                        ; and backslash
>>  token             = <token, defined in [RFC2616], Section 2.2>
>>
>> Thus, it seems that you are right, and that a cookie *value* can
>> (regrettably still) only consist of US-ASCII characters (not including "é"
>> thus).
>>
>> (I cannot find in the specs a way to quote a non-US-ASCII character
>> either; that's apparently only allowed in parts defined as "comments")
>>
>> (It is stated somewhere else in RFC 6265 that it is recommended to encode
>> the Cookie value via e.g. Base64, if it were to potentially contain non
>> US-ASCII characters).
>>
>> The cookie name is a "token", and the definition of "token" sends us back
>> to RFC2616.
>> In "2.2 Basic Rules", RFC2616 states :
>>
>>       token          = 1*<any CHAR except CTLs or separators>
>>       separators     = "(" | ")" | "<" | ">" | "@"
>>                      | "," | ";" | ":" | "\" | <">
>>                      | "/" | "[" | "]" | "?" | "="
>>                      | "{" | "}" | SP | HT
>> ...
>>       CHAR           = <any US-ASCII character (octets 0 - 127)>
>>       CTL            = <any US-ASCII control character
>>                         (octets 0 - 31) and DEL (127)>
>>
>> So, this all would tend to show that you are right, and that Cookie names
>> (as well as values) can only consist of US-ASCII characters, and that "é" is
>> thus not allowed (without some form of encoding that would represent it as a
>> sequence of US-ASCII characters).
>>
>> Which, in my personal opinion is a lasting p-i-t-a and shame.  And I
>> cannot imagine how it can be nowadays that nobody has yet gotten around to
>> proposing a HTTP 2.0 RFC where the default character set would be Unicode,
>> UTF-8 encoded, for everything excluding maybe header names.  But that's
>> neither here nor there.
>>
>> To get back to the OP's original question, it seems to me that
>> - Tomcat 7.x would not be in violation of the specs if it indeed rejects
>> a Cookie header containing any non-US-ASCII character (whether in the cookie
>> name or value),
>> - the error message could be improved ("é" is not a control
>> character, it's just invalid here),
>> - but the real fix for the OP may be to Base64-encode the cookie
>> value before sending it to the browser.
>> That's also because it may happen one day that even a browser which
>> respects the specs to the letter (one never knows), could reject a value
>> like : "abcé","abc","abc","abc","abc","abc","abc","abc","abc";
>>
>>
>> [1] http://tools.ietf.org/search/rfc2616
>> [2] http://tools.ietf.org/search/rfc6265
>> [3] http://tools.ietf.org/search/rfc6265#section-4.1.1
>>
>>
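
For what it's worth, the two grammar rules quoted above (cookie-octet from RFC 6265 and token from RFC 2616) are small enough to check mechanically. Below is a minimal Java sketch of such a check, together with the Base64 workaround suggested above. The class and method names are made up for illustration, this is not Tomcat's own parser, and java.util.Base64 assumes Java 8 or later. The URL-safe Base64 alphabet is my choice here; the standard alphabet would also pass the cookie-octet rule.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class CookieOctets {
    // RFC 2616 section 2.2 separators, plus SP and HT.
    private static final String SEPARATORS = "()<>@,;:\\\"/[]?={} \t";

    // cookie-octet per RFC 6265 section 4.1.1:
    // %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
    static boolean isCookieOctet(char c) {
        return c == 0x21
            || (c >= 0x23 && c <= 0x2B)
            || (c >= 0x2D && c <= 0x3A)
            || (c >= 0x3C && c <= 0x5B)
            || (c >= 0x5D && c <= 0x7E);
    }

    // True if every character of an unquoted cookie value is a cookie-octet.
    static boolean isValidUnquotedValue(String value) {
        for (int i = 0; i < value.length(); i++) {
            if (!isCookieOctet(value.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // token per RFC 2616: one or more US-ASCII chars, no CTLs, no separators.
    static boolean isToken(String name) {
        if (name.isEmpty()) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c <= 31 || c >= 127 || SEPARATORS.indexOf(c) >= 0) {
                return false;
            }
        }
        return true;
    }

    // Workaround: carry arbitrary text as UTF-8 bytes in unpadded URL-safe Base64.
    static String encodeValue(String raw) {
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    static String decodeValue(String encoded) {
        return new String(Base64.getUrlDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String raw = "abc\u00e9"; // "abcé"
        System.out.println(isValidUnquotedValue(raw));              // false
        System.out.println(isValidUnquotedValue(encodeValue(raw))); // true
        System.out.println(decodeValue(encodeValue(raw)).equals(raw));
    }
}
```

The application has to apply decodeValue() itself when it reads the cookie back; neither the browser nor the container knows that the value is Base64.
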
>
> As an appendix, and triggered by another post to this list, here is another
> way of encoding HTTP header values :
>
> Cookie: ACE_COOKIE=R660302447; TD3World=R760446058
> SM_TRANSACTIONID:
> =?UTF-8?B?MGE2NDA2MDEtNDAzMy01MjdjYzlkMy0wMDBhLTJjMWI0NjJi?=
> SM_AUTHTYPE: =?UTF-8?B?QXV0bw==?=
> SM_SDOMAIN: =?UTF-8?B?LnRveW90YS1ldXJvcGUuY29t?=
>
> In this case, the header values are encoded using a "MIME extension" scheme
> which indicates (between =? ? ?), ahead of a string's value, the character
> set and encoding in which the subsequent string is to be interpreted.
> This is not explicitly mentioned in any of the above references, but as I
> recall, this is part of another series of RFCs, maybe starting at this one:
> http://tools.ietf.org/html/rfc2184
> Now how this works out (also browser-side) with Cookie headers composed of
> cookie names and values, I couldn't say.
>

RFC 2616 also says the following on page 16:

   The TEXT rule is only used for descriptive field contents and values
   that are not intended to be interpreted by the message parser. Words
   of *TEXT MAY contain characters from character sets other than ISO-
   8859-1 [22] only when encoded according to the rules of RFC 2047
   [14].

       TEXT           = <any OCTET except CTLs,
                        but including LWS>

RFC 2047 is also referenced in the Javadoc for HttpServletResponse.setHeader().

The "B" encoding used in the example above is one of the encodings allowed
by RFC 2047; it is described in section 4.1.

http://www.ietf.org/rfc/rfc2047.txt
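
To illustrate, here is a rough Java sketch that decodes a single "B" encoded-word of the form =?charset?B?data?=, as seen in the SM_AUTHTYPE header quoted above. The class and method names are hypothetical. It handles only the "B" encoding (not "Q"), only one encoded-word per call, and it assumes the charset name is one that the JVM knows; Charset.forName() throws otherwise. java.util.Base64 again assumes Java 8 or later.

```java
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EncodedWord {
    // =?charset?B?base64-data?=  (RFC 2047 "B" encoding only)
    private static final Pattern B_WORD =
            Pattern.compile("=\\?([^?]+)\\?[Bb]\\?([^?]*)\\?=");

    // Returns the decoded text, or the input unchanged if it is not a B encoded-word.
    static String decodeB(String word) {
        Matcher m = B_WORD.matcher(word);
        if (!m.matches()) {
            return word;
        }
        Charset charset = Charset.forName(m.group(1));
        byte[] bytes = Base64.getDecoder().decode(m.group(2));
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Header value taken from the example quoted above.
        System.out.println(decodeB("=?UTF-8?B?QXV0bw==?=")); // prints "Auto"
    }
}
```
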

Best regards,
Konstantin Kolinko
