tomcat-users mailing list archives

From André Warnier
Subject Re: [OT] Basic Authentication Failed with multibyte username
Date Sun, 24 Jan 2010 14:22:20 GMT
Christopher Schultz wrote:
> André,
> (Marking OT because, well... just because).
> On 1/22/2010 2:59 PM, André Warnier wrote:
>> Christopher Schultz wrote:
>>> That "authorization.getBytes()" is just asking for trouble, because it
>>> uses the platform default encoding to convert characters to bytes. It
>>> should be using US-ASCII, ISO-8859-1, or something like that.
>> -1
>> I don't think you have a problem there, because what you are decoding
>> into bytes there IS bytes (it is base64-encoded).
> Maybe all character sets have bytes 0-127 the same as US-ASCII, but I
> don't know about some of those I never see myself: Shift-JIS and all
> those Asian encodings, etc. It would be better to be explicit.

With respect, I think you are mistaken here.
Base64 encoding is essentially a method to encode groups of 3 bytes into
groups of 4 bytes, in such a way that no byte in the resulting group
has the high bit set. (Use "octet" instead of "byte" if you prefer.)
Basically, it was created in order to allow 8-bit data to be
sent over a 7-bit channel.
So there is no character set implication at all in either encoding or
decoding :
- to encode, you take each group of 3 bytes, and encode it into a group
of 4 bytes
- to decode, you take each group of 4 bytes, and decode it into a group
of 3 bytes.
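
To make the point concrete, here is a minimal sketch, assuming Java 8 or
later for java.util.Base64 (which postdates this thread; the class and
method names below are made up for illustration). It checks that the
encoded form never has the high bit set, and that turning the token into
bytes gives the same result under several ASCII-compatible charsets.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Illustrative only: class and method names are hypothetical.
class Base64SevenBitDemo {

    /** Returns true if no byte in the array has its high bit set. */
    static boolean isSevenBit(byte[] data) {
        for (byte b : data) {
            if ((b & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Arbitrary 8-bit input, including values above 0x7F.
        byte[] raw = { (byte) 0xC3, (byte) 0xA9, (byte) 0xFF, 0x00, 0x7A, (byte) 0x80 };

        byte[] encoded = Base64.getEncoder().encode(raw);
        String token = new String(encoded, StandardCharsets.US_ASCII);
        System.out.println("token      : " + token);
        System.out.println("7-bit only : " + isSevenBit(encoded));

        // The token consists of ASCII characters only, so these three
        // ASCII-compatible charsets all produce the same bytes from it.
        byte[] viaAscii = token.getBytes(StandardCharsets.US_ASCII);
        byte[] viaLatin1 = token.getBytes(StandardCharsets.ISO_8859_1);
        byte[] viaUtf8 = token.getBytes(StandardCharsets.UTF_8);
        System.out.println("same bytes : "
                + (Arrays.equals(viaAscii, viaLatin1) && Arrays.equals(viaAscii, viaUtf8)));

        // Decoding gives back the original 8-bit data unchanged.
        byte[] decoded = Base64.getDecoder().decode(viaAscii);
        System.out.println("round trip : " + Arrays.equals(raw, decoded));
    }
}
```

This only covers charsets that agree with US-ASCII on bytes 0-127; it
says nothing about a platform default that does not.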

So maybe the "authorization.getBytes()" above is wrong intellectually
(if it implies that "authorization" is some kind of string expressed in
a character set). The Base64-encoded "string" should really be read as 
bytes, because that is what it is.

The next step after the base64-decoding is where it matters : now we 
have an array of bytes with values 0-255, and we have to interpret it 
into a "userid:password" string which /might/ be us-ascii or iso-8859-1, 
but might also be something else.
But it is impossible to know which character set the browser used,
just by examining that series of bytes.  Inherently, nothing
distinguishes a series of bytes from another, and they could just as
well represent an iso-8859-1 string, as an iso-8859-2,3,4,5.. or a UTF-8
string.
You can examine a series of bytes and tell whether it could
be a valid UTF-8 string (because some byte sequences are not possible
under UTF-8).  But even if it could be valid UTF-8, that does not mean
that it is UTF-8; and distinguishing different iso-8859-x byte sequences
from one another is totally impossible.
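
The "could it be valid UTF-8" test can be sketched as follows, using the
standard java.nio.charset decoder in strict mode (the class and method
names are hypothetical). Note that a "true" result only means the bytes
are well-formed UTF-8, not that the sender actually used UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative only: class and method names are hypothetical.
class Utf8Plausibility {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean couldBeUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail on bad input instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] eteUtf8 = { (byte) 0xC3, (byte) 0xA9, 0x74, (byte) 0xC3, (byte) 0xA9 };
        byte[] eteLatin1 = { (byte) 0xE9, 0x74, (byte) 0xE9 };
        System.out.println(couldBeUtf8(eteUtf8));   // well-formed UTF-8
        System.out.println(couldBeUtf8(eteLatin1)); // 0xE9 followed by 0x74 is not
    }
}
```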

Example :
We receive a base64 authorization token which, once it is base64-decoded,
results in the following series of octets shown in hex :
73 63 68 75 6C 74 7A 3A C3 A9 74 C3 A9
If we decode this as being utf-8, we get the string
schultz:été
and we would thus suppose that this userid is "schultz" and his password
is "été".
But if we decide that the origin character set was iso-8859-1, then we
would decode it into
schultz:Ã©tÃ©
and the user would still be "schultz", but his password would be "Ã©tÃ©"
(which would be an equally-valid password).
There is no way to decide in the absolute which decoding is "right",
in the absence of more information.
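
For anyone who wants to reproduce the example, here is a small sketch
(the class name is made up) that decodes the same 13 octets under both
charsets. Neither call fails, which is exactly the problem.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: the class name is hypothetical.
class TwoDecodings {

    // The octets from the example: 73 63 68 75 6C 74 7A 3A C3 A9 74 C3 A9
    static final byte[] OCTETS = {
        0x73, 0x63, 0x68, 0x75, 0x6C, 0x74, 0x7A, 0x3A,
        (byte) 0xC3, (byte) 0xA9, 0x74, (byte) 0xC3, (byte) 0xA9
    };

    /** Interprets the octets as text in the given charset. */
    static String decode(byte[] octets, Charset charset) {
        return new String(octets, charset);
    }

    public static void main(String[] args) {
        String asUtf8 = decode(OCTETS, StandardCharsets.UTF_8);
        String asLatin1 = decode(OCTETS, StandardCharsets.ISO_8859_1);

        // Same bytes, two different and equally plausible strings:
        // 11 characters under UTF-8, 13 under ISO-8859-1.
        System.out.println(asUtf8 + " (" + asUtf8.length() + " chars)");
        System.out.println(asLatin1 + " (" + asLatin1.length() + " chars)");
    }
}
```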

So there are only 2 choices possible :

1) the rules specify that the base64-decoded "userid:password"
string is always encoded using one specific charset.  In the case of
HTTP, this would have to be iso-8859-1.
(And in that case, HTTP Basic Authentication does not allow for
non-iso-8859-1 userid's and passwords, and too bad for 80% of the world.)

2) the rules specify something like :
- if the base64-decoded authorization token does not start with the
iso-8859-1 characters "=?", then it is interpreted as iso-8859-1
- if it starts with "=?" and ends with "?=", then it is interpreted as an
rfc2047-encoded token, to be decoded using the charset indicated after
the leading "=?".
(And user-id's starting with "=?" are forbidden, but that's not a very 
likely case nor a big limitation).
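
A rough sketch of what rule 2 could look like, purely as an illustration
of the idea and not of any existing Tomcat or browser behaviour. It
assumes the whole "userid:password" string is sent as a single RFC 2047
encoded-word of the form =?charset?B?base64-text?=, handles only the "B"
encoding, and treats a token that starts with "=?" but does not end with
"?=" as plain iso-8859-1. All of these are my assumptions, as are the
class and method names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative only: a hypothetical helper, not part of any container.
class Rfc2047CredentialSketch {

    /**
     * Takes the bytes obtained after base64-decoding the Authorization
     * token and returns the "userid:password" string.
     */
    static String interpret(byte[] decodedToken) {
        // ISO-8859-1 maps every byte to exactly one char, so this is lossless.
        String latin1 = new String(decodedToken, StandardCharsets.ISO_8859_1);

        boolean encodedWord = latin1.length() >= 4
                && latin1.startsWith("=?")
                && latin1.endsWith("?=");
        if (!encodedWord) {
            return latin1; // the plain iso-8859-1 case
        }

        // Expected shape between the markers: charset?encoding?encoded-text
        String[] parts = latin1.substring(2, latin1.length() - 2).split("\\?", 3);
        if (parts.length != 3 || !parts[1].equalsIgnoreCase("B")) {
            throw new IllegalArgumentException(
                    "only the B encoding is handled in this sketch");
        }

        Charset charset = Charset.forName(parts[0]);
        byte[] payload = Base64.getDecoder().decode(parts[2]);
        return new String(payload, charset);
    }

    public static void main(String[] args) {
        byte[] utf8Pair = "schultz:\u00e9t\u00e9".getBytes(StandardCharsets.UTF_8);
        String word = "=?UTF-8?B?" + Base64.getEncoder().encodeToString(utf8Pair) + "?=";
        System.out.println(interpret(word.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

A real implementation would also have to deal with the "Q" encoding and
with unknown charset names, which this sketch leaves out.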

So back to Gábor's original problem :

His specific "client" is not a browser, and it allows a user:password 
string to contain non-iso-8859-1 characters, and it encodes it in UTF-8, 
prior to encoding it with base64.

At the Tomcat level :

If Gábor modifies the Tomcat container-managed Basic Authentication 
code, so that it will first base64-decode the token, then convert it to 
a string using UTF-8 encoding, that will work for requests from this 
special client.  But it will break with any other client.

If Gábor can distinguish requests from this special client, from 
requests from standard clients, then he could make the UTF-8 decoding 
conditional on where the request comes from.
If this is done in the container-based Basic Authentication code, then 
it would still result in a non-standard Tomcat, but at least it would 
not break with normal clients.
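
As a sketch of the conditional decoding described above (this is not
Tomcat's actual code, and every name in it is hypothetical): the helper
below decodes a "Basic ..." header value with a charset chosen per
request. How the special client is recognised is not specified in this
thread; the User-Agent prefix "SpecialClient/" is purely an assumption
for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative only: hypothetical helper, not Tomcat's authenticator.
class BasicCredentialDecoder {

    /** Picks the charset for the credentials; the User-Agent test is an assumption. */
    static Charset charsetFor(String userAgent) {
        boolean special = userAgent != null && userAgent.startsWith("SpecialClient/");
        return special ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
    }

    /** Returns { userid, password } from an Authorization header value. */
    static String[] decode(String headerValue, Charset charset) {
        if (headerValue == null || !headerValue.regionMatches(true, 0, "Basic ", 0, 6)) {
            throw new IllegalArgumentException("not a Basic authorization header");
        }
        // Step 1: base64 text -> raw octets (no charset involved yet).
        byte[] raw = Base64.getDecoder().decode(headerValue.substring(6).trim());
        // Step 2: raw octets -> characters; this is where the charset matters.
        String pair = new String(raw, charset);
        int colon = pair.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("missing ':' separator");
        }
        return new String[] { pair.substring(0, colon), pair.substring(colon + 1) };
    }

    public static void main(String[] args) {
        byte[] utf8Pair = "schultz:\u00e9t\u00e9".getBytes(StandardCharsets.UTF_8);
        String header = "Basic " + Base64.getEncoder().encodeToString(utf8Pair);

        String[] special = decode(header, charsetFor("SpecialClient/1.0"));
        String[] standard = decode(header, charsetFor("SomeBrowser/5.0"));
        System.out.println(special[0] + " / " + special[1]);
        System.out.println(standard[0] + " / " + standard[1]);
    }
}
```

The same helper could sit in a servlet filter instead of the container
code, since it depends only on the two header values.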

If Gábor drops the container-based authentication, and uses a servlet 
filter like SecurityFilter (modified the same way), then that would have 
the advantage of keeping a standard Tomcat, and also of working with 
other servlet containers.

But if Gábor can modify the client to first encode the token following 
RFC 2047, and then modify the Tomcat container-based Basic 
Authentication code to handle it as suggested above, then he could 
probably claim the first client/server combination which is totally 
