tomcat-users mailing list archives

From Christopher Schultz
Subject Re: [OT] Basic Authentication Failed with multibyte username
Date Mon, 25 Jan 2010 21:06:35 GMT

On 1/24/2010 9:22 AM, André Warnier wrote:
> Christopher Schultz wrote:
>> Maybe all character sets have bytes 0-127 the same as US-ASCII, but I
>> don't know about some of those I never see myself: Shift-JIS and all
>> those Asian encodings, etc. It would be better to be explicit.
> With respect, I think you are mistaken here.
> Base64 encoding is essentially a method to encode triplets of bytes into
> quadruplets of bytes, in such a way that no byte in the resulting quadruplet
> has the high bit set. (Use "octet" instead of "byte" if it is more
> comfortable).

It's more than that: it uses an explicit set of characters in the
US-ASCII encoding as display. If you were to Base64 encode a string and
then transmit it as EBCDIC, it would look the same to human eyes but
have different underlying byte values (octets, if you prefer).

> Basically, it was created in order to allow 8-bit character data to be
> sent over a 7-bit channel.
> So there is no character set implication at all in either encoding or
> decoding :
> - to encode, you take each group of 3 bytes, and encode it into a group
> of 4 bytes
> - to decode, you take each group of 4 bytes, and decode it into a group
> of 3 bytes.

Actually, I was wrong above: it's not a US-ASCII encoding. Instead, the
byte values are an index into a string of characters, as described in
the reference-less Wikipedia article:

The buffer is then used, six bits at a time, most significant first, as
indices into the string:
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/", and
the indicated character is output.
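
To make the indexing concrete, here is a minimal sketch of that step for one
full group: three octets in, four characters out. The class and method names
are mine, and it deliberately ignores padding, so it is an illustration rather
than a replacement for java.util.Base64.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64IndexSketch {
    // The 64-character alphabet quoted above; each 6-bit value indexes into it.
    static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Encodes exactly three octets into four characters (padding is not handled).
    static String encodeTriple(byte b0, byte b1, byte b2) {
        // Pack the three octets into a 24-bit buffer, most significant first.
        int buffer = ((b0 & 0xFF) << 16) | ((b1 & 0xFF) << 8) | (b2 & 0xFF);
        StringBuilder out = new StringBuilder(4);
        // Read the buffer six bits at a time and look each value up in the alphabet.
        for (int shift = 18; shift >= 0; shift -= 6) {
            out.append(ALPHABET.charAt((buffer >> shift) & 0x3F));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] in = "Man".getBytes(StandardCharsets.US_ASCII);
        String byHand = encodeTriple(in[0], in[1], in[2]);
        String byJdk = Base64.getEncoder().encodeToString(in);
        System.out.println(byHand + " " + byJdk); // prints "TWFu TWFu"
    }
}
```

The output is defined as characters, not octets; which octets go on the wire
depends on how those characters are then serialized.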

So, in EBCDIC, the human reader would be confused :(

> So maybe the "authorization.getBytes()" above is wrong intellectually
> (if it implies that "authorization" is some kind of string expressed in
> a character set). The Base64-encoded "string" should really be read as
> bytes, because that is what it is.

Fair enough, though the above string fits nicely into US-ASCII which,
coincidentally, is the official encoding of HTTP headers :)

> The next step after the base64-decoding is where it matters

I agree, and here's where your arguments fall on deaf ears: each client
does whatever it wants with regard to encoding of this data. The major
web browsers don't even agree on what to do. Since the OP has his own
client (right? or have I gotten confused with one or two other threads
this week), he can do whatever he wants as long as the authentication
mechanism agrees with the client.

> But it is impossible to know which character set the browser used,
> just by examining that series of bytes.

Almost certainly true, although a tight client/server relationship could
include a scheme to indicate the encoding in the value itself. Something
like RFC2047, for instance.

> So there are only 2 choices possible :
> 1) the rules specify that the base64-decoded "userid:password"
> string is always encoded using one specific charset.  In the case of
> HTTP, this would have to be iso-8859-1.
> (And in that case, HTTP Basic Authentication does not allow for
> non-iso-8859-1 userid's and passwords, and too bad for 80% of the world
> population)

I disagree: the spec is unclear about the encoding used before the
Base64 encoding. This is the source of the problem because clients have
decided to take it upon themselves to decide what is best (UTF-8, page
encoding, random encoding, no encoding, etc.).
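
A small sketch of why this matters, assuming two hypothetical clients that
differ only in the charset applied before Base64. The credentials and the
class name are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class CredentialCharsetSketch {
    // What a client does: text -> octets (charset-dependent) -> Base64.
    static String token(String userColonPassword, Charset charset) {
        return Base64.getEncoder().encodeToString(userColonPassword.getBytes(charset));
    }

    // What a server does: Base64 -> octets -> text (charset-dependent again).
    static String decode(String token, Charset charset) {
        return new String(Base64.getDecoder().decode(token), charset);
    }

    public static void main(String[] args) {
        String credentials = "G\u00e1bor:secret"; // hypothetical user, a-acute in the name
        String latin1Token = token(credentials, StandardCharsets.ISO_8859_1);
        String utf8Token = token(credentials, StandardCharsets.UTF_8);

        // Same credentials, different tokens on the wire.
        System.out.println(latin1Token.equals(utf8Token)); // prints "false"

        // A server that guesses ISO-8859-1 for a UTF-8 client sees the a-acute
        // as two characters (U+00C3, U+00A1), so the username no longer matches.
        String misread = decode(utf8Token, StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals(credentials)); // prints "false"
    }
}
```

Nothing in the token itself says which of the two was used.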

> 2) the rules specify something like :
> - if the base64-decoded authorization token does not start with the
> iso-8859-1 characters "=?", then it is interpreted as iso-8859-1 (the
> default)
> - if it starts with "=?" and ends with "?=", then it is interpreted as a
> rfc2047-encoded token, to be decoded using the charset indicated after
> the leading "=?".
> (And user-id's starting with "=?" are forbidden, but that's not a very
> likely case nor a big limitation).

That would be a great implementation, but nobody appears to have done
it. If the OP wants to use this strategy, he'll have to hack Tomcat's
authenticator to accept this type of encoding... or use something like
Securityfilter, again, with a patch to accept this type of encoding.
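
For what it's worth, the decoding side of that proposal could look roughly like
the sketch below. This is not code from Tomcat or Securityfilter; the class and
method names are invented, and it only handles the "B" (Base64) form of an
RFC 2047 encoded-word, not the "Q" form.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Rfc2047StyleCredentialsSketch {
    // Server side: outer Base64 -> text; if it looks like =?charset?B?payload?=
    // decode the payload with that charset, otherwise fall back to ISO-8859-1.
    static String decodeCredentials(String base64Token) {
        byte[] raw = Base64.getDecoder().decode(base64Token);
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        boolean encodedWord =
            text.length() >= 4 && text.startsWith("=?") && text.endsWith("?=");
        if (!encodedWord) {
            return text; // the default case from the proposal
        }
        String[] parts = text.substring(2, text.length() - 2).split("\\?", -1);
        if (parts.length != 3 || !parts[1].equalsIgnoreCase("B")) {
            throw new IllegalArgumentException("unsupported encoded-word: " + text);
        }
        Charset charset = Charset.forName(parts[0]);
        return new String(Base64.getDecoder().decode(parts[2]), charset);
    }

    // Client side, for symmetry: wrap the credentials, then Base64 the wrapper.
    static String encodeCredentials(String userColonPassword, Charset charset) {
        String payload =
            Base64.getEncoder().encodeToString(userColonPassword.getBytes(charset));
        String word = "=?" + charset.name() + "?B?" + payload + "?=";
        return Base64.getEncoder().encodeToString(word.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        String credentials = "G\u00e1bor:secret"; // hypothetical credentials
        String token = encodeCredentials(credentials, StandardCharsets.UTF_8);
        System.out.println(decodeCredentials(token).equals(credentials)); // prints "true"
    }
}
```

Tokens from ordinary clients take the fallback branch, so both kinds of client
could coexist as long as no real username starts with "=?".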

> So back to Gábor's original problem :
> His specific "client" is not a browser, and it allows a user:password
> string to contain non-iso-8859-1 characters, and it encodes it in UTF-8,
> prior to encoding it with base64.

Fortunately, he has control over the client, which is great.

> At the Tomcat level :
> If Gábor modifies the Tomcat container-managed Basic Authentication
> code, so that it will first base64-decode the token, then convert it to
> a string using UTF-8 encoding, that will work for requests from this
> special client.  But it will break with any other client.
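
As a standalone illustration of that decode order (this is not Tomcat's
authenticator code, and the class name is hypothetical): treat the header value
as US-ASCII octets, Base64-decode, and only then apply UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Utf8BasicHeaderSketch {
    // Returns {username, password} from an "Authorization: Basic ..." header value.
    static String[] parse(String headerValue) {
        if (headerValue == null || !headerValue.regionMatches(true, 0, "Basic ", 0, 6)) {
            throw new IllegalArgumentException("not a Basic authorization header");
        }
        // The Base64 text is plain US-ASCII, so be explicit rather than
        // relying on the platform default of getBytes().
        byte[] token = headerValue.substring(6).trim().getBytes(StandardCharsets.US_ASCII);
        // Assumption: the client encoded "user:password" as UTF-8 before Base64.
        String decoded = new String(Base64.getDecoder().decode(token), StandardCharsets.UTF_8);
        int colon = decoded.indexOf(':'); // the password may itself contain ':'
        if (colon < 0) {
            throw new IllegalArgumentException("missing ':' separator");
        }
        return new String[] { decoded.substring(0, colon), decoded.substring(colon + 1) };
    }

    public static void main(String[] args) {
        String token = Base64.getEncoder()
            .encodeToString("G\u00e1bor:se:cret".getBytes(StandardCharsets.UTF_8));
        String[] parts = parse("Basic " + token);
        System.out.println(parts[0].equals("G\u00e1bor") && parts[1].equals("se:cret")); // prints "true"
    }
}
```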


> If Gábor can distinguish requests from this special client, from
> requests from standard clients, then he could make the UTF-8 decoding
> conditional on where the request comes from.
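
One way to sketch that condition, assuming the special client announces itself
in its User-Agent header. The "GaborClient/" prefix and the class name are
invented for illustration; any marker the client and server agree on would do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ClientCharsetSketch {
    // Hypothetical marker sent by the special client in its User-Agent header.
    static final String SPECIAL_CLIENT_PREFIX = "GaborClient/";

    // Picks the charset used to turn the Base64-decoded octets into text.
    static Charset credentialCharsetFor(String userAgent) {
        boolean special = userAgent != null && userAgent.startsWith(SPECIAL_CLIENT_PREFIX);
        // Standard clients keep the ISO-8859-1 default; only the special
        // client gets UTF-8.
        return special ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(credentialCharsetFor("GaborClient/1.0")); // prints "UTF-8"
        System.out.println(credentialCharsetFor("Mozilla/5.0"));     // prints "ISO-8859-1"
    }
}
```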


> If this is done in the container-based Basic Authentication code, then
> it would still result in a non-standard Tomcat, but at least it would
> not break with normal clients.


> If Gábor drops the container-based authentication, and uses a servlet
> filter like SecurityFilter (modified the same way), then that would have
> the advantage of keeping a standard Tomcat, and also of working with
> other servlet containers.


> But if Gábor can modify the client to first encode the token following
> RFC 2047, and then modify the Tomcat container-based Basic
> Authentication code to handle it as suggested above, then he could
> probably claim the first client/server combination which is totally
> spec-compliant.
> ;-)

All permutations of encoding are spec-compliant simply because they
don't violate the spec, which is silent on this issue. Only one part of
this is clear to me: adding the RFC2047-style encoding of the
username:password is almost certainly in violation of the specification,
since it says to Base64-encode "username:password", not
"encode("username:password")" where 'encode' is the RFC2047 strategy.

Just because it's not spec-compliant doesn't mean I think it's not a
good idea, though :)

-chris