tomcat-users mailing list archives

From Kitching Simon <>
Subject RE: off-topic: handling non-ascii characters in URLs
Date Fri, 05 Jan 2001 13:18:34 GMT

> -----Original Message-----
> From:	Birte Glimm []
> Sent:	Friday, January 05, 2001 12:15 PM
> To:
> Subject:	RE: off-topic: handling non-ascii characters in URLs
> True,
> it's the browser that encodes the special chars, I think. I sometimes
> had problems with unencoded URLs in Netscape, but IE always translates
> them correctly.
> Birte Glimm
	[Kitching Simon]  
	The problem is that there are multiple different encoding schemes.
	If IE is "translating them right", then what rules exactly is it
	following?

	Characters are transmitted as bytes (ie a number from 0 to 255);
	in order for two communicating parties to interpret a particular
	byte correctly, they need to agree on what encoding scheme to use -
	either in advance, or by the sending party indicating the encoding
	scheme. I can't find where in the specs it says how to define the
	encoding used for characters in urls.

	As an example, a webserver might interpret url data assuming:
	* urls are always 7-bit-ascii
	* urls are always latin-1
	* urls are always UTF-8
	or there is some way to define the encoding of a url when sending a
	url to a webserver - but I can't see how.

	Note that the byte 0xE9 can mean different things:
	* in 7-bit-ascii, it is invalid
	* in latin-1 it is an e-accent (é)
	* in ISO-8859-5 (cyrillic) it is a different letter entirely
	* in UTF-8, it is interpreted as the first byte of a multi-byte
	sequence

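	The ambiguity of that single byte is easy to demonstrate; here is a
	quick sketch in Python (a modern illustration only - the codec names
	are Python's, not anything in the HTTP specs):

```python
# The single byte 0xE9 means three different things under three charsets.
raw = bytes([0xE9])

print(raw.decode("latin-1"))    # 'é' - e-acute in ISO-8859-1
print(raw.decode("iso8859-5"))  # 'щ' - a Cyrillic letter in ISO-8859-5

# Under UTF-8, 0xE9 announces a three-byte sequence, so on its own
# it is simply invalid input:
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("0xE9 alone is not valid UTF-8")
```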
	In practice, it seems to me that latin-1 (ie ISO-8859-1) is being
	used, ie for those of us who don't use any character outside
	latin-1, we don't have any problems. However, I can't see anywhere
	in the specs that says HTTP-compliant apps must use latin-1. And
	what happens if you want to use non-latin-1 characters in a url, or
	in a form submitted via GET? Examples of languages using characters
	not in latin-1 are turkish, polish, chinese, ...
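	To see what a sender is actually choosing between, here is a quick
	Python sketch (urllib.parse is just a convenient stand-in for
	whatever escaping a browser does internally):

```python
from urllib.parse import quote

# The same character produces different %xx escapes depending on
# which charset the sender happens to assume:
print(quote("é", encoding="latin-1"))  # '%E9'    - one byte in ISO-8859-1
print(quote("é", encoding="utf-8"))    # '%C3%A9' - two bytes in UTF-8

# A Turkish s-cedilla does not exist in latin-1 at all, so a
# latin-1-only convention simply cannot carry it:
try:
    quote("ş", encoding="latin-1")
except UnicodeEncodeError:
    print("ş has no latin-1 escape")
print(quote("ş", encoding="utf-8"))    # '%C5%9F'
```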

	Here is an interesting quote from RFC 2396

	"A URI is a sequence of characters from a very limited set, i.e. the
	letters of the basic Latin alphabet, digits, and a few special
	characters".

	This tends to imply that all non-ascii characters *must* be
	transformed into a %xx form; that's fine (with the implication that
	data sent to a webserver via GET must also be encoded in this way),
	but the %xx is still an index into **some unknown character set**!
	How can the recipient (eg a webserver) know which character set it
	is an index into?
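	From the receiving side, here is a quick Python sketch of the guess
	a server is forced to make when "%E9" arrives (again, urllib.parse
	is only a stand-in for whatever the server framework does):

```python
from urllib.parse import unquote

# The recipient of '%E9' must guess which charset the index refers to:
print(unquote("%E9", encoding="latin-1"))    # 'é'
print(unquote("%E9", encoding="iso8859-5"))  # 'щ'

# Guessing UTF-8 instead, the lone byte is invalid and (with Python's
# default errors='replace') comes back as the U+FFFD replacement char:
print(unquote("%E9", encoding="utf-8"))
```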

	Another quote from RFC 2396:

	"Internet protocols that transmit octet sequences intended to
	represent character sequences are expected to provide some way of
	identifying the charset used, if there might be more than one
	[RFC2277]. However, there is currently no provision within the
	generic URI syntax to accomplish this identification".

	This says clearly that it is the HTTP protocol's responsibility to
	find some way to define the character set used in URLs transmitted
	over HTTP - which leads back to the HTTP RFC, in which I could find
	no such way of defining the charset for URIs in the situation where
	a browser is sending a request to a web server.


	Perhaps someone out there working in Japanese/Chinese/similar can
	give some feedback on this? You must have to deal with this all the
	time...



> -----Original Message-----
> From: Kitching Simon []
> Sent: Freitag, 5. Januar 2001 11:58
> To: ''
> Subject: off-topic: handling non-ascii characters in URLs
> Hi All,
> While following a related thread (RE: a simple test to charset), 
> a question occurred to me about charset encodings in URLs. 
> This isn't really tomcat-related (more to do with HTTP standards) 
> but thought someone here might be able to offer an answer.
> When a webserver sends content to a browser, it can indicate
> the character data format (ascii, latin-1, UTF8, etc) as an http
> header. However, how is the character data type specified for data
> sent *by* a browser *to* a webserver (ie GET or POST action)?
> Andre Alves had an example where an e-accent character
> was part of the URL. I saw that IE4 replaced this character
> with %E9 when submitting a form using GET method, but this
> really assumes that the receiving webserver is using latin-1.
> There is this thing called an "entity-header" defined in the HTTP
> specs, which may contain a "content-encoding" entry. This seems
> to cover POST urls ok then, as the POSTed data is in an entity-body,
> and therefore an entity-header can be used to define its encoding.
> But the URLs themselves cannot have their encoding specified by 
> an entity-header, because they are not in an entity-body. So does
> this mean that all URLs should be restricted to ascii, and forms
> should not use GET method unless their data content is guaranteed
> to be all-ascii? I remember seeing an article recently about domain
> names now being available in asian ideogram characters, which seems
> to indicate otherwise....
> Any comments?
> Cheers,
> Simon
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, email:
