commons-dev mailing list archives

From Rami Ojares <>
Subject Re: [vfs] parsing uri
Date Thu, 10 Mar 2005 21:50:15 GMT
Again quoting the RFC:

For original character sequences that contain non-ASCII characters,
however, the situation is more difficult. Internet protocols that
transmit octet sequences intended to represent character sequences
are expected to provide some way of identifying the charset used, if
there might be more than one [RFC2277].  However, there is currently
no provision within the generic URI syntax to accomplish this
identification. An individual URI scheme may require a single
charset, define a default charset, or provide a way to indicate the
charset used.

It is expected that a systematic treatment of character encoding
within URI will be developed as a future modification of this
specification.

I guess the http scheme sticks to US-ASCII for now.
But maybe, with escapes, you could access pages like ääk.html
on some web servers.
To be honest, I don't know.

Also, I don't know whether the "systematic treatment" has already
happened, or when it will.

So it is up to us to decide how we deal with charsets.

Since VFS is written in Java, it would make sense to first turn
the character sequence into 16-bit Unicode (UTF-16?)
and then encode every character above US-ASCII (7-bit)
or ISO-LATIN-1 (8-bit).
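A minimal sketch of that idea (this is my own illustration, not the VFS implementation): walk the JVM's UTF-16 string and percent-escape every character above US-ASCII, assuming the string only contains characters in the ISO-LATIN-1 range (one byte per character):

```java
public class EscapeNonAscii {
    // Sketch: percent-escape every character above US-ASCII (0x7F),
    // assuming all characters fit in ISO-8859-1 (<= 0xFF).
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) out.append(String.format("%%%02X", (int) c));
            else out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("\u00E4\u00E4k.html")); // %E4%E4k.html
    }
}
```

Characters outside Latin-1 (e.g. Japanese) would need a multi-byte encoding first, which is exactly the unspecified part.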

But this would not make the visual representation of the URI
very nice. According to the URI spec, one should be able to read
a URI out on the radio :-) If you are in Japan, every character would
be encoded and very difficult for the announcer to read.
But if you don't encode, then to a westerner that URI would look like
a sequence of those boxes that represent a character for which there is
no glyph.
Let's get practical.
Someone wrote the following URI in an Ant build file (and some Ant task
uses VFS).
Ant, when reading the string, knows that it is encoded in ISO-LATIN-1,
but the string in the JVM is in Unicode.
Ant gives this string (the URI) to VFS, which encodes all characters
above US-ASCII, so it is now escaped.
Now the WebDAV provider makes an HTTP request, let's say to Tomcat.
The question arises:
can Tomcat (or the WebDAV protocol spec) handle Unicode characters
in resource names?
I don't know.
But maybe the WebDAV provider implementor knows.
So if WebDAV names can only handle US-ASCII, then the provider
can say right away, when asked to canonicalize the URI,
that this is not a proper WebDAV URI.
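The first step of that chain (bytes in one encoding becoming a JVM string) can be sketched like this; the byte value and charset choice are just illustrative assumptions:

```java
import java.nio.charset.StandardCharsets;

public class Latin1ToJvm {
    // Decode bytes read from a build file (assumed ISO-8859-1) into a
    // Java String, which is UTF-16 internally regardless of the source.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xE4, 'k'};       // "äk" as ISO-8859-1 bytes
        String s = decode(raw);
        System.out.println((int) s.charAt(0)); // 228, i.e. U+00E4
    }
}
```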

Or maybe this is not specified,
and some WebDAV servers could handle the URI while others could not.
Maybe the WebDAV provider could then ask the server what it supports.
But maybe there is no single standardized way to ask this.

At this point a sane person starts to give up and thinks: "Whatever!"
Just pass the string along and let the user handle errors.

But let's say that WebDAV can handle ISO-LATIN-1 and
the request is sent to the server.

The server's filesystem uses some other
encoding (EBCDIC?) that maps ö and ä to different numbers.
So in order to do the mapping, the WebDAV server would
need to know which character encoding VFS uses (UTF-16).

But since this is not specified (at least in the RFC I am quoting),
it would probably unescape using its own encoding
and request the wrong resource from its filesystem.
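That failure mode is easy to demonstrate with the JDK's own escaping classes (the charsets here are picked just for illustration): escape with one charset, unescape with another, and the round trip produces a different string.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;

public class CharsetMismatch {
    // Encode with one charset, decode with another: the round trip breaks.
    static String roundTrip(String s, String encodeAs, String decodeAs)
            throws Exception {
        return URLDecoder.decode(URLEncoder.encode(s, encodeAs), decodeAs);
    }

    public static void main(String[] args) throws Exception {
        // "ä" escaped as UTF-8 is %C3%A4; a server assuming ISO-8859-1
        // unescapes that to two characters, "Ã¤".
        System.out.println(roundTrip("\u00E4", "UTF-8", "ISO-8859-1"));
    }
}
```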

This state of affairs makes me wonder whether the standard makers really
want to make standards or just pretend to.
The answer, of course, is that industry wants standards only up to a point,
because confusion and protectionism make the IT business thrive.

That being said, I think one pragmatic approach could be to treat
URI characters as coming from the Unicode character set. When transported,
they would be in US-ASCII, with everything above US-ASCII escaped.

So, to answer your question:
ü = %FC
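Note that even this answer depends on the charset: ü (U+00FC) is the single byte 0xFC in ISO-LATIN-1 but two bytes in UTF-8. The JDK's `java.net.URLEncoder` shows both:

```java
import java.net.URLEncoder;

public class UmlautEscape {
    public static void main(String[] args) throws Exception {
        // One character, two different escapings depending on charset.
        System.out.println(URLEncoder.encode("\u00FC", "ISO-8859-1")); // %FC
        System.out.println(URLEncoder.encode("\u00FC", "UTF-8"));      // %C3%BC
    }
}
```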

But all this is just assuming and making things up.
I guess the decision is in your hands, since you write the code.

- rami
