From owner-new-httpd@hyperreal.com Mon Oct 2 11:14:11 1995
Received: by taz.hyperreal.com (8.6.12/8.6.5) id LAA02264; Mon, 2 Oct 1995 11:14:11 -0700
Received: from austin.bsdi.com by taz.hyperreal.com (8.6.12/8.6.5) with ESMTP id LAA02245; Mon, 2 Oct 1995 11:13:58 -0700
Received: from austin.bsdi.com (sanders#dcbHVm2AojYNVIuF395F7BP3a5MUCSSA#@localhost [127.0.0.1]) by austin.bsdi.com (8.6.12/8.6.12) with ESMTP id NAA26814 for ; Mon, 2 Oct 1995 13:13:43 -0500
Message-Id: <199510021813.NAA26814@austin.bsdi.com>
To: new-httpd@hyperreal.com
Subject: The Encoding Problem (was # in file names...)
In-reply-to: Ben Laurie's message of Sun, 01 Oct 1995 18:12:29 BST.
References: <9510011812.aa03491@gonzo.ben.algroup.co.uk>
From: Tony Sanders
Organization: Berkeley Software Design, Inc.
Date: Mon, 02 Oct 1995 13:13:42 -0500
Sender: owner-new-httpd@apache.org
Precedence: bulk
Reply-To: new-httpd@apache.org

> > > > [# in a directory index]

For reference I've included my encoding functions below.  They are in
perl but are obvious enough that I think anyone can read them.  Now,
some information, history and why ';' is also a reserved character...

WARNING: encode_attribute is tricky.  As the spec says, you must use
SGML entities for escaping markup inside markup attributes (this is how
SGML works -- unless you specify your own escaping scheme of course,
like % does for URLs) but as you might guess, many browsers get this
wrong (including W3's own browsers: www linemode and arena).  Netscape
and Mosaic, however, do work correctly in this respect.

There is a serious problem with the current state of things.
Example -- a simple form fragment:

Now -- let's look at the URL:

    http://whatever/myform?type=foo&amp=bar

And then the user cuts and pastes this URL into a hypertext link:

And guess what!!!  Netscape and Mosaic will correctly (according to the
spec) send to the server something you didn't quite expect:

    GET myform?type=foo&=bar HTTP/1.0

Yikes!
And in this case you cannot escape the & with %26 either, because
that hides the & from the form processing software on the server.  So
you simply *must* use ``&amp;'' here to escape the ``&'' in the form if
you store it as a URL (because HTML doesn't define any alternate
encoding besides % and, as I've shown, this doesn't work in this case).

This is where `;' enters the picture.  The current HTML spec recommends
that the server accept `;' in place of `&' -- this would at least allow
users to portably store form queries in their HREFs (and since this is
server specific there is no harm if my server does and yours doesn't --
users just have to know which works and which doesn't).

Of course, as you already know, if the entity ref isn't valid then
Netscape and Mosaic will send the correct thing.  This is another fine
example of how ``forgiving'' software gets us into a horrible
situation -- if everyone would just stick to the spec and reject all
the crap out there then these kinds of problems would be quickly fixed
as soon as they popped up and we wouldn't be stuck with all the crap we
currently have.

----- cut here -----
# encode unknown data for use in a URL
sub encode_url {
    local($_) = @_;
    # rfc1738 says that ";"|"/"|"?"|":"|"@"|"&"|"=" may be reserved.
    # And % is the escape character so we escape it along with
    # single-quote('), double-quote("), grave accent(`), less than(<),
    # greater than(>), and non-US-ASCII characters (binary data),
    # and white space.  Whew.
    # The low range ends at octal \040 so that space is included.
    s/([\000-\040\;\/\?\:\@\&\=\%\'\"\`\<\>\177-\377])/sprintf('%%%02x',ord($1))/eg;
    $_;
}

# encode unknown data for use in <TITLE>...</TITLE>
sub encode_title {
    # like encode_url but less strict (I couldn't find docs on this)
    local($_) = @_;
    # \000-\037 (octal) covers all of the control characters
    s/([\000-\037\%\&\<\>\177-\377])/sprintf('%%%02x',ord($1))/eg;
    $_;
}

# encode unknown data for use inside markup attributes <MARKUP ATTR="...">
sub encode_attribute {
    # rfc1738 says to use entity references here
    local($_) = @_;
    # no backslash before the & -- inside single quotes it would be
    # emitted literally
    s/([\000-\037\"\'\`\%\&\<\>\177-\377])/sprintf('&#%03d;',ord($1))/eg;
    $_;
}

# encode unknown text data for using as HTML,
# treats ^H as overstrike ala nroff.
sub encode_data {
    local($_) = @_;
    local($str);

    # Escape binary data except for ^H which we process below
    # \375 gets turned into the & for the entity reference
    # (double quotes, so that \375 is the octal character itself;
    # \040-\176 is the printable US-ASCII range)
    s/([^\010\012\015\040-\176])/sprintf("\375#%03d;",ord($1))/eg;

    # Process ^H sequences, we use \376 and \377 (already escaped
    # above) to stand in for < and > until those characters can
    # be properly escaped below.
    s,((_\010.)+),($str = $1) =~ s/.\010//g; "\376I\377$str\376/I\377";,ge;
    s,((.\010.)+),($str = $1) =~ s/.\010//g; "\376B\377$str\376/B\377";,ge;
    s,\376[IB]\377_\376/[IB]\377,,g;
    s/.[\b]//g;    # just do an erase for anything else

    # Escape &, < and >
    s/\&/\&amp\;/g;
    s/\</\&lt\;/g;
    s/\>/\&gt\;/g;

    # Now convert our magic chars into our tag markers
    s/\375/\&/g;
    s/\376/</g;
    s/\377/>/g;
    $_;
}
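
The two ways of storing a form query in an HREF discussed above can be
sketched as follows.  This is only an illustration under stated
assumptions: the form fields are assumed to be named `type' and `amp',
the server is assumed to accept `;' as a separator, and the helper
names url_escape, attr_escape and build_query are hypothetical (they
are standalone stand-ins for encode_url and encode_attribute, with the
low ranges written as \040 and \037).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percent-encode one query name or value (hypothetical helper).
sub url_escape {
    my ($s) = @_;
    $s =~ s/([\000-\040;\/?:\@&=%'"`<>\177-\377])/sprintf('%%%02x', ord($1))/eg;
    return $s;
}

# Replace markup-significant characters with numeric character
# references so the value is safe inside ATTR="..." (hypothetical helper).
sub attr_escape {
    my ($s) = @_;
    $s =~ s/([\000-\037"'`%&<>\177-\377])/sprintf('&#%03d;', ord($1))/eg;
    return $s;
}

# Join name/value pairs into a query string with the given separator.
sub build_query {
    my ($sep, @pairs) = @_;
    my @parts;
    while (my ($name, $value) = splice(@pairs, 0, 2)) {
        push @parts, url_escape($name) . '=' . url_escape($value);
    }
    return join($sep, @parts);
}

# Assumed field names; `amp' collides with the &amp entity name.
my @fields = (type => 'foo', amp => 'bar');

my $with_amp  = 'http://whatever/myform?' . build_query('&', @fields);
my $with_semi = 'http://whatever/myform?' . build_query(';', @fields);

# Option 1: keep `&' as the separator and escape it for the attribute.
print '<A HREF="', attr_escape($with_amp), '">', "\n";
# prints: <A HREF="http://whatever/myform?type=foo&#038;amp=bar">

# Option 2: use `;' as the separator (only if the server accepts it).
print '<A HREF="', attr_escape($with_semi), '">', "\n";
# prints: <A HREF="http://whatever/myform?type=foo;amp=bar">
```

In the first link a conforming browser decodes &#038; back to & before
sending the request, so the server sees type=foo&amp=bar; in the second
there is nothing in the attribute for the browser to decode.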