cocoon-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 27802] - EncodeURLTransformer encodes off site links
Date Fri, 26 Mar 2004 14:32:05 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=27802>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=27802

EncodeURLTransformer encodes off site links





------- Additional Comments From christian.mayrhuber@gmx.net  2004-03-26 14:32 -------
Updated EncodeURLTransformer and ElementAttributeURLMatcher to support
include patterns for URLs.
Default include-url pattern is ".*";
Default exclude-url pattern is 
        "http:.*|https:.*|ftp:.*|#.*|mailto:.*|news:.*|" + 
        "nntp:.*|telnet:.*|prospero:.*|z39.50s:.*|z39.50r:.*|" +
        "cid:.*|mid:.*|vemmi:.*|service:.*|imap:.*|nfs:.*|" +
        "acap:.*|rtsp:.*|tip:.*|pop:.*|data:.*|dav:.*|gopher:.*|" +
        "opaquelocktoken:.*|sip:.*|sips:.*|tel:.*|fax:.*|" +
        "modem:.*|ldap:.*|soap.beep:.*|soap.beeps:.*|afs:.*|" +
        "xmlrpc.beep:.*|xmlrpc.beeps:.*|urn:.*|go:.*|h323:.*|" +
        "ipp:.*|tftp:.*|mupdate:.*|pres:.*|im:.*|wais:.*|" +
        "file:.*|tn3270:.*|mailserver:.*";
The default exclude pattern covers all URL schemes in the IANA registry: http://www.iana.org/assignments/uri-schemes

Sitemap usage:
<map:transformer logger="sitemap.transformer.encodeURL" name="encodeURL"
  src="org.apache.cocoon.transformation.EncodeURLTransformer">
    <exclude-url>http:.*|#.*|myprotocol.*</exclude-url>
    <include-url>.*</include-url>
</map:transformer>

The main change in default behaviour is that EncodeURLTransformer will only
rewrite relative URLs, like "foo/bar/index.xml". It will not rewrite fully
qualified URLs starting with an IANA-registered protocol, nor document fragment
URLs, like "#some-reference".
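
To illustrate the intended behaviour, here is a minimal sketch of the include/exclude decision. It is not the patched Cocoon code: the class name UrlFilterSketch and the method shouldEncode are hypothetical, it uses java.util.regex rather than whatever regexp library the transformer actually uses, the exclude pattern is shortened, and it assumes a URL is rewritten only when it fully matches the include pattern and does not fully match the exclude pattern.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the actual Cocoon class.
final class UrlFilterSketch {
    // Shortened version of the default exclude-url pattern quoted above.
    static final Pattern EXCLUDE =
        Pattern.compile("http:.*|https:.*|ftp:.*|#.*|mailto:.*");

    // Default include-url pattern: every URL is a candidate.
    static final Pattern INCLUDE = Pattern.compile(".*");

    // Assumption: rewrite only if the whole URL matches the include pattern
    // and the whole URL does not match the exclude pattern.
    static boolean shouldEncode(String url) {
        return INCLUDE.matcher(url).matches()
            && !EXCLUDE.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "foo/bar/index.xml",    // relative URL
            "http://www.cnn.com/",  // fully qualified, off-site
            "#some-reference"       // document fragment
        };
        for (String url : samples) {
            System.out.println(url + " -> "
                + (shouldEncode(url) ? "encode" : "leave as-is"));
        }
    }
}
```

Under these assumptions only the relative URL is marked for encoding; the off-site link and the fragment reference are left untouched.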

I am not sure how useful the <include-url> pattern match is; it is there merely for
completeness.

What is it useful for?
Well, I have to support many legacy HTML documents, which have been
published and must not be altered. These documents may contain
links to remote resources. If these links are URL-encoded they stop working,
because the remote side returns a 404 error (document not found).
Example: http://www.cnn.com/;jsessionid=35kjsjkj54kslfjdlkj6l5j6lsjf

A possible workaround would be to transform such links into some private
namespace prior to URL encoding and transform them back to hrefs
after URL encoding.
At least one other person has had the same issue:
http://marc.theaimsgroup.com/?l=xml-cocoon-users&m=107416883114549&w=2

What do you think?
