nutch-dev mailing list archives

From "Chris Fellows (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-18) Windows servers include illegal characters in URLs
Date Wed, 26 Apr 2006 23:24:03 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-18?page=comments#action_12376601 ] 

Chris Fellows commented on NUTCH-18:
------------------------------------

Checking other search engines: Google and Yahoo use decoded display URLs, e.g. en.wiktionary.org/wiki/ç,
whereas AltaVista uses encoded URLs, e.g. en.wiktionary.org/wiki/%C3%A7.

I would say that human-readable, decoded URLs are the way to go, especially since Google
and Yahoo both display them. It's a small item, but it's one that many users will see.

The code that controls this is in search.jsp:

<span class="url"><%=Entities.encode(url)%></span>

I need the decoded forms for my project. If any contributors want the change, I'll submit the
one-file patch for the decoded URLs.
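
For illustration only (this is not the patch itself), here is a rough sketch of how a display URL could be decoded before it is written into the page. The class and method names are made up for this example, and it assumes the escapes are UTF-8, as in the wiktionary example above. It uses the standard java.net.URLDecoder, which treats '+' as a space, so the sketch protects literal '+' characters first.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, not part of Nutch: decodes percent-escapes for display only.
class DisplayUrlSketch {

    static String decodeForDisplay(String url) {
        try {
            // URLDecoder follows form-decoding rules and would turn '+' into a space,
            // so keep literal '+' signs by escaping them before decoding.
            return URLDecoder.decode(url.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            // Malformed escapes (e.g. a stray '%'): fall back to the URL as stored.
            return url;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeForDisplay("en.wiktionary.org/wiki/%C3%A7"));
    }
}
```

The decoded string would still need HTML escaping before output, as Entities.encode(url) does in the search.jsp line above, and the stored, encoded URL should remain the link target.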

If any contributors want the URL completely encoded per RFC 1738 for use in fetching and searching,
I can submit that patch as well. This last item is what I believe this bug was opened
for in the first place, though after the research posted above, it doesn't look like it's required.
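
As a rough idea of what the encoding direction could look like (again a sketch, not the proposed patch), the helper below percent-escapes spaces, control characters, and high-bit characters and leaves all other ASCII untouched, so existing escapes like %20 are not double-encoded. The class and method names are hypothetical. The charset is a parameter because the right choice is an assumption: the example in the original report corresponds to ISO-8859-1 bytes, while the wiktionary example uses UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: escapes characters outside printable ASCII.
class UrlEscapeSketch {

    static String escapeHighBit(String url, Charset charset) {
        StringBuilder out = new StringBuilder(url.length());
        for (int i = 0; i < url.length(); ) {
            int cp = url.codePointAt(i);
            int len = Character.charCount(cp);
            if (cp > 0x20 && cp < 0x7F) {
                // Printable ASCII (including '%') is copied through unchanged.
                out.append((char) cp);
            } else {
                // Space, control, and non-ASCII characters become %XX per encoded byte.
                for (byte b : url.substring(i, i + len).getBytes(charset)) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e7 is c-cedilla, \u00e3 is a-tilde (written as escapes to avoid source-encoding issues).
        String name = "Nota%20tecnica%20-%20Altera\u00e7\u00e3o%20de%20escopo.doc";
        System.out.println(escapeHighBit(name, StandardCharsets.ISO_8859_1));
        System.out.println(escapeHighBit("en.wiktionary.org/wiki/\u00e7", StandardCharsets.UTF_8));
    }
}
```

This deliberately does not touch other ASCII characters that RFC 1738 lists as unsafe (such as '<', '>' or '"'); a complete fix-up step would need to decide how to handle those as well.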

> Windows servers include illegal characters in URLs
> --------------------------------------------------
>
>          Key: NUTCH-18
>          URL: http://issues.apache.org/jira/browse/NUTCH-18
>      Project: Nutch
>         Type: Bug

>   Components: fetcher
>     Reporter: Stefan Groschupf
>     Priority: Minor

>
> Transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=1110243&group_id=59548&atid=491356
> submitted by:
> Ken Meltsner
> While spidering our intranet, I found that IIS may include 
> illegal characters in URLs -- specifically, characters with 
> the high bit set to produce non-English letters. In 
> addition, both Firefox and IE will accept URLs with
> high-bit characters, but Java won't.
> While this may not be Nutch's (or Java's) fault, it would 
> help if high-bit characters (and other illegal characters) 
> in URLs could be escaped (using percent-hex notation) 
> as part of the URL fix-up process, probably right after 
> the hostname lower-case conversion.
> Example document name in Portuguese (with high-bit
> characters) taken from a longer URL:
> Nota%20tecnica%20-%20Alteração%20de%20escopo.doc
> and with percent-escaped characters:
> Nota%20tecnica%20-%20Altera%e7%e3o%20de%20escopo.doc

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

