nutch-dev mailing list archives

From "Ferdy Galema (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1098) better url-normalizer basic
Date Wed, 02 Nov 2011 16:51:33 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142291#comment-13142291 ]

Ferdy Galema commented on NUTCH-1098:
-------------------------------------

@Markus/Radim

I certainly do not want to nitpick about patches, but I think feedback about unnecessary changes
or malformed patches should be given. Of course when applying the patch you could simply ignore
or correct them, but in the end higher quality patches benefit all of us. It just makes the
process of reviewing/editing/committing a lot easier.

@Radim

Do you agree that "better url-normalizer basic" is perhaps overly broad? I can probably think
of tens of other improvements that fall under the scope of a better basic urlnormalizer. Discussing
/ managing them in separate issues is much more efficient than cramming them all into a single
one.

Anyway this is not to undermine the effort of course. Keep up the good work! (And feel free
to disagree)

Cheers!
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.5
>
>         Attachments: patch-urlnormalizer.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features:
> - Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL
> - Ability to decode %33 encoding in URL. This is important for avoiding duplicates
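
For illustration only, the two features described in the issue could be sketched roughly as follows. This is not the attached patch-urlnormalizer.diff, whose contents are not shown here, and it does not use any Nutch API. The class name UrlNormalizeSketch and the method normalize are hypothetical, and the choice to decode only RFC 3986 "unreserved" characters (so that %33 becomes "3" while %2F stays as it is) is an assumption about what a safe decoder would do.

```java
// Hypothetical sketch, not Nutch code: escapes spaces and decodes
// percent-escapes that stand for RFC 3986 unreserved characters.
class UrlNormalizeSketch {

    // Unreserved characters per RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
    private static boolean isUnreserved(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    static String normalize(String url) {
        StringBuilder out = new StringBuilder(url.length());
        int i = 0;
        while (i < url.length()) {
            char c = url.charAt(i);
            if (c == ' ') {
                // Feature 1: a literal space becomes %20.
                out.append("%20");
                i++;
            } else if (c == '%' && i + 2 < url.length()) {
                int hi = Character.digit(url.charAt(i + 1), 16);
                int lo = Character.digit(url.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    char decoded = (char) (hi * 16 + lo);
                    if (isUnreserved(decoded)) {
                        // Feature 2: %33 -> "3", %41 -> "A", and so on.
                        out.append(decoded);
                    } else {
                        // Reserved or non-ASCII bytes keep their escape,
                        // since decoding them could change the URL's meaning.
                        out.append(url, i, i + 3);
                    }
                    i += 3;
                } else {
                    // '%' not followed by two hex digits: copy it unchanged.
                    out.append(c);
                    i++;
                }
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, two URLs that differ only in whether an unreserved character is escaped normalize to the same string, which is the duplicate-avoidance effect the issue asks for.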

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
