incubator-any23-dev mailing list archives

From "Lewis John McGibbney (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ANY23-55) any23 is not following the redirection
Date Wed, 07 Mar 2012 11:04:57 GMT

    [ https://issues.apache.org/jira/browse/ANY23-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224203#comment-13224203 ]

Lewis John McGibbney commented on ANY23-55:
-------------------------------------------

Hi Szymon. I don't know which regex pattern is being used here, but over in Nutch we
- skip URLs containing certain characters that indicate probable queries etc., e.g. ?*!@= (a possible
source of the problem, since the final URL in the chain contains both '?' and '=')
- skip URLs with slash-delimited segments that repeat 3+ times, to break loops, e.g. .*(/[^/]+)/[^/]+\1/[^/]+\1/
(doesn't look like a probable cause)
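
As a rough illustration of the two rules above, here is a minimal Python sketch. It is not Nutch's code: the names QUERY_CHARS, REPEATED_SEGMENT and should_skip are made up for this example, the example.org URL is hypothetical, and I am assuming the patterns are applied as "find anywhere in the URL".

```python
import re

# Rule 1: characters that usually indicate a query or similar dynamic URL.
QUERY_CHARS = re.compile(r"[?*!@=]")

# Rule 2: a slash-delimited segment that shows up three or more times,
# which is a typical sign of a crawler loop.
REPEATED_SEGMENT = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def should_skip(url: str) -> bool:
    """Return True if either exclusion rule matches somewhere in the URL."""
    return bool(QUERY_CHARS.search(url) or REPEATED_SEGMENT.search(url))


if __name__ == "__main__":
    urls = (
        "http://purl.obolibrary.org/obo/IAO_0000030",
        "http://www.ontobee.org/browser/rdf.php?o=IAO&iri=http://purl.obolibrary.org/obo/IAO_0000030",
        "http://example.org/a/b/a/c/a/",  # hypothetical looping URL
    )
    for url in urls:
        print("skip" if should_skip(url) else "keep", url)
```

With rules like these, the last hop of the chain in the issue (the ontobee.org URL) would be filtered out by the first rule, while the earlier hops would pass.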

Also, Nutch defaults to zero for the maximum number of redirects the fetcher will follow
when trying to fetch a page. If the value is negative or 0, the fetcher won't follow redirected
URLs immediately; instead it records them for later fetching.
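
To make the redirect-limit behaviour concrete, here is a small Python sketch that replays the chain from the issue against a lookup table instead of the network. The function fetch, its max_redirects parameter and the CHAIN table are hypothetical names for this example, not Nutch or Any23 APIs. What happens once a positive limit is used up is also an assumption: the sketch simply stops and returns the last redirect response.

```python
from typing import Dict, List, Optional, Tuple

REDIRECT_CODES = (301, 302, 303, 307, 308)

# Simulated responses copied from the issue: url -> (status, Location header).
CHAIN: Dict[str, Tuple[int, Optional[str]]] = {
    "http://purl.obolibrary.org/obo/IAO_0000030":
        (302, "http://www.berkeleybop.org/ontologies/IAO_0000030"),
    "http://www.berkeleybop.org/ontologies/IAO_0000030":
        (303, "http://purl.obolibrary.org/obo/IAO/about/IAO_0000030"),
    "http://purl.obolibrary.org/obo/IAO/about/IAO_0000030":
        (302, "http://www.ontobee.org/browser/rdf.php?o=IAO&iri=http://purl.obolibrary.org/obo/IAO_0000030"),
    "http://www.ontobee.org/browser/rdf.php?o=IAO&iri=http://purl.obolibrary.org/obo/IAO_0000030":
        (200, None),
}


def fetch(url: str, max_redirects: int,
          responses: Dict[str, Tuple[int, Optional[str]]] = CHAIN,
          ) -> Tuple[int, str, List[str]]:
    """Return (status, last_url, deferred_urls) for a simulated fetch."""
    deferred: List[str] = []
    followed = 0
    while True:
        status, location = responses[url]
        if status not in REDIRECT_CODES or location is None:
            return status, url, deferred
        if max_redirects <= 0:
            # Do not follow now; remember the target for a later fetch round.
            deferred.append(location)
            return status, url, deferred
        if followed >= max_redirects:
            # Assumption of this sketch: give up once the limit is reached.
            return status, url, deferred
        url = location
        followed += 1


if __name__ == "__main__":
    start = "http://purl.obolibrary.org/obo/IAO_0000030"
    for limit in (0, 2, 3):
        print(limit, fetch(start, limit))
```

The chain in the issue has three redirects, so in this sketch a limit of 3 or more is needed to reach the final page in one go; with a limit of 0 only the first target is recorded for later.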

Can you confirm what kind of implementation the Sindice crawler is using? Does it use
the Any23 basic-crawler or some other implementation?
                
> any23 is not following the redirection
> --------------------------------------
>
>                 Key: ANY23-55
>                 URL: https://issues.apache.org/jira/browse/ANY23-55
>             Project: Apache Any23
>          Issue Type: Bug
>         Environment: version 0.6.2-SNAPSHOT deployed currently at any23.org
>            Reporter: Szymon Danielczyk
>
> here is a redirection pattern 
> http://purl.obolibrary.org/obo/IAO_0000030  
> -> 302   Location=http://www.berkeleybop.org/ontologies/IAO_0000030
> http://www.berkeleybop.org/ontologies/IAO_0000030  
> -> 303 Location=http://purl.obolibrary.org/obo/IAO/about/IAO_0000030
> http://purl.obolibrary.org/obo/IAO/about/IAO_0000030 
> -> 302  Location=http://www.ontobee.org/browser/rdf.php?o=IAO&iri=http://purl.obolibrary.org/obo/IAO_0000030
> http://www.ontobee.org/browser/rdf.php?o=IAO&iri=http://purl.obolibrary.org/obo/IAO_0000030
> -> 200 this is the final correct page
> Any23 reports no matching extractor found 
> for http://purl.obolibrary.org/obo/IAO_0000030 
> - probably it can not follow a redirection on some stage 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
