incubator-any23-dev mailing list archives

From Michele Mostarda <michele.mosta...@gmail.com>
Subject Re: [jira] [Commented] (ANY23-55) any23 is not following the redirection
Date Thu, 08 Mar 2012 16:33:46 GMT
On 7 March 2012 12:04, Lewis John McGibbney (Commented) (JIRA) <
jira@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/ANY23-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224203#comment-13224203]
>
> Lewis John McGibbney commented on ANY23-55:
> -------------------------------------------
>
> Hi Szymon. I don't know about the regex pattern being used here, but over
> in Nutch we
> - skip URLs containing certain characters as probable queries, etc. e.g.
> ?*!@= (possibly source of problem)
> - skip URLs with slash-delimited segments that repeat 3+ times, to break
> loops e.g. .*(/[^/]+)/[^/]+\1/[^/]+\1/ (doesn't look probable)
>
> Also we default to a zero value for the maximum number of redirects the
> fetcher will follow when trying to fetch a page. If set to negative or 0,
> fetcher won't immediately follow redirected URLs, instead it will record
> them for later fetching.
>
> Can you confirm what kind of implementation the Sindice crawler is using,
> does it utilise the Any23 basic-crawler or does it use some other
> implementation?
>

Hi guys,

  if I understood the issue correctly, there are no crawler problems here
(the basic-crawler is a plugin and is not involved in the Any23 library core
workflow). The problem is related to the configuration of the Commons HTTP
client used to implement the org.apache.any23.http.DefaultHTTPClient class,
which is involved in resolving input data specified as a URL.
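
To make the point concrete, here is a minimal sketch of how a maximum-redirects
setting decides whether a chain like the one in the issue gets resolved. It
does not use the Any23 or Commons HTTP client APIs and makes no network calls:
the class RedirectChainSketch, the follow method and the maxRedirects
parameter are hypothetical names, and the redirect table is simply copied from
the issue report.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: simulates redirect following over an in-memory table. */
public class RedirectChainSketch {

    /** Hypothetical result holder: URLs visited, and whether a final page was reached. */
    static final class Result {
        final List<String> visited;
        final boolean resolved;

        Result(List<String> visited, boolean resolved) {
            this.visited = visited;
            this.resolved = resolved;
        }
    }

    /**
     * Follows Location targets starting at 'start'. A URL that is absent from
     * 'redirects' is treated as a final (200) page. Gives up when the hop limit
     * is reached or when a URL is seen twice (redirect loop).
     */
    static Result follow(String start, Map<String, String> redirects, int maxRedirects) {
        List<String> visited = new ArrayList<>();
        String current = start;
        visited.add(current);
        int hops = 0;
        while (redirects.containsKey(current)) {
            if (hops >= maxRedirects) {
                return new Result(visited, false); // limit hit before the final page
            }
            String next = redirects.get(current);
            if (visited.contains(next)) {
                return new Result(visited, false); // loop detected
            }
            visited.add(next);
            current = next;
            hops++;
        }
        return new Result(visited, true);
    }

    /** The redirect chain reported in ANY23-55, as URL -> Location pairs. */
    static Map<String, String> issueChain() {
        Map<String, String> chain = new LinkedHashMap<>();
        chain.put("http://purl.obolibrary.org/obo/IAO_0000030",
                  "http://www.berkeleybop.org/ontologies/IAO_0000030");
        chain.put("http://www.berkeleybop.org/ontologies/IAO_0000030",
                  "http://purl.obolibrary.org/obo/IAO/about/IAO_0000030");
        chain.put("http://purl.obolibrary.org/obo/IAO/about/IAO_0000030",
                  "http://www.ontobee.org/browser/rdf.php?o=IAO&iri=http://purl.obolibrary.org/obo/IAO_0000030");
        return chain;
    }

    public static void main(String[] args) {
        String start = "http://purl.obolibrary.org/obo/IAO_0000030";
        for (int limit : new int[] {0, 2, 5}) {
            Result r = follow(start, issueChain(), limit);
            System.out.println("limit=" + limit + " resolved=" + r.resolved
                    + " redirects followed=" + (r.visited.size() - 1));
        }
    }
}
```

The chain in the issue needs three hops, so in this sketch any limit below
three stops before the final page. Whether that is what actually happens in
DefaultHTTPClient is an assumption that still has to be checked against its
configuration.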


>
> > any23 is not following the redirection
> > --------------------------------------
> >
> >                 Key: ANY23-55
> >                 URL: https://issues.apache.org/jira/browse/ANY23-55
> >             Project: Apache Any23
> >          Issue Type: Bug
> >         Environment: version 0.6.2-SNAPSHOT deployed currently at any23.org
> >            Reporter: Szymon Danielczyk
> >
> > here is a redirection pattern
> > http://purl.obolibrary.org/obo/IAO_0000030
> > -> 302   Location=http://www.berkeleybop.org/ontologies/IAO_0000030
> > http://www.berkeleybop.org/ontologies/IAO_0000030
> > -> 303 Location=http://purl.obolibrary.org/obo/IAO/about/IAO_0000030
> > http://purl.obolibrary.org/obo/IAO/about/IAO_0000030
> > -> 302  Location=http://www.ontobee.org/browser/rdf.php?o=IAO&iri=http://purl.obolibrary.org/obo/IAO_0000030
> > http://www.ontobee.org/browser/rdf.php?o=IAO&iri=http://purl.obolibrary.org/obo/IAO_0000030
> > -> 200 this is the final correct page
> > Any23 reports no matching extractor found
> > for http://purl.obolibrary.org/obo/IAO_0000030
> > - probably it can not follow a redirection on some stage
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators:
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>


-- 
Michele Mostarda
Senior Software Engineer
skype: michele.mostarda
twitter: micmos
mail: me@michelemostarda.com
site : http://www.michelemostarda.com
