manifoldcf-dev mailing list archives

From "Denis Beck (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CONNECTORS-1155) Web connector should not be sending the port number in request header Host
Date Thu, 29 Jan 2015 15:26:35 GMT

     [ https://issues.apache.org/jira/browse/CONNECTORS-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Beck updated CONNECTORS-1155:
-----------------------------------
    Summary: Web connector should not be sending the port number in request header Host  (was:
Web connector should not be sending the port in request header Host)

> Web connector should not be sending the port number in request header Host
> --------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1155
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1155
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 1.7.2
>            Reporter: Denis Beck
>
> The web connector sends the port in the request header Host (e.g. Host: www.apache.org:443).
This causes redirect rules for the host name to fail. The port should not be part of the Host
header.
> On the other hand RFC 2616 section 14.23 (http://tools.ietf.org/html/rfc2616#section-14.23)
says “The Host request-header field specifies the Internet host and port number of the resource
being requested [...]”.
> I encountered this issue while trying to crawl a customer’s website. The very first
call to the seed URL caused a redirect which contained a link to the original URL itself and
the job ended without fetching anything. The Simple History showed Status 301, that's it.
Maybe the web connector does not follow the link in the redirect correctly?
> The redirect could not be triggered any other way: I tried a browser and cURL. ManifoldCF's
web connector was the only client sending the port in the Host header, and it was unable to crawl
the website because of this behavior.
> This issue could be worked around by collaborating with the contractor who hosted the
customer's website. He added an exception for these requests. But in general, I think this
should be fixed, as such collaboration is not always possible.
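
The behavior requested in the issue can be illustrated with a small sketch. It is not ManifoldCF's code (the connector is written in Java); it is a hypothetical Python helper, `host_header`, that builds a Host value and leaves out the port when it is the scheme's default. This stays within RFC 2616 section 14.23, which also says that a host without trailing port information implies the default port for the requested service. The `DEFAULT_PORTS` table and the function name are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Assumed default ports for the two schemes a web crawler normally handles.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_header(url: str) -> str:
    """Return a Host header value for url, omitting the scheme's default port."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # urlsplit strips the brackets from IPv6 literals; restore them for the header.
    if ":" in host:
        host = f"[{host}]"
    port = parts.port
    # No explicit port, or the default port for the scheme: send the host alone.
    if port is None or port == DEFAULT_PORTS.get(parts.scheme):
        return host
    # A non-default port must stay in the header so the server can route correctly.
    return f"{host}:{port}"


if __name__ == "__main__":
    for url in ("https://www.apache.org:443/", "http://www.apache.org:8080/"):
        print(url, "->", host_header(url))
```

With this approach, `https://www.apache.org:443/` yields `www.apache.org`, matching what browsers and cURL send, while a non-default port such as 8080 is still included.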



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
