manifoldcf-dev mailing list archives

From "Denis Beck (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CONNECTORS-1155) Web connector should not be sending the port in request header Host
Date Thu, 29 Jan 2015 15:25:40 GMT
Denis Beck created CONNECTORS-1155:
--------------------------------------

             Summary: Web connector should not be sending the port in request header Host
                 Key: CONNECTORS-1155
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1155
             Project: ManifoldCF
          Issue Type: Bug
          Components: Web connector
    Affects Versions: ManifoldCF 1.7.2
            Reporter: Denis Beck


The web connector sends the port in the request header Host (e.g. Host: www.apache.org:443).
This causes redirect rules for the host name to fail. The port should not be part of the Host
header.

On the other hand, RFC 2616 section 14.23 (http://tools.ietf.org/html/rfc2616#section-14.23)
says “The Host request-header field specifies the Internet host and port number of the resource
being requested [...]”.
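
The same RFC section also states that a host without any trailing port information implies
the default port for the service requested, so leaving out a default port is equally valid.
A minimal sketch of that rule follows. It is only an illustration, not the web connector's
actual code: the class HostHeaderSketch and its methods are hypothetical names, and only
the http and https default ports are assumed.

```java
import java.net.URI;

// Hypothetical helper, not part of ManifoldCF: builds a Host header value
// that omits the port when it is the default for the URL's scheme.
final class HostHeaderSketch {

    // Default port for the two schemes a web crawler normally sees, or -1 if unknown.
    static int defaultPort(String scheme) {
        if ("http".equalsIgnoreCase(scheme)) {
            return 80;
        }
        if ("https".equalsIgnoreCase(scheme)) {
            return 443;
        }
        return -1;
    }

    // Returns "host" when the port is absent or the scheme default, else "host:port".
    static String hostHeaderValue(URI uri) {
        String host = uri.getHost();
        int port = uri.getPort(); // -1 when the URL carries no explicit port
        if (port == -1 || port == defaultPort(uri.getScheme())) {
            return host;
        }
        return host + ":" + port;
    }
}
```

With this rule, https://www.apache.org:443/ would yield "Host: www.apache.org", while a
non-default port such as https://www.apache.org:8443/ would still be sent explicitly.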

I encountered this issue while trying to crawl a customer’s website. The very first request
to the seed URL was answered with a redirect that pointed back to the original URL itself, and
the job ended without fetching anything. The Simple History showed only status 301. Maybe
the web connector does not follow the link in the redirect correctly?

I could not trigger the redirect any other way: neither a browser nor cURL produced it.
ManifoldCF's web connector was the only client sending the port in the Host header, and it
was unable to crawl the website because of this behavior.

I was able to work around the issue by collaborating with the contractor who hosts the
customer's website. He added an exception for these requests. But in general, I think this
should be fixed, as such collaboration is not always possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
