manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1155) Web connector should not be sending the port number in request header field Host
Date Tue, 03 Mar 2015 07:18:04 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344655#comment-14344655 ]

Karl Wright commented on CONNECTORS-1155:
-----------------------------------------

I believe that this issue is covered in HTTPCLIENT-1513, which was fixed early in 2014.  We
should have the latest version of HttpClient in 1.8.2 and 2.0.2, so I'm resolving this ticket.


> Web connector should not be sending the port number in request header field Host
> --------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1155
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1155
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 1.7.2
>            Reporter: Denis Beck
>            Assignee: Karl Wright
>
> The web connector sends the port number in the Host request header field (e.g. Host:
> www.apache.org:443). This causes redirect rules for the host name to fail. The port number
> should not be part of the Host header.
> On the other hand, RFC 2616 section 14.23 (http://tools.ietf.org/html/rfc2616#section-14.23)
> says “The Host request-header field specifies the Internet host and port number of the resource
> being requested [...]”.
> I encountered this issue while trying to crawl a customer’s website. The very first
> call to the seed URL caused a redirect which contained a link to the original URL itself, and
> the job ended without fetching anything. The Simple History showed status 301 and nothing else.
> Maybe the web connector does not follow the link in the redirect correctly?
> The redirect could not be triggered any other way: I tried a browser and cURL. ManifoldCF's
> web connector was the only client sending the port number in the Host header, and it was not
> able to crawl the website because of this behavior.
> This issue was worked around by collaborating with the contractor who hosted the
> customer's website. He added an exception for these requests. But in general, I think this
> should be fixed, as such collaboration is not always possible.
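
For illustration, the Host header rule discussed above can be sketched in a few lines of
Python. RFC 2616 section 14.23 also states that a host without trailing port information
implies the default port for the requested service, so leaving out :443 for an https URL
should be equivalent to sending it. The names host_header_value and DEFAULT_PORTS are
hypothetical and are not part of ManifoldCF or HttpClient; this is only a sketch of the
rule, not the change made in HTTPCLIENT-1513.

```python
from urllib.parse import urlsplit

# Assumed default ports per scheme (hypothetical helper table).
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_header_value(url: str) -> str:
    """Return a Host header value for url, omitting the scheme's default port."""
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    if ":" in host:
        # IPv6 literal: urlsplit strips the brackets, so restore them.
        host = f"[{host}]"
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(parts.scheme):
        # No explicit port, or the default one: send the bare host name.
        return host
    return f"{host}:{port}"


if __name__ == "__main__":
    print("Host:", host_header_value("https://www.apache.org:443/"))
    print("Host:", host_header_value("https://www.apache.org:8443/"))
```

With this approach a server-side redirect rule that matches only the bare host name would
see the same Host value that a browser or cURL sends for a default-port URL.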



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
