Return-Path:
X-Original-To: apmail-manifoldcf-dev-archive@www.apache.org
Delivered-To: apmail-manifoldcf-dev-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id C8E551098D for ; Tue, 25 Nov 2014 10:34:12 +0000 (UTC)
Received: (qmail 75136 invoked by uid 500); 25 Nov 2014 10:34:12 -0000
Delivered-To: apmail-manifoldcf-dev-archive@manifoldcf.apache.org
Received: (qmail 75084 invoked by uid 500); 25 Nov 2014 10:34:12 -0000
Mailing-List: contact dev-help@manifoldcf.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@manifoldcf.apache.org
Delivered-To: mailing list dev@manifoldcf.apache.org
Received: (qmail 75072 invoked by uid 99); 25 Nov 2014 10:34:12 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 25 Nov 2014 10:34:12 +0000
Date: Tue, 25 Nov 2014 10:34:12 +0000 (UTC)
From: "Karl Wright (JIRA)"
To: dev@manifoldcf.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Commented] (CONNECTORS-1113) Web connection being dropped while still in use?
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/CONNECTORS-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224352#comment-14224352 ]

Karl Wright commented on CONNECTORS-1113:
-----------------------------------------

Hi Arcadius,

You need to look at the part of the user documentation having to do with session-based login. In order to make things work, you need to identify (a) a protected zone, identified by a regular expression, and (b) the equivalent of a "login sequence" -- which is a sequence of pages that AREN'T meant to be indexed, but are just meant to set cookies for subsequent fetches.
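
To make the two ideas concrete, here is a minimal conceptual sketch in Python. It is not ManifoldCF configuration and does not use any ManifoldCF API; the patterns, the `/hot/` entry page, and the `classify` helper are all hypothetical and only illustrate how a regular expression can mark out a protected zone while a separate list marks the pages that exist only to set cookies.

```python
import re

# Hypothetical protected zone: every URL under /hot/ on the example host.
PROTECTED_ZONE = re.compile(r"^http://mysite\.co\.uk(:80)?/hot/")

# Hypothetical "login sequence": pages fetched only so the server sets its
# session cookie. They are not meant to be indexed.
LOGIN_SEQUENCE = [re.compile(r"^http://mysite\.co\.uk(:80)?/hot/$")]


def classify(url: str) -> str:
    """Return how a crawler following this scheme would treat the URL."""
    if any(pattern.search(url) for pattern in LOGIN_SEQUENCE):
        return "cookie-setting"
    if PROTECTED_ZONE.search(url):
        return "protected"
    return "open"


for candidate in (
    "http://mysite.co.uk/hot/",
    "http://mysite.co.uk/hot/search/",
    "http://mysite.co.uk/about/",
):
    print(candidate, "->", classify(candidate))
```

In the real connector these choices are made on the Access Credentials tab rather than in code; the sketch only shows the distinction between the two kinds of pages.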
You obviously don't have to do an actual login, but you need to identify the page sequence that results in a properly set cookie.

Read about it here, under the Access Credentials tab, under "Session-based authentication":

http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#webrepository

Warning: this is NOT easy to do.

> Web connection being dropped while still in use?
> ------------------------------------------------
>
>                 Key: CONNECTORS-1113
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1113
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 1.7.2
>            Reporter: Arcadius Ahouansou
>
> Hello.
> I am using the ManifoldCF web crawler to crawl a web site and index it into Solr.
> I have noticed that for most websites everything is OK.
> However, for some, Manifold is unable to crawl, i.e., nothing is pushed to Solr, and the log shows entries like
> *Cancelling request execution*
> Please see below for more detail.
> At this point, I am not very sure what is causing this. It may have to do with the Gzip or the Keep-Alive header sent by the server?
> {code}
> DEBUG org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:122) 2014-11-24 02:15:51,710 (Thread-5783) - CookieSpec selected: compatibility
> DEBUG org.apache.http.client.protocol.RequestAuthCache.process(RequestAuthCache.java:75) 2014-11-24 02:15:51,712 (Thread-5783) - Auth cache not set in the context
> DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:217) 2014-11-24 02:15:51,714 (Thread-5783) - Opening connection {}->http://mysite.co.uk:80
> DEBUG org.apache.http.impl.conn.HttpClientConnectionOperator.connect(HttpClientConnectionOperator.java:120) 2014-11-24 02:15:51,746 (Thread-5783) - Connecting to mysite.co.uk/11.11.11.11:80
> DEBUG org.apache.http.impl.conn.HttpClientConnectionOperator.connect(HttpClientConnectionOperator.java:127) 2014-11-24 02:15:51,762 (Thread-5783) - Connection established 192.168.1.5:42919<->11.11.11.11:80
> DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:238) 2014-11-24 02:15:51,763 (Thread-5783) - Executing request GET /hot/search/ HTTP/1.1
> DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:243) 2014-11-24 02:15:51,763 (Thread-5783) - Target auth state: UNCHALLENGED
> DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:249) 2014-11-24 02:15:51,764 (Thread-5783) - Proxy auth state: UNCHALLENGED
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:124) 2014-11-24 02:15:51,764 (Thread-5783) - http-outgoing-1 >> GET /hot/search/ HTTP/1.1
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,765 (Thread-5783) - http-outgoing-1 >> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; webbot@crawler.net)
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,765 (Thread-5783) - http-outgoing-1 >> From: webbot@crawler.net
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,765 (Thread-5783) - http-outgoing-1 >> Accept: */*
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,766 (Thread-5783) - http-outgoing-1 >> Accept-Encoding: gzip,deflate
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,766 (Thread-5783) - http-outgoing-1 >> Host: mysite.co.uk:80
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127) 2014-11-24 02:15:51,766 (Thread-5783) - http-outgoing-1 >> Connection: Keep-Alive
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,766 (Thread-5783) - http-outgoing-1 >> "GET /hot/search/ HTTP/1.1[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,767 (Thread-5783) - http-outgoing-1 >> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; webbot@crawler.net)[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,768 (Thread-5783) - http-outgoing-1 >> "From: webbot@crawler.net[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783) - http-outgoing-1 >> "Accept: */*[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783) - http-outgoing-1 >> "Accept-Encoding: gzip,deflate[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783) - http-outgoing-1 >> "Host: mysite.co.uk:80[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783) - http-outgoing-1 >> "Connection: Keep-Alive[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783) - http-outgoing-1 >> "[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,841 (Thread-5783) - http-outgoing-1 << "HTTP/1.1 200 OK[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,842 (Thread-5783) - http-outgoing-1 << "Date: Mon, 24 Nov 2014 02:17:06 GMT[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,842 (Thread-5783) - http-outgoing-1 << "Server: Apache[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,843 (Thread-5783) - http-outgoing-1 << "Set-Cookie: ci_session=a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3B%7D1dec34150fe1ab15f341d355f6ebd0dc; expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,843 (Thread-5783) - http-outgoing-1 << "Set-Cookie: ci_session=a%3A6%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3Bs%3A4%3A%22lang%22%3BN%3B%7Df6625848d5ca7bf8d5db71617607bada; expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,843 (Thread-5783) - http-outgoing-1 << "Vary: Accept-Encoding[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783) - http-outgoing-1 << "Content-Encoding: gzip[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783) - http-outgoing-1 << "Content-Length: 20[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783) - http-outgoing-1 << "Keep-Alive: timeout=5, max=99[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783) - http-outgoing-1 << "Connection: Keep-Alive[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,847 (Thread-5783) - http-outgoing-1 << "Content-Type: text/html[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,847 (Thread-5783) - http-outgoing-1 << "[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:86) 2014-11-24 02:15:51,848 (Thread-5783) - http-outgoing-1 << "[0x1f][0x8b][0x8][0x0][0x0][0x0][0x0][0x0][0x0][0x3][0x3][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0x0]"
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:113) 2014-11-24 02:15:51,849 (Thread-5783) - http-outgoing-1 << HTTP/1.1 200 OK
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,849 (Thread-5783) - http-outgoing-1 << Date: Mon, 24 Nov 2014 02:17:06 GMT
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,849 (Thread-5783) - http-outgoing-1 << Server: Apache
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,850 (Thread-5783) - http-outgoing-1 << Set-Cookie: ci_session=a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3B%7D1dec34150fe1ab15f341d355f6ebd0dc; expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,850 (Thread-5783) - http-outgoing-1 << Set-Cookie: ci_session=a%3A6%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3Bs%3A4%3A%22lang%22%3BN%3B%7Df6625848d5ca7bf8d5db71617607bada; expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,850 (Thread-5783) - http-outgoing-1 << Vary: Accept-Encoding
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,851 (Thread-5783) - http-outgoing-1 << Content-Encoding: gzip
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,851 (Thread-5783) - http-outgoing-1 << Content-Length: 20
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,851 (Thread-5783) - http-outgoing-1 << Keep-Alive: timeout=5, max=99
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,852 (Thread-5783) - http-outgoing-1 << Connection: Keep-Alive
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116) 2014-11-24 02:15:51,852 (Thread-5783) - http-outgoing-1 << Content-Type: text/html
> DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:267) 2014-11-24 02:15:51,853 (Thread-5783) - Connection can be kept alive for 5000 MILLISECONDS
> DEBUG org.apache.http.client.protocol.ResponseProcessCookies.processCookies(ResponseProcessCookies.java:117) 2014-11-24 02:15:51,856 (Thread-5783) - Cookie accepted [ci_session="a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%2...", version:0, domain:mysite.co.uk, path:/, expiry:Wed Nov 23 02:17:06 GMT 2016]
> DEBUG org.apache.http.client.protocol.ResponseProcessCookies.processCookies(ResponseProcessCookies.java:117) 2014-11-24 02:15:51,860 (Thread-5783) - Cookie accepted [ci_session="a%3A6%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%2...", version:0, domain:mysite.co.uk, path:/, expiry:Wed Nov 23 02:17:06 GMT 2016]
> DEBUG org.apache.http.impl.execchain.ConnectionHolder.cancel(ConnectionHolder.java:140) 2014-11-24 02:15:51,866 (Thread-5783) - Cancelling request execution
> DEBUG org.apache.http.impl.conn.CPoolEntry.isExpired(CPoolEntry.java:81) 2014-11-24 02:15:57,017 (Idle cleanup thread) - Connection [id:1][route:{}->http://mysite.co.uk:80][state:null] expired @ Mon Nov 24 02:15:56 GMT 2014
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.close(LoggingManagedHttpClientConnection.java:79) 2014-11-24 02:15:57,019 (Idle cleanup thread) - http-outgoing-1: Close connection
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
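
A note on the 20-byte response body in the log above: the bytes can be decoded offline to see what the server actually sent. The sketch below is a small Python helper, not part of ManifoldCF or HttpClient; `wire_to_bytes` is a hypothetical name, and it assumes the body consists only of `[0x..]` escapes, as it does in this log entry. The body decompresses to zero bytes, which suggests (but does not prove) that the server returned an empty page to the crawler, leaving nothing to index.

```python
import gzip
import re

# Response body exactly as it appears in the Wire.java:86 log entry.
WIRE_BODY = (
    "[0x1f][0x8b][0x8][0x0][0x0][0x0][0x0][0x0][0x0][0x3]"
    "[0x3][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0x0]"
)


def wire_to_bytes(wire: str) -> bytes:
    """Convert a run of [0xNN] escapes from the wire log into raw bytes.

    Assumption: the input contains only [0xNN] escapes (no literal
    printable characters), which holds for the body logged here.
    """
    return bytes(int(h, 16) for h in re.findall(r"\[0x([0-9a-fA-F]{1,2})\]", wire))


body = wire_to_bytes(WIRE_BODY)
decoded = gzip.decompress(body)

# The length should agree with the Content-Length header in the log.
print("compressed length:", len(body))
print("decompressed length:", len(decoded))
```

If the decompressed length is zero while a browser receives a full page for the same URL, the difference most likely lies in the request (cookies, headers) rather than in gzip or Keep-Alive handling, which fits the session-cookie explanation in the comment.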