manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1113) Web connection being dropped while still in use?
Date Tue, 25 Nov 2014 10:34:12 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224352#comment-14224352 ]

Karl Wright commented on CONNECTORS-1113:
-----------------------------------------

Hi Arcadius,

You need to look at the part of the user documentation that covers session-based login.
To make this work, you need to identify (a) a protected zone, described by a regular
expression, and (b) the equivalent of a "login sequence", which is a sequence of pages
that AREN'T meant to be indexed but are fetched only to set cookies for subsequent fetches.
Obviously you don't have to do an actual login, but you do need to identify the page
sequence that results in a properly set cookie.  Read about it here, under the Access
Credentials tab, under "Session-based authentication":
http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html#webrepository
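
To make the two pieces concrete, here is a minimal conceptual sketch in Python. It is not ManifoldCF code and not how the web connector is configured (that is done in the UI, as the documentation describes). The regular expression, the login-sequence URL, and the function name are all hypothetical and only illustrate the idea: URLs inside the protected zone trigger a walk through the cookie-setting pages first, and those pages are never indexed.

```python
import re

# Hypothetical values for illustration only; not from any real configuration.
PROTECTED_ZONE = re.compile(r"^http://mysite\.co\.uk/hot/")
LOGIN_SEQUENCE = ["http://mysite.co.uk/"]  # pages fetched only to set cookies


def plan_fetches(url, have_session_cookie):
    """Return (url, should_index) pairs in the order they would be fetched."""
    plan = []
    if PROTECTED_ZONE.search(url) and not have_session_cookie:
        # Walk the cookie-setting pages first; none of them is indexed.
        plan.extend((page, False) for page in LOGIN_SEQUENCE)
    # The requested page itself is always fetched last and is indexed.
    plan.append((url, True))
    return plan


for step in plan_fetches("http://mysite.co.uk/hot/search/", False):
    print(step)
```

Once a valid cookie is held, or for a URL outside the zone, the plan contains only the requested page.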




Warning: this is NOT easy to do.

> Web connection being dropped while still in use?
> ------------------------------------------------
>
>                 Key: CONNECTORS-1113
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1113
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 1.7.2
>            Reporter: Arcadius Ahouansou
>
> Hello.
> I am using the ManifoldCF web crawler to crawl a web site and index it into Solr.
> I have noticed that for most websites everything is OK.
> However, for some, ManifoldCF is unable to crawl, i.e. nothing is pushed to Solr and the
> log shows entries like
> *Cancelling request execution*
> Please see below for more detail.
> At this point, I am not sure what is causing this. It may have to do with the Gzip
> or the Keep-Alive header sent by the server?
> {code}
> DEBUG org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:122)
2014-11-24 02:15:51,710 (Thread-5783) - CookieSpec selected: compatibility
> DEBUG org.apache.http.client.protocol.RequestAuthCache.process(RequestAuthCache.java:75)
2014-11-24 02:15:51,712 (Thread-5783) - Auth cache not set in the context
> DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:217)
2014-11-24 02:15:51,714 (Thread-5783) - Opening connection {}->http://mysite.co.uk:80
> DEBUG org.apache.http.impl.conn.HttpClientConnectionOperator.connect(HttpClientConnectionOperator.java:120)
2014-11-24 02:15:51,746 (Thread-5783) - Connecting to mysite.co.uk/11.11.11.11:80
> DEBUG org.apache.http.impl.conn.HttpClientConnectionOperator.connect(HttpClientConnectionOperator.java:127)
2014-11-24 02:15:51,762 (Thread-5783) - Connection established 192.168.1.5:42919<->11.11.11.11:80
> DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:238)
2014-11-24 02:15:51,763 (Thread-5783) - Executing request GET /hot/search/ HTTP/1.1
> DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:243)
2014-11-24 02:15:51,763 (Thread-5783) - Target auth state: UNCHALLENGED
> DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:249)
2014-11-24 02:15:51,764 (Thread-5783) - Proxy auth state: UNCHALLENGED
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:124)
2014-11-24 02:15:51,764 (Thread-5783) - http-outgoing-1 >> GET /hot/search/ HTTP/1.1
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127)
2014-11-24 02:15:51,765 (Thread-5783) - http-outgoing-1 >> User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler;
webbot@crawler.net)
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127)
2014-11-24 02:15:51,765 (Thread-5783) - http-outgoing-1 >> From: webbot@crawler.net
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127)
2014-11-24 02:15:51,765 (Thread-5783) - http-outgoing-1 >> Accept: */*
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127)
2014-11-24 02:15:51,766 (Thread-5783) - http-outgoing-1 >> Accept-Encoding: gzip,deflate
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127)
2014-11-24 02:15:51,766 (Thread-5783) - http-outgoing-1 >> Host: mysite.co.uk:80
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onRequestSubmitted(LoggingManagedHttpClientConnection.java:127)
2014-11-24 02:15:51,766 (Thread-5783) - http-outgoing-1 >> Connection: Keep-Alive
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,766 (Thread-5783)
- http-outgoing-1 >> "GET /hot/search/ HTTP/1.1[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,767 (Thread-5783)
- http-outgoing-1 >> "User-Agent: Mozilla/5.0 (ApacheManifoldCFWebCrawler; webbot@crawler.net)[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,768 (Thread-5783)
- http-outgoing-1 >> "From: webbot@crawler.net[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783)
- http-outgoing-1 >> "Accept: */*[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783)
- http-outgoing-1 >> "Accept-Encoding: gzip,deflate[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783)
- http-outgoing-1 >> "Host: mysite.co.uk:80[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783)
- http-outgoing-1 >> "Connection: Keep-Alive[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,769 (Thread-5783)
- http-outgoing-1 >> "[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,841 (Thread-5783)
- http-outgoing-1 << "HTTP/1.1 200 OK[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,842 (Thread-5783)
- http-outgoing-1 << "Date: Mon, 24 Nov 2014 02:17:06 GMT[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,842 (Thread-5783)
- http-outgoing-1 << "Server: Apache[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,843 (Thread-5783)
- http-outgoing-1 << "Set-Cookie: ci_session=a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3B%7D1dec34150fe1ab15f341d355f6ebd0dc;
expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,843 (Thread-5783)
- http-outgoing-1 << "Set-Cookie: ci_session=a%3A6%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3Bs%3A4%3A%22lang%22%3BN%3B%7Df6625848d5ca7bf8d5db71617607bada;
expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,843 (Thread-5783)
- http-outgoing-1 << "Vary: Accept-Encoding[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783)
- http-outgoing-1 << "Content-Encoding: gzip[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783)
- http-outgoing-1 << "Content-Length: 20[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783)
- http-outgoing-1 << "Keep-Alive: timeout=5, max=99[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,844 (Thread-5783)
- http-outgoing-1 << "Connection: Keep-Alive[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,847 (Thread-5783)
- http-outgoing-1 << "Content-Type: text/html[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:72) 2014-11-24 02:15:51,847 (Thread-5783)
- http-outgoing-1 << "[\r][\n]"
> DEBUG org.apache.http.impl.conn.Wire.wire(Wire.java:86) 2014-11-24 02:15:51,848 (Thread-5783)
- http-outgoing-1 << "[0x1f][0x8b][0x8][0x0][0x0][0x0][0x0][0x0][0x0][0x3][0x3][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0x0]"
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:113)
2014-11-24 02:15:51,849 (Thread-5783) - http-outgoing-1 << HTTP/1.1 200 OK
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116)
2014-11-24 02:15:51,849 (Thread-5783) - http-outgoing-1 << Date: Mon, 24 Nov 2014 02:17:06
GMT
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116)
2014-11-24 02:15:51,849 (Thread-5783) - http-outgoing-1 << Server: Apache
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116)
2014-11-24 02:15:51,850 (Thread-5783) - http-outgoing-1 << Set-Cookie: ci_session=a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3B%7D1dec34150fe1ab15f341d355f6ebd0dc;
expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116)
2014-11-24 02:15:51,850 (Thread-5783) - http-outgoing-1 << Set-Cookie: ci_session=a%3A6%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%22ip_address%22%3Bs%3A12%3A%2210.190.254.5%22%3Bs%3A10%3A%22user_agent%22%3Bs%3A59%3A%22Mozilla%2F5.0+%28ApacheManifoldCFWebCrawler%3B+webbot%40crawler.net%29%22%3Bs%3A13%3A%22last_activity%22%3Bi%3A1416795426%3Bs%3A9%3A%22user_data%22%3Bs%3A0%3A%22%22%3Bs%3A4%3A%22lang%22%3BN%3B%7Df6625848d5ca7bf8d5db71617607bada;
expires=Wed, 23-Nov-2016 02:17:06 GMT; path=/
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116)
2014-11-24 02:15:51,850 (Thread-5783) - http-outgoing-1 << Vary: Accept-Encoding
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116)
2014-11-24 02:15:51,851 (Thread-5783) - http-outgoing-1 << Content-Encoding: gzip
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116)
2014-11-24 02:15:51,851 (Thread-5783) - http-outgoing-1 << Content-Length: 20
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116)
2014-11-24 02:15:51,851 (Thread-5783) - http-outgoing-1 << Keep-Alive: timeout=5, max=99
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116)
2014-11-24 02:15:51,852 (Thread-5783) - http-outgoing-1 << Connection: Keep-Alive
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.onResponseReceived(LoggingManagedHttpClientConnection.java:116)
2014-11-24 02:15:51,852 (Thread-5783) - http-outgoing-1 << Content-Type: text/html
> DEBUG org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:267)
2014-11-24 02:15:51,853 (Thread-5783) - Connection can be kept alive for 5000 MILLISECONDS
> DEBUG org.apache.http.client.protocol.ResponseProcessCookies.processCookies(ResponseProcessCookies.java:117)
2014-11-24 02:15:51,856 (Thread-5783) - Cookie accepted [ci_session="a%3A5%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%2...",
version:0, domain:mysite.co.uk, path:/, expiry:Wed Nov 23 02:17:06 GMT 2016]
> DEBUG org.apache.http.client.protocol.ResponseProcessCookies.processCookies(ResponseProcessCookies.java:117)
2014-11-24 02:15:51,860 (Thread-5783) - Cookie accepted [ci_session="a%3A6%3A%7Bs%3A10%3A%22session_id%22%3Bs%3A32%3A%2248df265e57a5bc5b7ded4175ef109fe0%22%3Bs%3A10%3A%2...",
version:0, domain:mysite.co.uk, path:/, expiry:Wed Nov 23 02:17:06 GMT 2016]
> DEBUG org.apache.http.impl.execchain.ConnectionHolder.cancel(ConnectionHolder.java:140)
2014-11-24 02:15:51,866 (Thread-5783) - Cancelling request execution
> DEBUG org.apache.http.impl.conn.CPoolEntry.isExpired(CPoolEntry.java:81) 2014-11-24 02:15:57,017
(Idle cleanup thread) - Connection [id:1][route:{}->http://mysite.co.uk:80][state:null]
expired @ Mon Nov 24 02:15:56 GMT 2014
> DEBUG org.apache.http.impl.conn.LoggingManagedHttpClientConnection.close(LoggingManagedHttpClientConnection.java:79)
2014-11-24 02:15:57,019 (Idle cleanup thread) - http-outgoing-1: Close connection
> {code}
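
On the Gzip question raised above, the 20 body bytes printed in the Wire log entry can be checked offline. The short Python sketch below copies those bytes from the log and decompresses them with the standard library; the variable names are illustrative only.

```python
import gzip

# The 20 body bytes exactly as printed in the Wire log entry above.
body = bytes([0x1f, 0x8b, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03,
              0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00])

# A well-formed gzip member decompresses without raising an error.
decoded = gzip.decompress(body)

print(len(body), len(decoded))  # prints "20 0"
```

The bytes form a well-formed gzip stream whose decompressed content is empty, which matches the `Content-Length: 20` header. That suggests the compression itself is not at fault and that the server returned an empty page. An empty page would be consistent with a request that lacks the session cookie the site expects, although the log alone does not prove the cause.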



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
