manifoldcf-user mailing list archives

From Erlend Garåsen <e.f.gara...@usit.uio.no>
Subject Timeout problems with web crawling
Date Tue, 23 Apr 2013 10:22:51 GMT

I'm still having problems with web crawling using trunk with the updated
HttpClient. The problems seem to occur when Solr is password protected,
even though the error messages in my logs indicate a timeout problem.
I'm not 100% sure, but the problems seem to start as soon as I enable
password protection.

We have struggled a lot with the web crawler in production recently,
but I thought we had worked around these problems when the
"Expect: 100-continue" header was added (now in trunk). Then we
discovered a Resin bug that sent the wrong HTTP status code back when
this header was enabled, but we solved that by moving the
authentication configuration to the Apache HTTP Server instead (using
.htaccess). So everything *should* work, but it doesn't. I have now
managed to reproduce the problems on our test server as well, after
adding full password protection for the Solr test server. As I wrote
above, the logs do not seem to report problems with the Solr server,
but with the crawled resources instead.
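
For reference, the header combination described above (HTTP Basic
authentication together with the "Expect: 100-continue" header) can be
sketched as follows. This is only an illustration in Python, not the
ManifoldCF or HttpClient code; the function name, the credentials and
the content type are hypothetical, and nothing is sent over the network.

```python
import base64


def build_update_headers(username: str, password: str) -> dict:
    """Return request headers for a password-protected update request.

    Hypothetical helper for illustration only: it combines HTTP Basic
    authentication with the Expect: 100-continue header.
    """
    # Basic auth is "username:password", base64-encoded.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {
        "Authorization": f"Basic {token}",
        # Ask the server to accept or reject the request (for example
        # with 401) before the client sends the request body.
        "Expect": "100-continue",
        # Assumed content type for the example.
        "Content-Type": "application/xml",
    }


# Placeholder credentials, not real ones.
headers = build_update_headers("solr", "secret")
for name, value in headers.items():
    print(f"{name}: {value}")
```

With this header, a server that rejects the credentials can answer
before the document body is transmitted, which is why the status code
returned at that stage matters.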

I have added two logs: one from the production server and another from
the test server. Log level is set to DEBUG for HttpClient. The prod job
just stops and hangs, maybe due to a db lock. The test job stops with
the message "Error: Repeated service interruptions - failure processing
document: null" ("read timed out" in simple history).

The logs are available here:
http://folk.uio.no/erlendfg/manifoldcf/

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
