manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Timeout problems with web crawling
Date Tue, 23 Apr 2013 10:50:25 GMT
Solr indexing seems to be working fine on the test host.  I haven't
verified that the same is true on the production host.  The cause of the
production host hanging, though, may be the really awful stuffer query
plan.  It appears to hang, but in fact it just gets very, very slow.

Can you dump the postgresql schema that is in place on the production
machine?  Specifically, I want to see the jobqueue table's indexes.
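
One possible way to collect that information is sketched below in Python. It is only an illustration under stated assumptions: the helper name `schema_dump_command`, the constant `INDEX_QUERY`, and the database name used in the usage note are hypothetical placeholders; only the `jobqueue` table name comes from this thread. The function builds a `pg_dump` command line restricted to the schema of one table, and the query lists that table's index definitions from the standard `pg_indexes` view.

```python
import shlex

# Parameterized query against PostgreSQL's pg_indexes system view.
# It returns one row per index on the given table, with the full
# CREATE INDEX statement in the indexdef column.
INDEX_QUERY = (
    "SELECT indexname, indexdef "
    "FROM pg_indexes "
    "WHERE tablename = %s "
    "ORDER BY indexname"
)


def schema_dump_command(database, table="jobqueue", user=None):
    """Build a pg_dump argument list that dumps DDL only, for one table.

    `database` and `user` are placeholders; substitute whatever the
    production installation actually uses.
    """
    args = ["pg_dump", "--schema-only", "--table=" + table]
    if user is not None:
        args.append("--username=" + user)
    args.append(database)  # the database name is the final positional argument
    return args


def as_shell(args):
    """Render an argument list as a copy-and-paste shell command."""
    return shlex.join(args)
```

For example, `as_shell(schema_dump_command("dbname"))` yields a command that can be run on the production machine and its output attached to a reply, with `"dbname"` replaced by the real database name. `INDEX_QUERY` can be run from any PostgreSQL client, passing `jobqueue` as the parameter.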

I do not see any exceptions at all logged in either place.  If there's a
service interruption, usually a warning log entry is dumped.  I'm not
seeing that, though.




On Tue, Apr 23, 2013 at 6:22 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:

>
> I'm still having problems with web crawling using trunk with updated Http
> client. It seems that the problems occur when Solr is password protected
> even though the error messages in my logs indicate a timeout problem. I'm
> not 100% sure, but it seems that the problem starts as soon as I
> enable password protection.
>
> We have struggled a lot with the web crawler in production mode recently,
> but I thought that we managed to get around these problems when "expect 100
> continue" was added to the header (now added in trunk). Then we discovered
> a Resin bug which sent a wrong http status code back when this header was
> enabled, but this has been solved by moving the authentication
> configuration to Apache HTTP server instead (using .htaccess). So
> everything *should* work, but it doesn't. Now I have managed to reproduce
> the problems on our test server as well when I added full password
> protection for the Solr test server. As I wrote above, the logs do not
> seem to report problems with the Solr server, but the crawled resources
> instead.
>
> I have added two logs. One from the production server, and another from
> the test server. Log level is set to DEBUG for HttpClient. The prod job
> just stops and hangs, maybe due to a db lock. The test stops with the
> message "Error: Repeated service interruptions - failure processing
> document: null" ("read timed out" in simple history).
>
> The logs are available here:
> http://folk.uio.no/erlendfg/manifoldcf/
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
> 31050
>
