manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: [ManifoldCF] Crawling with the WEB repository connector causes Repeated service interruptions
Date Fri, 16 Mar 2012 10:53:15 GMT
Hi Shigeki,

A "service interruption" means that a connector (either a repository
connector like the web connector or an output connector like the Solr
connector) could not communicate with the configured service.

"Repeated service interruptions" means that certain URLs failed to
fetch properly even after a pattern of retries which lasted many
hours.  ManifoldCF connectors deal with such errors in one of several
ways, depending on the exact details of the error:

- ignore it and proceed
- retry periodically for some time interval, and then give up and proceed
- retry periodically for some time interval, and then shut down the job
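
The three outcomes above can be pictured as a small decision function.
This is only an illustrative sketch, not ManifoldCF's actual API: the
class, enum, and method names are hypothetical, and the six-hour retry
window is an assumed value chosen for the example.

```java
import java.time.Duration;

// Hypothetical sketch of the three error-handling outcomes described above.
// None of these names come from ManifoldCF itself.
final class RetryPolicySketch {
    // How a connector might classify a given error.
    enum OnFailure { IGNORE, RETRY_THEN_SKIP, RETRY_THEN_ABORT }

    // What happens next for the affected document or job.
    enum Action { PROCEED, RETRY_LATER, ABORT_JOB }

    static Action decide(OnFailure policy, Duration sinceFirstFailure, Duration retryWindow) {
        if (policy == OnFailure.IGNORE) {
            return Action.PROCEED;              // ignore it and proceed
        }
        if (sinceFirstFailure.compareTo(retryWindow) < 0) {
            return Action.RETRY_LATER;          // still inside the retry window
        }
        // Retry window exhausted: either give up on the document or stop the job.
        return policy == OnFailure.RETRY_THEN_SKIP ? Action.PROCEED : Action.ABORT_JOB;
    }

    public static void main(String[] args) {
        Duration window = Duration.ofHours(6);  // assumed value, for illustration only
        System.out.println(decide(OnFailure.RETRY_THEN_ABORT, Duration.ofHours(1), window));
        System.out.println(decide(OnFailure.RETRY_THEN_ABORT, Duration.ofHours(7), window));
    }
}
```

In this sketch, a job ending with "Repeated service interruptions"
corresponds to the ABORT_JOB branch: the retry window ran out under a
policy that stops the job rather than skipping the document.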

It sounds like your job has encountered an error of the last kind.  The
message "Error: Repeated service interruptions - failure processing
document: Ingestion HTTP error code 500" indicates that the problem lies
in communication with Solr.  Apparently certain documents you are
indexing cause Solr to return HTTP status 500 ("internal server
error"), which is usually associated with an exception inside Solr.
You will need to diagnose why this happens and take corrective steps
before your ManifoldCF job can complete successfully.

"Job no longer active" is harmless - it's a side effect of the job
shutting down.  When a job is shutting down, active document
processing cannot always be interrupted within a connector, but the
framework helps it to stop quickly by throwing this exception.
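
The general pattern here is cooperative cancellation: a long-running
loop periodically asks whether its job is still active and throws if it
is not.  A minimal sketch follows, assuming a shared flag stands in for
the framework's check; the class, exception, and buffer size are
hypothetical and are not ManifoldCF's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of cooperative cancellation during a document copy.
final class CancellableCopySketch {
    // Stand-in for the exception a framework would throw to unwind the connector.
    static final class JobInactiveException extends Exception {
        JobInactiveException(String message) { super(message); }
    }

    // Copies in chunks, checking the job flag before each read so that a
    // shutdown is noticed after at most one more chunk.
    static long copy(InputStream in, OutputStream out, AtomicBoolean jobActive)
            throws IOException, JobInactiveException {
        byte[] buffer = new byte[65536];        // arbitrary chunk size
        long total = 0;
        while (true) {
            if (!jobActive.get()) {
                throw new JobInactiveException("Job no longer active");
            }
            int n = in.read(buffer);
            if (n == -1) {
                return total;                   // end of stream: copy finished
            }
            out.write(buffer, 0, n);
            total += n;
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicBoolean jobActive = new AtomicBoolean(true);
        InputStream in = new ByteArrayInputStream(new byte[200_000]);
        System.out.println("copied " + copy(in, new ByteArrayOutputStream(), jobActive));

        jobActive.set(false);                   // simulate the job shutting down
        try {
            copy(new ByteArrayInputStream(new byte[10]), new ByteArrayOutputStream(), jobActive);
        } catch (JobInactiveException e) {
            System.out.println("interrupted: " + e.getMessage());
        }
    }
}
```

The exception in such a design is a signal to unwind, not a fault in
the document being fetched, which is why it can be disregarded when it
appears during shutdown.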

Thanks,
Karl


2012/3/16 小林 茂樹(情報システム本部 / サービス企画部) <shigeki.kobayashi3@g.softbank.co.jp>:
>
> I was crawling web sites with links to html and pdf files on the provided
> multiprocess-example agent for a few hours, then Simple History started
> showing -104 result code with a message saying "Interrupted: Job no longer
> active".
>
> After the same error occurred repeatedly around 40 times, the job status
> became "Aborting" and then ended up with "Error: Repeated service
> interruptions - failure processing document: Ingestion HTTP error code
> 500".
>
> The job was interrupted and stopped.
>
> Does anyone know what situation causes "Repeated service interruptions" and
> stops a job?
> Also, under what circumstances does error status code -104 occur? What does
> code -104 mean?
>
> If you have any ideas, please advise me on how to avoid this error.
>
>
> I am using the following:
>
> Solr 1.4 (Extracting Request Handler is set)
> ManifoldCF 0.4 (multiprocess-example)
> - Repository connector: WEB
> - Output connector: Solr
> Tomcat 6.0.29
> PostgreSQL 9.1.3
>
>
> Here is MCF’s debug log right before the job was interrupted:
>
> DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Attempting to get
> connection to http://xx.xx.xx.xx:80 (95697 ms)
> DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Waiting 3895 ms
> before starting fetch on http://xx.xx.xx.xx:80
> DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Attempting to get
> connection to http://xx.xx.xx.xx:80 (99593 ms)
> DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Successfully got
> connection to http://xx.xx.xx.xx:80 (99593 ms)
> DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Waiting for an
> HttpClient object
> DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Got an HttpClient
> object after 0 ms.
> DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Get method for
> '/xx/xx.pdf'
> DEBUG 2012-03-15 20:04:20,222 (Worker thread '4') - WEB: For
> http://xx.xx/xx/xx.pdf, setting virtual host to xx.xx
> DEBUG 2012-03-15 20:04:20,315 (Worker thread '4') - WEB: Performing a read
> wait on bin 'xx.xx' of 128 ms.
> DEBUG 2012-03-15 20:04:20,445 (Worker thread '4') - WEB: Performing a read
> wait on bin 'xx.xx' of 62 ms.
> DEBUG 2012-03-15 20:04:20,509 (Worker thread '4') - WEB: Performing a read
> wait on bin 'xx.xx' of 62 ms.
> DEBUG 2012-03-15 20:04:20,573 (Worker thread '4') - WEB: Performing a read
> wait on bin 'xx.xx' of 62 ms.
> DEBUG 2012-03-15 20:04:20,637 (Worker thread '4') - WEB: Performing a read
> wait on bin 'xx.xx' of 62 ms.
> DEBUG 2012-03-15 20:04:20,701 (Worker thread '4') - WEB: Performing a read
> wait on bin 'xx.xx' of 62 ms.
> DEBUG 2012-03-15 20:04:20,765 (Worker thread '4') - WEB: Performing a read
> wait on bin 'xx.xx' of 62 ms.
> DEBUG 2012-03-15 20:04:20,829 (Worker thread '4') - WEB: Performing a read
> wait on bin 'xx.xx' of 62 ms.
> DEBUG 2012-03-15 20:04:20,893 (Worker thread '4') - WEB: Performing a read
> wait on bin 'xx.xx' of 62 ms.
> DEBUG 2012-03-15 20:04:20,957 (Worker thread '4') - WEB: Performing a read
> wait on bin 'xx.xx' of 62 ms.
> DEBUG 2012-03-15 20:04:21,021 (Worker thread '4') - WEB: Performing a read
> wait on bin 'xx.xx' of 62 ms.
> DEBUG 2012-03-15 20:04:21,085 (Worker thread '4') - WEB: Performing a read
> wait on bin 'xx.xx' of 62 ms.
> DEBUG 2012-03-15 20:04:21,149 (Worker thread '4') - WEB: Performing a read
> wait on bin 'xx.xx' of 62 ms.
> DEBUG 2012-03-15 20:04:21,213 (Worker thread '4') - WEB: Performing a read
> wait on bin 'xx.xx' of 62 ms.
> DEBUG 2012-03-15 20:04:21,277 (Worker thread '4') - WEB: Performing a read
> wait on bin 'xx.xx' of 62 ms.
>  INFO 2012-03-15 20:04:21,344 (Worker thread '4') - WEB: FETCH
> URL|http://xx.xx/xx/xx.pdf|1331809460221+1122|-104|65536|org.apache.manifoldcf.core.interfaces.ManifoldCFException|
> Interrupted: Job no longer active
> DEBUG 2012-03-15 20:04:21,344 (Worker thread '4') - WEB: Fetch exception for
> 'http://xx.xx/xx/xx.pdf'
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Interrupted: Job
> no longer active
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.noteInterrupted(ThrottledFetcher.java:1735)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:743)
>         at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
> Caused by: org.apache.manifoldcf.agents.interfaces.ServiceInterruption: Job
> no longer active
>         at
> org.apache.manifoldcf.crawler.system.WorkerThread$VersionActivity.checkJobStillActive(WorkerThread.java:1223)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:135)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:713)
>         ... 1 more
>  WARN 2012-03-15 20:04:21,345 (Worker thread '4') - Pre-ingest service
> interruption reported for job 1331716457096 connection 'web': Job no longer
> active
> DEBUG 2012-03-15 20:04:23,871 (Job reset thread) - Stopped job 1331716457096
> DEBUG 2012-03-15 20:04:24,236 (Job notification thread) - Found job
> 1331716457096 in need of notification
