manifoldcf-user mailing list archives

From Shinichiro Abe <shinichiro.ab...@gmail.com>
Subject Re: [ManifoldCF] Crawling with the WEB repository connector causes Repeated service interruptions
Date Mon, 19 Mar 2012 04:13:49 GMT
Hi,

Currently MCF can't ignore a 500 server error returned by Solr.
If you can upgrade to Solr 3.2, you can set the ignoreTikaException
parameter so that Solr itself tolerates Tika failures.
https://issues.apache.org/jira/browse/SOLR-2480
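
For illustration, here is a minimal sketch of what this might look like in
solrconfig.xml on Solr 3.2 or later. The handler path /update/extract and the
startup="lazy" attribute follow the Solr example configuration and are
assumptions, not taken from your setup.

```xml
<!-- Assumed handler registration, modeled on the Solr example solrconfig.xml -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <!-- From SOLR-2480: log and skip text-extraction failures from Tika
         instead of failing the whole request -->
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>
```

With this set, a document that Tika cannot parse should no longer produce a
500 response, so MCF would not see a service interruption for it.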
Hope that helps.

Regards, 
Shinichiro Abe

On 2012/03/19, at 12:55, Shigeki Kobayashi wrote:

> Karl,
> 
> 
> Thanks for your reply.
> 
> It seems that Tika failed to extract text from PDF files while crawling
> down the web links. I confirmed there was a Tika exception following the
> Solr exception.
> 
> So Solr, on detecting the Tika exception, returns status code 500, and MCF
> then retries the ingestion a number of times:
> 
> "500 from ingestion request; ingestion will be retried again later"
> 
> In the end, MCF shuts down the entire job.
> 
> I know I should upgrade Solr (and with it Tika) to improve document
> extraction. But since even the current version of Tika still sometimes
> fails to extract documents, I feel it would make more sense for MCF to
> ignore ingestion errors caused by Tika and proceed.
>   
> Have users requested that MCF ignore document ingestion failures caused by
> Tika and proceed, perhaps for the next release?
> 
> Are there any options that let users have MCF ignore such ingestion errors
> and proceed?
> 
> 
> regards,
> 
> Shigeki
> 
> 2012/3/16 Karl Wright <daddywri@gmail.com>
> Hi Shigeki,
> 
> A "service interruption" means that a connector (either a repository
> connector like the web connector or an output connector like the Solr
> connector) could not communicate with the configured service.
> 
> "Repeated service interruptions" means that certain URLs failed to
> fetch properly even after a pattern of retries which lasted many
> hours.  ManifoldCF connectors deal with such errors in one of several
> ways, depending on the exact details of the error:
> 
> - ignore it and proceed
> - retry periodically for some time interval, and then give up and proceed
> - retry periodically for some time interval, and then shut down the job
> 
> It sounds like your job has encountered one of the latter errors.  The
> "Error: Repeated service interruptions - failure processing document:
> Ingestion HTTP error code 500" indicates that the problem is due to
> communication with Solr.  Apparently certain documents you are
> indexing are causing Solr to return an error code 500, which is an
> "internal server error", and is usually associated with a Solr
> exception.  You will need to diagnose why this is, and take corrective
> steps, in order for your ManifoldCF job to complete successfully.
> 
> "Job no longer active" is harmless - it's a side effect of the job
> shutting down.  When a job is shutting down, active document
> processing cannot always be interrupted within a connector, but the
> framework helps it to stop quickly by throwing this exception.
> 
> Thanks,
> Karl
> 
> 
> 2012/3/16 Shigeki Kobayashi (Information Systems Division / Service Planning Department) <shigeki.kobayashi3@g.softbank.co.jp>:
> >
> > I had been crawling web sites that link to HTML and PDF files with the
> > provided multiprocess-example agent for a few hours when Simple History
> > started showing a -104 result code with the message "Interrupted: Job no
> > longer active".
> >
> > After the same error had occurred around 40 times, the job status
> > became "Aborting" and then ended up as "Error: Repeated service
> > interruptions - failure processing document: Ingestion HTTP error
> > code 500".
> >
> > The job was interrupted and stopped.
> >
> > Does anyone know what situation causes "Repeated service interruptions"
> > and stops jobs?
> > Also, under what circumstances does error status code -104 occur, and
> > what does it mean?
> >
> > If you have any ideas, please advise me on how to avoid this error.
> >
> >
> > I am using the following:
> >
> > Solr 1.4 (Extracting Request Handler is set)
> > ManifoldCF 0.4 (multiprocess-example)
> > - Repository connector: WEB
> > - Output connector: Solr
> > Tomcat 6.0.29
> > PostgreSQL 9.1.3
> >
> >
> > Here is MCF’s debug log right before the job was interrupted:
> >
> > DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Attempting to get
> > connection to http://xx.xx.xx.xx:80 (95697 ms)
> > DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Waiting 3895 ms
> > before starting fetch on http://xx.xx.xx.xx:80
> > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Attempting to get
> > connection to http://xx.xx.xx.xx:80 (99593 ms)
> > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Successfully got
> > connection to http://xx.xx.xx.xx:80 (99593 ms)
> > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Waiting for an
> > HttpClient object
> > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Got an HttpClient
> > object after 0 ms.
> > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Get method for
> > '/xx/xx.pdf'
> > DEBUG 2012-03-15 20:04:20,222 (Worker thread '4') - WEB: For
> > http://xx.xx/xx/xx.pdf, setting virtual host to xx.xx
> > DEBUG 2012-03-15 20:04:20,315 (Worker thread '4') - WEB: Performing a read
> > wait on bin 'xx.xx' of 128 ms.
> > DEBUG 2012-03-15 20:04:20,445 (Worker thread '4') - WEB: Performing a read
> > wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,509 (Worker thread '4') - WEB: Performing a read
> > wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,573 (Worker thread '4') - WEB: Performing a read
> > wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,637 (Worker thread '4') - WEB: Performing a read
> > wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,701 (Worker thread '4') - WEB: Performing a read
> > wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,765 (Worker thread '4') - WEB: Performing a read
> > wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,829 (Worker thread '4') - WEB: Performing a read
> > wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,893 (Worker thread '4') - WEB: Performing a read
> > wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,957 (Worker thread '4') - WEB: Performing a read
> > wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:21,021 (Worker thread '4') - WEB: Performing a read
> > wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:21,085 (Worker thread '4') - WEB: Performing a read
> > wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:21,149 (Worker thread '4') - WEB: Performing a read
> > wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:21,213 (Worker thread '4') - WEB: Performing a read
> > wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:21,277 (Worker thread '4') - WEB: Performing a read
> > wait on bin 'xx.xx' of 62 ms.
> >  INFO 2012-03-15 20:04:21,344 (Worker thread '4') - WEB: FETCH
> > URL|http://xx.xx/xx/xx.pdf|1331809460221+1122|-104|65536|org.apache.manifoldcf.core.interfaces.ManifoldCFException|
> > Interrupted: Job no longer active
> > DEBUG 2012-03-15 20:04:21,344 (Worker thread '4') - WEB: Fetch exception for
> > 'http://xx.xx/xx/xx.pdf'
> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Interrupted: Job
> > no longer active
> >         at
> > org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.noteInterrupted(ThrottledFetcher.java:1735)
> >         at
> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:743)
> >         at
> > org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
> > Caused by: org.apache.manifoldcf.agents.interfaces.ServiceInterruption: Job
> > no longer active
> >         at
> > org.apache.manifoldcf.crawler.system.WorkerThread$VersionActivity.checkJobStillActive(WorkerThread.java:1223)
> >         at
> > org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:135)
> >         at
> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:713)
> >         ... 1 more
> >  WARN 2012-03-15 20:04:21,345 (Worker thread '4') - Pre-ingest service
> > interruption reported for job 1331716457096 connection 'web': Job no longer
> > active
> > DEBUG 2012-03-15 20:04:23,871 (Job reset thread) - Stopped job 1331716457096
> > DEBUG 2012-03-15 20:04:24,236 (Job notification thread) - Found job
> > 1331716457096 in need of notification
> 
> 
> 
> -- 
> ~~~~~~~~~~~~~~~~~~~~~~~~
>  SoftBank Mobile Corp.
>  Information Systems Division
>  System Service Business Management Department
>  Service Planning Department
>  
>  Shigeki Kobayashi
>  shigeki.kobayashi3@g.softbank.co.jp
> ~~~~~~~~~~~~~~~~~~~~~~~~
>  
> 
> 

