manifoldcf-user mailing list archives

From Shigeki Kobayashi <shigeki.kobayas...@g.softbank.co.jp>
Subject Re: [ManifoldCF] Crawling with the WEB repository connector causes Repeated service interruptions
Date Mon, 19 Mar 2012 03:55:50 GMT
Karl,


Thanks for your reply.

It seems that Tika failed to extract text from some PDF files while the job
was crawling web links. I confirmed that the Solr exceptions were followed by
Tika exceptions.

So when Solr hits a Tika exception it returns status code 500, and MCF then
retries the ingestion a certain number of times:

"500 from ingestion request; ingestion will be retried again later"

In the end, MCF shuts down the entire job.

I know I should upgrade Solr (and Tika along with it) to get better document
extraction. But even the current version of Tika still fails to extract some
documents, so I feel it would make more sense for MCF to ignore ingestion
errors caused by Tika and proceed.

Have users requested a feature where MCF ignores a document ingestion failure
caused by Tika and proceeds, perhaps for the next release?

Is there any option users can set to make MCF ignore such an ingestion error
and proceed?
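
To make the request concrete, here is a minimal sketch of the three failure
policies described in the quoted reply below, with a selectable "retry, then
skip the document" behavior. It is only an illustration: the class, enum, and
method names are hypothetical and do not come from ManifoldCF, and the retry
limit is an assumed parameter.

```java
// Hypothetical sketch: none of these names exist in ManifoldCF itself.
class IngestionPolicySketch {

    /** The three ways a failure can be handled, per the quoted reply. */
    enum OnFailure { IGNORE, RETRY_THEN_SKIP, RETRY_THEN_ABORT }

    /** What the crawler does next for the document that was just sent. */
    enum Action { PROCEED, RETRY_LATER, SKIP_DOCUMENT, ABORT_JOB }

    /**
     * Decide the next step after one ingestion attempt.
     *
     * @param httpStatus    status code returned by the index (e.g. 500)
     * @param attemptsSoFar attempts made, including the one that just finished
     * @param maxAttempts   assumed limit before giving up on the document
     * @param policy        which of the three behaviors is configured
     */
    static Action decide(int httpStatus, int attemptsSoFar, int maxAttempts,
                         OnFailure policy) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return Action.PROCEED;            // ingestion succeeded
        }
        if (policy == OnFailure.IGNORE) {
            return Action.SKIP_DOCUMENT;      // ignore it and proceed
        }
        if (attemptsSoFar < maxAttempts) {
            return Action.RETRY_LATER;        // still inside the retry window
        }
        // Retries exhausted: either drop this one document or stop the job.
        return policy == OnFailure.RETRY_THEN_SKIP
                ? Action.SKIP_DOCUMENT
                : Action.ABORT_JOB;
    }

    public static void main(String[] args) {
        // A 500 that persists through three attempts under each policy.
        for (OnFailure policy : OnFailure.values()) {
            System.out.println(policy + " -> " + decide(500, 3, 3, policy));
        }
    }
}
```

With RETRY_THEN_ABORT the sketch mirrors what I am seeing now (the job stops);
RETRY_THEN_SKIP is the behavior I am asking about.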


regards,

Shigeki

2012/3/16 Karl Wright <daddywri@gmail.com>

> Hi Shigeki,
>
> A "service interruption" means that a connector (either a repository
> connector like the web connector or an output connector like the Solr
> connector) could not communicate with the configured service.
>
> "Repeated service interruptions" means that certain URLs failed to
> fetch properly even after a pattern of retries which lasted many
> hours.  ManifoldCF connectors deal with such errors in one of several
> ways, depending on the exact details of the error:
>
> - ignore it and proceed
> - retry periodically for some time interval, and then give up and proceed
> - retry periodically for some time interval, and then shut down the job
>
> It sounds like your job has encountered one of the latter errors.  The
> "Error: Repeated service interruptions - failure processing document:
> Ingestion HTTP error code 500" indicates that the problem is due to
> communication with Solr.  Apparently certain documents you are
> indexing are causing Solr to return an error code 500, which is an
> "internal server error", and is usually associated with a Solr
> exception.  You will need to diagnose why this is, and take corrective
> steps, in order for your ManifoldCF job to complete successfully.
>
> "Job no longer active" is harmless - it's a side effect of the job
> shutting down.  When a job is shutting down, active document
> processing cannot always be interrupted within a connector, but the
> framework helps it to stop quickly by throwing this exception.
>
> Thanks,
> Karl
>
>
> 2012/3/16 Shigeki Kobayashi (Information Systems Division / Service Planning Department) <shigeki.kobayashi3@g.softbank.co.jp>:
> >
> > I was crawling web sites with links to HTML and PDF files using the
> > provided multiprocess-example agent for a few hours, when Simple History
> > started showing a -104 result code with the message "Interrupted: Job no
> > longer active".
> >
> > After the same error occurred repeatedly, around 40 times, the job status
> > became "Aborting" and then ended up as "Error: Repeated service
> > interruptions - failure processing document: Ingestion HTTP error code 500".
> >
> > The job was interrupted and stopped.
> >
> > Does anyone know what situation causes "Repeated service interruptions"
> > and makes jobs stop?
> > Also, in what circumstances does error status code -104 occur? What does
> > code -104 mean?
> >
> > If you have any ideas, please advise me on how to avoid this error.
> >
> >
> > I am using the following:
> >
> > Solr 1.4 (Extracting Request Handler is set)
> > ManifoldCF 0.4 (multiprocess-example)
> > - Repository connector: WEB
> > - Output connector: Solr
> > Tomcat 6.0.29
> > PostgreSQL 9.1.3
> >
> >
> > Here is MCF’s debug log right before the job was interrupted:
> >
> > DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Attempting to get connection to http://xx.xx.xx.xx:80 (95697 ms)
> > DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Waiting 3895 ms before starting fetch on http://xx.xx.xx.xx:80
> > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Attempting to get connection to http://xx.xx.xx.xx:80 (99593 ms)
> > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Successfully got connection to http://xx.xx.xx.xx:80 (99593 ms)
> > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Waiting for an HttpClient object
> > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Got an HttpClient object after 0 ms.
> > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Get method for '/xx/xx.pdf'
> > DEBUG 2012-03-15 20:04:20,222 (Worker thread '4') - WEB: For http://xx.xx/xx/xx.pdf, setting virtual host to xx.xx
> > DEBUG 2012-03-15 20:04:20,315 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 128 ms.
> > DEBUG 2012-03-15 20:04:20,445 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,509 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,573 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,637 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,701 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,765 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,829 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,893 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:20,957 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:21,021 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:21,085 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:21,149 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:21,213 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > DEBUG 2012-03-15 20:04:21,277 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> >  INFO 2012-03-15 20:04:21,344 (Worker thread '4') - WEB: FETCH URL|http://xx.xx/xx/xx.pdf|1331809460221+1122|-104|65536|org.apache.manifoldcf.core.interfaces.ManifoldCFException|Interrupted: Job no longer active
> > DEBUG 2012-03-15 20:04:21,344 (Worker thread '4') - WEB: Fetch exception for 'http://xx.xx/xx/xx.pdf'
> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Interrupted: Job no longer active
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.noteInterrupted(ThrottledFetcher.java:1735)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:743)
> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
> > Caused by: org.apache.manifoldcf.agents.interfaces.ServiceInterruption: Job no longer active
> >         at org.apache.manifoldcf.crawler.system.WorkerThread$VersionActivity.checkJobStillActive(WorkerThread.java:1223)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:135)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:713)
> >         ... 1 more
> >  WARN 2012-03-15 20:04:21,345 (Worker thread '4') - Pre-ingest service interruption reported for job 1331716457096 connection 'web': Job no longer active
> > DEBUG 2012-03-15 20:04:23,871 (Job reset thread) - Stopped job 1331716457096
> > DEBUG 2012-03-15 20:04:24,236 (Job notification thread) - Found job 1331716457096 in need of notification
>



-- 
*~~~~~~~~~~~~~~~~~~~~**~~~~*
 SoftBank Mobile Corp.
 Information Systems Division
 System Service Business Department
 Service Planning Department

 Shigeki Kobayashi
 shigeki.kobayashi3@g.softbank.co.jp
*~~~~~~~~~~~~~~~~~~~~**~~~~*
