manifoldcf-user mailing list archives

From Basso Luca <LBa...@Regione.Emilia-Romagna.it>
Subject Re: Internal server error (500) causing a crawl interruption
Date Mon, 20 Oct 2014 09:00:49 GMT
Hi Shinichiro,
We found the right configuration just before receiving your suggestion.
Thank you!

However, setting "ignoreTikaException" reduces the problem somewhat but does not resolve it
completely.
Specifically, the problem still persists for some PDF files (not only scanned PDFs and/or
PDFs converted from MS Office documents).
Since the Tika project is not resolving this issue, we suggest bypassing the problem at the
MCF job or output connector level, by means of a specific flag telling the MCF web crawler
to skip "non ok status: 500, message: Internal Server Error" and keep on crawling.
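
To make the request concrete, here is a minimal sketch of the behaviour we have in mind, written
in Python only for brevity. It is not ManifoldCF code: `crawl`, `IndexingError`, `CrawlResult`,
`fake_index` and the `skip_on_500` flag are all hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field


class IndexingError(Exception):
    """Hypothetical error raised when the index server rejects a document."""

    def __init__(self, status: int, message: str):
        super().__init__(f"non ok status: {status}, message: {message}")
        self.status = status


@dataclass
class CrawlResult:
    indexed: list = field(default_factory=list)
    skipped: list = field(default_factory=list)


def crawl(urls, index_document, skip_on_500=False):
    """Index each URL in turn.

    With skip_on_500 set, a 500 response skips that one document;
    without it, the first failure aborts the whole job.
    """
    result = CrawlResult()
    for url in urls:
        try:
            index_document(url)
            result.indexed.append(url)
        except IndexingError as err:
            if skip_on_500 and err.status == 500:
                result.skipped.append(url)  # record it and keep crawling
                continue
            raise  # default behaviour: the job stops here
    return result


def fake_index(url):
    # Stand-in for the output connector: fails on one document only.
    if url.endswith("bad.pdf"):
        raise IndexingError(500, "Internal Server Error")


result = crawl(["a.html", "bad.pdf", "b.html"], fake_index, skip_on_500=True)
print(result.indexed, result.skipped)  # ['a.html', 'b.html'] ['bad.pdf']
```

Keeping the skipped URLs in a separate list means the unparseable documents can still be
reviewed after the job completes.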

Dear Karl, could you add this option in the next MCF release?
Thanks a lot, as ever.

Luca


-----Original Message-----
From: Shinichiro Abe [mailto:shinichiro.abe.1@gmail.com]
Sent: Tuesday, 7 October 2014 03:21
To: user@manifoldcf.apache.org
Cc: user@manifoldcf.apache.org
Subject: Re: Internal server error (500) causing a crawl interruption

Hi Luca,

Please try to configure ignoreTikaException=true.

  <requestHandler name="/update/extract"
                  class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
    <lst name="defaults">
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <bool name="ignoreTikaException">true</bool>
      <str name="uprefix">ignored_</str>
      <str name="captureAttr">true</str>
    </lst>
  </requestHandler>
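
If it helps, the setting can be checked with a short script. This is only a sketch: it assumes
the handler definition has been copied out of solrconfig.xml into a string, and the helper name
`ignores_tika_exceptions` is made up for this example.

```python
import xml.etree.ElementTree as ET

# The handler definition from above, pasted in as a string.
SOLRCONFIG_FRAGMENT = """\
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
  <lst name="defaults">
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <bool name="ignoreTikaException">true</bool>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>
"""


def ignores_tika_exceptions(handler_xml: str) -> bool:
    """Return True if the handler's defaults set ignoreTikaException to true."""
    handler = ET.fromstring(handler_xml)
    flag = handler.find("./lst[@name='defaults']/bool[@name='ignoreTikaException']")
    # Compare against None explicitly: an Element without children is falsy.
    return flag is not None and (flag.text or "").strip().lower() == "true"


print(ignores_tika_exceptions(SOLRCONFIG_FRAGMENT))
```

A missing or misspelled `<bool>` entry makes the check return False, which is the case worth
catching before restarting the crawl.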

Regards,
Shinichiro Abe

On 2014/10/06, at 20:15, Karl Wright <daddywri@gmail.com> wrote:

> Hi Luca,
> 
> There is a Solr setting which configures Solr Cell to ignore Tika errors. I don't remember what it is offhand, but you will want to set it properly to disable Tika errors.
> 
> Thanks,
> Karl
> 
> 
> On Mon, Oct 6, 2014 at 7:08 AM, Basso Luca <LBasso@regione.emilia-romagna.it> wrote:
> Hi Karl,
> 
> We're using the Web repository connector in conjunction with the Solr output connector to crawl a number of web portals (MCF version 1.6.1). Unfortunately, the crawl job often stops with the following error:
> 
> “Repeated service interruptions – failure processing documents: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status: 500, message: Internal Server Error”.
> 
> From the MCF and Solr logs (reported below), it seems that the problem arises from Tika and applies to various types of documents (.rtf, .pdf, etc.).
> 
> How can we fix it?
> 
> Thank you.
> 
>  
> 
> Best regards,
> 
> Luca
> 
>  
> 
> MCF log:
> 
>  
> 
> WARN 2014-10-03 17:00:53,982 (Worker thread '37') - Solr exception during indexing http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> WARN 2014-10-03 17:00:53,985 (Worker thread '37') - Service interruption reported for job 1412340881687 connection 'Webcrawler': Solr exception during indexing http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> ERROR 2014-10-03 17:00:53,998 (Worker thread '37') - Exception tossed: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> Caused by: org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
>  
> 
> WARN 2014-10-03 18:05:22,636 (Worker thread '0') - Solr exception during indexing http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> WARN 2014-10-03 18:05:22,638 (Worker thread '0') - Service interruption reported for job 1412252016695 connection 'Webcrawler': Solr exception during indexing http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> ERROR 2014-10-03 18:05:22,649 (Worker thread '0') - Exception tossed: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> Caused by: org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
>  
> 
> SOLR log:
> 
>  
> 
> 8:05:10,908 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
> 
>        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> 
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> 
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> 
>         at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
> 
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
> 
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
> 
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
> 
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
> 
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
> 
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
> 
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
> 
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
> 
>         at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
> 
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
> 
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> 
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> 
>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
> 
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
> 
>         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
> 
>         at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
> 
>         at java.lang.Thread.run(Thread.java:745)
> 
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
> 
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
> 
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 
>         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> 
>         ... 20 more
> 
> Caused by: org.apache.pdfbox.exceptions.WrappedIOException
> 
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
> 
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1206)
> 
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1171)
> 
>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:124)
> 
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 
>         ... 23 more
> 
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 2047
> 
>         at java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:762)
> 
>         at java.lang.StringBuilder.deleteCharAt(StringBuilder.java:258)
> 
>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1000)
> 
>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
> 
>         at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1241)
> 
>         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:558)
> 
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:188)
> 
>         ... 27 more
> 
>  
> 
> 17:00:42,273 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
> 
>         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> 
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> 
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> 
>         at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
> 
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
> 
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
> 
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
> 
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
> 
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
> 
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
> 
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
> 
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
> 
>         at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
> 
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
> 
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> 
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> 
>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
> 
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
> 
>         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
> 
>         at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
> 
>         at java.lang.Thread.run(Thread.java:745)
> 
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
> 
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> 
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 
>         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> 
>         ... 20 more
> 
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 9
> 
>         at org.apache.tika.parser.rtf.TextExtractor.processControlWord(TextExtractor.java:872)
> 
>         at org.apache.tika.parser.rtf.TextExtractor.parseControlWord(TextExtractor.java:566)
> 
>         at org.apache.tika.parser.rtf.TextExtractor.parseControlToken(TextExtractor.java:492)
> 
>         at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:459)
> 
>         at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:448)
> 
>         at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:56)
> 
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 
>         ... 23 more
> 
> 
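
Since the two traces above fail in different parsers, it may help to tally the innermost
"Caused by" of each Solr log entry to see which parser failures dominate. The following is a
rough sketch, assuming each entry begins with a "HH:MM:SS,mmm ERROR" line as in the excerpts
above; `root_causes` is a hypothetical helper, and the sample text is abbreviated from the
log in this thread.

```python
import re
from collections import Counter

ENTRY_START = re.compile(r"^\d{1,2}:\d{2}:\d{2},\d{3} ERROR", re.MULTILINE)
CAUSED_BY = re.compile(r"^Caused by: ([\w.$]+)", re.MULTILINE)


def root_causes(log_text):
    """Count the deepest 'Caused by' exception class of each log entry."""
    # Drop e-mail quote markers and indentation so lines start with log content.
    text = "\n".join(line.lstrip("> ").rstrip() for line in log_text.splitlines())
    starts = [m.start() for m in ENTRY_START.finditer(text)]
    entries = [text[a:b] for a, b in zip(starts, starts[1:] + [len(text)])]
    counts = Counter()
    for entry in entries:
        causes = CAUSED_BY.findall(entry)
        if causes:
            counts[causes[-1]] += 1  # the last "Caused by" is the innermost cause
    return counts


# Abbreviated from the Solr log quoted in this thread (stack frames omitted).
SAMPLE = """\
> 8:05:10,908 ERROR [org.apache.solr.servlet.SolrDispatchFilter] null:org.apache.solr.common.SolrException
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException
> Caused by: org.apache.pdfbox.exceptions.WrappedIOException
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 2047
> 17:00:42,273 ERROR [org.apache.solr.servlet.SolrDispatchFilter] null:org.apache.solr.common.SolrException
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 9
"""

for exception, count in root_causes(SAMPLE).items():
    print(count, exception)
```

For the two entries here the tally separates the PDFBox hex-string failure from the RTF
control-word failure, which suggests two distinct parser bugs rather than one.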
