manifoldcf-user mailing list archives

From Shinichiro Abe <shinichiro.ab...@gmail.com>
Subject Re: Internal server error (500) causing a crawl interruption
Date Tue, 07 Oct 2014 01:20:38 GMT
Hi Luca,

Please try setting ignoreTikaException=true in the defaults of the /update/extract request handler in solrconfig.xml:

  <requestHandler name="/update/extract"
                  class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
    <lst name="defaults">
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <bool name="ignoreTikaException">true</bool>
      <str name="uprefix">ignored_</str>
      <str name="captureAttr">true</str>
    </lst>
  </requestHandler>
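
If editing solrconfig.xml is not convenient, it may be worth testing the flag on a single document first. The sketch below is only an illustration: it assumes that ignoreTikaException is also honored when passed as a request parameter (handler defaults can normally be overridden per request), and the helper name extract_url and the document id "doc1" are hypothetical. It builds the URL only and does not contact Solr.

```python
from urllib.parse import urlencode


def extract_url(core_url, doc_id, ignore_tika_exception=True):
    """Build an /update/extract URL that carries the flag as a request parameter.

    core_url is the Solr core base URL, e.g. the one from this thread.
    doc_id is a hypothetical unique key passed as literal.id.
    """
    params = {
        "literal.id": doc_id,
        # Assumption: the handler reads this parameter from the request
        # as well as from the <lst name="defaults"> block.
        "ignoreTikaException": "true" if ignore_tika_exception else "false",
        "commit": "true",
    }
    return core_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # Core URL taken from the error message in this thread.
    print(extract_url("http://vm97lnx:9474/solr/rerweb5", "doc1"))
```

Posting one of the failing .pdf or .rtf files to the resulting URL should show whether Solr still answers with a 500 once the flag is set.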

Regards,
Shinichiro Abe

On 2014/10/06, at 20:15, Karl Wright <daddywri@gmail.com> wrote:

> Hi Luca,
> 
> There is a Solr setting that configures Solr Cell to ignore Tika errors. I don't remember what it is offhand, but you will want to set it so that Tika errors are ignored.
> 
> Thanks,
> Karl
> 
> 
> On Mon, Oct 6, 2014 at 7:08 AM, Basso Luca <LBasso@regione.emilia-romagna.it> wrote:
> Hi Karl,
> 
> We’re using the Web repository connector in conjunction with the Solr output connector to crawl a number of web portals (MCF version 1.6.1). Unfortunately, the crawl job often stops with the following error:
> 
> “Repeated service interruptions – failure processing documents: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status: 500, message: Internal Server Error”.
> 
> From the MCF and Solr logs (included below), it seems that the problem arises from Tika and applies to various types of documents (.rtf, .pdf, etc.).
> 
> How can we fix it?
> 
> Thank you.
> 
> Best regards,
> 
> Luca
> 
> MCF log:
> 
> WARN 2014-10-03 17:00:53,982 (Worker thread '37') - Solr exception during indexing http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> WARN 2014-10-03 17:00:53,985 (Worker thread '37') - Service interruption reported for job 1412340881687 connection 'Webcrawler': Solr exception during indexing http://www.regione.emilia-romagna.it/entra-in-regione/polo-archivistico-regionale/archivio-storico/per-approfondire/BolognaArchivioTerritoriale.rtf/at_download/file/BolognaArchivioTerritoriale.rtf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> ERROR 2014-10-03 17:00:53,998 (Worker thread '37') - Exception tossed: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> Caused by: org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> WARN 2014-10-03 18:05:22,636 (Worker thread '0') - Solr exception during indexing http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> WARN 2014-10-03 18:05:22,638 (Worker thread '0') - Service interruption reported for job 1412252016695 connection 'Webcrawler': Solr exception during indexing http://territorio.regione.emilia-romagna.it/codice-territorio/semplificazione-edilizia/non-rue/dm_9_5_2001.pdf/at_download/file/dm_9_5_2001.pdf (500): Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> ERROR 2014-10-03 18:05:22,649 (Worker thread '0') - Exception tossed: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> Caused by: org.apache.solr.common.SolrException: Server at http://vm97lnx:9474/solr/rerweb5 returned non ok status:500, message:Internal Server Error
> 
> SOLR log:
> 
> 8:05:10,908 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
> 
>        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> 
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> 
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> 
>         at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
> 
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
> 
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
> 
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
> 
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
> 
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
> 
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
> 
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
> 
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
> 
>         at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
> 
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
> 
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> 
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> 
>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
> 
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
> 
>         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
> 
>         at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
> 
>         at java.lang.Thread.run(Thread.java:745)
> 
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@6533a82a
> 
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
> 
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 
>         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> 
>         ... 20 more
> 
> Caused by: org.apache.pdfbox.exceptions.WrappedIOException
> 
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:244)
> 
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1206)
> 
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1171)
> 
>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:124)
> 
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 
>         ... 23 more
> 
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 2047
> 
>         at java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:762)
> 
>         at java.lang.StringBuilder.deleteCharAt(StringBuilder.java:258)
> 
>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1000)
> 
>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
> 
>         at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1241)
> 
>         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:558)
> 
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:188)
> 
>         ... 27 more
> 
> 17:00:42,273 ERROR [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.10.80.97:9474-2) null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
> 
>         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> 
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> 
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> 
>         at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
> 
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
> 
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
> 
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
> 
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
> 
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
> 
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
> 
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
> 
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
> 
>         at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:165)
> 
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
> 
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> 
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> 
>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
> 
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
> 
>         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
> 
>         at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
> 
>         at java.lang.Thread.run(Thread.java:745)
> 
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.rtf.RTFParser@73361285
> 
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> 
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 
>         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> 
>         ... 20 more
> 
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 9
> 
>         at org.apache.tika.parser.rtf.TextExtractor.processControlWord(TextExtractor.java:872)
> 
>         at org.apache.tika.parser.rtf.TextExtractor.parseControlWord(TextExtractor.java:566)
> 
>         at org.apache.tika.parser.rtf.TextExtractor.parseControlToken(TextExtractor.java:492)
> 
>         at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:459)
> 
>         at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:448)
> 
>         at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:56)
> 
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 
>         ... 23 more
> 
> 

